Git: learn how to manage your code
If you are a developer, chances are that sooner or later you will start working with lots of files, and if you distribute your code, chances are you will start working with lots of people too. When one or both of these things happen, a question will arise: “How do I manage it?” Let’s talk about VCSs and Git.
Too big project, too many files
Small projects may contain a few dozen files; bigger ones may include tens of thousands. Managing all these files is complicated in itself, but what happens when you make a mistake? You suddenly realize that you have messed up an important portion of your code, and that your backup (I hope you made one) is buried somewhere in your filesystem. In the worst case you have nothing. You have officially lost many hours changing your code for the worse. On top of that, you will have to spend more hours fixing the damage.
One developer, ten developers
The previous situation is frightening enough as it is. But what happens when several developers write to that very same code? Two of the developers might decide to work on the same feature and modify the same lines; each one wants to get his code incorporated into the main project, but that’s not possible since they modified the very same things in different ways. Things can’t work like this, and the two developers have wasted time working on something that could have been developed by only one of them. Version Control Systems were invented to solve exactly these two problems.
Version Control Systems
A Version Control System (VCS) is a system that keeps track of versions. Each version represents the state of the project at a defined time. Each version contains either the changes you made to the code since the last version or a full snapshot of the code. Let’s now look again at the previous two problems:
- You realize you’ve made a mistake and a previous version is better than the new one. That is not a problem: using a VCS you can roll back to a previous version in a matter of seconds, without losing anything.
- “Developer A” and “Developer B” both start working on the “add salt” feature, but neither knows the other is doing the same. “Developer A” comes back with the “add salt” feature done; around the same time “Developer B” comes back with the “add salt” feature done too, but he has also created the “add pepper over salt” feature. “Developer A” has created a slightly better version of “add salt” than “Developer B”. With a VCS that is not a problem. Developer A’s “add salt” feature is included in the main project; “Developer B” looks at the code from “Developer A”, discards his own “add salt” and modifies his “add pepper over salt” to fit Developer A’s version. Both get their code incorporated, effectively cooperating (next time let’s hope they use a public forum to announce they are working on that very same feature).
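To make the first point concrete, here is a minimal sketch of the "full snapshot" approach in Python. The `ToyVersionStore` class and the file names are hypothetical and exist only for illustration; a real VCS stores its history on disk and far more efficiently.

```python
import copy


class ToyVersionStore:
    """Hypothetical in-memory store that keeps one full snapshot per version."""

    def __init__(self):
        self.versions = []  # list of (message, snapshot) tuples

    def commit(self, files, message):
        # Store a full copy of the project so later edits cannot alter history.
        self.versions.append((message, copy.deepcopy(files)))
        return len(self.versions) - 1  # the new version number

    def checkout(self, version):
        # Return a copy of the project as it was at the given version.
        _, snapshot = self.versions[version]
        return copy.deepcopy(snapshot)


store = ToyVersionStore()
project = {"recipe.txt": "boil water\n"}
v0 = store.commit(project, "first working version")

project["recipe.txt"] = "boil watre\nadd slat\n"  # a change for the worse
v1 = store.commit(project, "add salt (broken)")

project = store.checkout(v0)  # roll back to the earlier version
print(project["recipe.txt"])
```

Nothing is lost by rolling back: the broken version is still stored as version `v1`, so it can be inspected or repaired later.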
Centralized VS Distributed VCS
The first VCSs used a centralized approach to the problem. There is a central server that holds all the versions and serves many clients; each time the main project is changed, the clients have to be notified. The principal problem, however, is the single point of failure:
- If the server of a centralized VCS goes down, no one can access the main project or contribute to it.
- If the server’s storage medium gets corrupted or breaks, backups are needed to fix things up.
The next iteration of VCS is called Distributed VCS. These systems aimed to fix the two problems outlined above. In the centralized approach only the server holds all the versions. In the distributed approach every single client can hold all the versions. Each client knows the full history of the project and acts as a form of “backup” for the project. Even if the “central” server goes down, the clients can still operate and make changes with the whole project history at hand. And if the “central” server of the project is damaged, one of the many clients can step in and offer its “backup” to the project’s authors.
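The idea can be sketched in a few lines of Python. This is only a toy model under simple assumptions: the history is a plain list of commit messages, and the names `server_history` and `clones` are made up for the example.

```python
import copy

# The "central" server holds the full history of the project.
server_history = ["initial commit", "add salt", "add pepper over salt"]

# Cloning copies the entire history, not just the latest version.
clones = {
    "developer_a": copy.deepcopy(server_history),
    "developer_b": copy.deepcopy(server_history),
}

# The server's storage is lost...
server_history = None

# ...but Developer A can keep working locally with the whole history at hand,
clones["developer_a"].append("fix typo in recipe")

# and any clone can be used to seed a replacement server.
server_history = copy.deepcopy(clones["developer_b"])
print(server_history)
```

The restored server holds everything Developer B had cloned; Developer A's newer local work is shared again once the server is back.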
Git is a project started in 2005 by Linus Torvalds (yes, the same person who created Linux). Linus didn’t really like the VCSs that were around at that time (the story is a bit more complicated), so he started a new one to support the growth of his main project: Linux. Git is a distributed VCS that allows developers to store versions of their code, called commits, in a local database called a repository, which is stored in a subdirectory of the project called .git. Git uses checksums (SHA-1 hashes by default) to ensure integrity and to identify commits. Each commit records the name and email address of its author, and can optionally be signed with a GPG or SSH key. Each repository contains one or more branches, which allow parallel lines of development (typically one per feature) to live in the same repository. The local directory where you modify files is called the working directory. Changes made in the working directory belong to that developer alone and are not affected by other developers until they are shared. Git also supports a pseudo-centralized workflow built on push (send my commits to the main project) and pull (fetch the new commits from the server and merge them into my local branch).
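To give an idea of how checksums identify content, the sketch below reproduces the ID Git gives to a file's contents (a "blob") in a default SHA-1 repository: the hash of a short header followed by the bytes of the file. The helper name `git_blob_id` is made up for this example. Commits are identified in a similar way, but their hashed content also includes the author, the message and the parent commits, which this sketch does not cover.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a file's contents."""
    # Git hashes the header "blob <size in bytes>\0" followed by the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The well-known ID of an empty file:
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Changing a single character produces a completely different ID,
# which is how corruption or tampering gets noticed.
print(git_blob_id(b"add salt\n"))
print(git_blob_id(b"add salt!\n"))
```

If Git is installed, the result can be compared with the output of `git hash-object` on a file with the same contents.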
Git is NOT GitHub
Beginners often think that Git and GitHub are the same thing. That’s not true at all! GitHub is a commercial site that provides a platform for developers built on top of Git. Git is independent of GitHub and can be used without a GitHub account. As a matter of fact you can use Git without any other tool, but your experience will be significantly better if you pair it with a hosting platform like GitHub or GitLab.
Great, where do I start?
Starting with Git can be a bit painful, so don’t feel down if you don’t get everything at first. In the next few weeks I will write a thorough guide to help beginners take their first steps in the Git world.