Master-Stable Big Tech Tightrope
The goal of this article is to outline how commits, code management, and builds work at big tech companies. It does not apply to small companies.
Terms
- Master branch: The main branch where all developers commit their changes.
- Stable branch: A separate “virtual” branch used for building artifacts and checking the overall stability of all apps in the repo.
- Artifacts: Pre-built components or libraries that can be pulled from, e.g., the network instead of being built locally.
Setup
- Thousands of engineers collaborating within a single monorepo.
- The commit frequency can be high, often exceeding 50 commits per minute.
- Complex dependency graphs where each change necessitates running a multitude of tests.
Given this scale and complexity, it becomes infeasible to maintain accurate artifacts and ensure correctness directly on the master branch. The computational cost and time required to properly verify every single commit are prohibitive.
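To get a rough feel for the cost, here is a back-of-the-envelope sketch. The commit rate comes from the setup above; the 30-minute figure for a full build-and-test run is purely a hypothetical assumption for illustration, not a measurement from any real company.

```python
# Back-of-the-envelope estimate of verifying every commit on master.
COMMITS_PER_MINUTE = 50      # from the setup above
FULL_VERIFY_MINUTES = 30     # hypothetical duration of one full build + test run


def concurrent_runs(commits_per_minute: int, verify_minutes: int) -> int:
    """Steady-state number of verification runs in flight at once
    (arrival rate multiplied by the duration of each run)."""
    return commits_per_minute * verify_minutes


def runs_per_day(commits_per_minute: int) -> int:
    """Total number of full verification runs started in 24 hours."""
    return commits_per_minute * 60 * 24


if __name__ == "__main__":
    print(concurrent_runs(COMMITS_PER_MINUTE, FULL_VERIFY_MINUTES))  # 1500
    print(runs_per_day(COMMITS_PER_MINUTE))                          # 72000
```

Under these assumed numbers, roughly 1,500 full builds would have to run in parallel at all times, which is the kind of cost the stable approach avoids.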
Stable Approach
- Engineers continuously commit to the master branch
- Periodically (typically when the latest “stable” build is marked as healthy), a new stable build is triggered from the most recent master commit
- All applications within the repository are built, and corresponding artifacts are generated for that stable branch
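The steps above can be sketched as a toy model. This is a minimal illustration, not how any particular company implements it: the `Repo` class, `try_advance_stable`, and the `build_and_test` callback are all hypothetical names, and a real system would build every application and publish artifacts rather than call a single function.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Repo:
    """Toy model: master is an ordered list of commit ids,
    stable is an index pointing at the last healthy stable build."""
    master: List[int] = field(default_factory=list)
    stable_index: Optional[int] = None

    def commit(self, commit_id: int) -> None:
        """Engineers land changes on master continuously."""
        self.master.append(commit_id)

    def try_advance_stable(self, build_and_test: Callable[[int], bool]) -> bool:
        """Try a stable build at the tip of master.

        Returns True and moves the stable pointer if the build is healthy.
        Returns False if there is nothing new to build or the build fails,
        in which case the pointer stays where it was (stable is blocked).
        """
        if not self.master:
            return False
        candidate = len(self.master) - 1
        if candidate == self.stable_index:
            return False
        if build_and_test(self.master[candidate]):
            self.stable_index = candidate
            return True
        return False

    def stable_commit(self) -> Optional[int]:
        """Commit id the stable pointer currently refers to, if any."""
        return None if self.stable_index is None else self.master[self.stable_index]
```

A failed build leaves `stable_index` untouched, so engineers keep building from the previous healthy stable commit while master continues to move ahead.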
Engineers usually build from stable commits, or from commits that do not deviate much from one. Suppose, for instance, that stable builds exist at commits 6 and 10. When you check out commit 8, you cannot use the cache from commit 10, because that commit is newer than your checkout; the closest usable cache comes from commit 6. This results in cache misses, as some libraries may have changed between commits 6 and 8.
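The cache reasoning can be made concrete with a small sketch. It assumes commits are numbered linearly and that stable builds (and therefore cached artifacts) exist at commits 6 and 10; both function names are hypothetical, and the "distance" is only a crude stand-in for how much may need to be rebuilt locally.

```python
from bisect import bisect_right
from typing import List, Optional


def nearest_stable_at_or_before(stable_commits: List[int], checkout: int) -> Optional[int]:
    """Return the most recent stable commit that is not newer than `checkout`.

    `stable_commits` must be sorted in ascending order.
    Returns None if every stable commit is newer than the checkout.
    """
    i = bisect_right(stable_commits, checkout)
    return stable_commits[i - 1] if i else None


def commits_since_stable(stable_commits: List[int], checkout: int) -> Optional[int]:
    """Number of commits between the usable stable cache and the checkout.

    0 means a full cache hit; larger values mean more potential cache misses.
    """
    base = nearest_stable_at_or_before(stable_commits, checkout)
    return None if base is None else checkout - base


if __name__ == "__main__":
    stable = [6, 10]                         # assumed stable commits
    print(nearest_stable_at_or_before(stable, 8))   # 6
    print(commits_since_stable(stable, 8))          # 2
    print(commits_since_stable(stable, 10))         # 0
```

Checking out commit 10 directly gives a full cache hit, which is why staying on or near a stable commit keeps local builds fast.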
Regarding correctness and quality assurance, while we could run tests on individual commits (e.g., 1, 2, 3), this approach might miss issues that arise from the interaction of multiple commits. For example, while commits 1 and 3 might pass tests individually, their combination could introduce unexpected behavior. The stable build process allows us to catch these interaction-based issues, enhancing overall system reliability.
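A toy example of such an interaction, under the assumption that each change was tested against the base it was written on rather than against the combined result. The codebase model, the function names `old_api` and `new_api`, and both changes are hypothetical: one change renames a function and updates all existing callers, the other adds a new caller of the old name.

```python
from typing import Dict, Set

# A codebase is modelled as the set of defined functions and the set of called ones.
Codebase = Dict[str, Set[str]]


def passes(cb: Codebase) -> bool:
    """The 'test suite': every called function must be defined."""
    return cb["called"] <= cb["defined"]


def rename_api(cb: Codebase) -> Codebase:
    """Change A: rename old_api to new_api and update all existing callers."""
    return {
        "defined": (cb["defined"] - {"old_api"}) | {"new_api"},
        "called": {"new_api" if name == "old_api" else name for name in cb["called"]},
    }


def add_old_caller(cb: Codebase) -> Codebase:
    """Change B: add a new call site that still uses old_api."""
    return {
        "defined": set(cb["defined"]),
        "called": cb["called"] | {"old_api"},
    }


if __name__ == "__main__":
    base: Codebase = {"defined": {"old_api"}, "called": {"old_api"}}
    print(passes(rename_api(base)))                  # True: A alone is fine
    print(passes(add_old_caller(base)))              # True: B alone is fine
    print(passes(add_old_caller(rename_api(base))))  # False: A and B together break
```

A build of the combined state, which is what a stable build exercises, is the first place the breakage becomes visible.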
In large tech companies, the interval between stable builds can range from minutes to hours. However, this period can extend to 24 hours or more if the stable branch becomes blocked due to issues.
Conceptually, you can visualize master and stable as two moving pointers. The ideal scenario is to keep the stable pointer as close to the master as possible. This can be achieved through several strategies:
- Ensuring the stable branch remains healthy at all times
- Optimizing and reducing build times
- Carefully curating the test suite so that non-critical tests do not block the master branch
This approach allows for a balance between rapid development (on the master branch) and ensuring a consistently stable and reliable codebase (via the stable branch).
Unblocking stable
In the diagram, we can imagine a scenario where commit 8 has introduced a regression that wasn’t caught by the existing test suite, causing the stable build to fail. The build engineer responsible for monitoring the stable process must perform a binary search to isolate the offending commit (potentially 7, 8, or 9). Once identified, the problematic change is typically reverted or “unlanded” from the stable branch. This action allows the stable build process to resume, effectively moving the stable pointer forward and maintaining the integrity of the build pipeline.
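The binary search can be sketched as follows. This is a minimal illustration that assumes the breakage is monotonic (once a commit breaks the build, every later commit is broken too) and that the last candidate is already known to be bad; `first_bad_commit` and the `is_healthy` callback are hypothetical names standing in for a real build-and-test run at a given commit.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[int], is_healthy: Callable[[int], bool]) -> int:
    """Find the first unhealthy commit among ordered candidates.

    `commits` are the commits after the last healthy stable build, oldest first.
    The last one is assumed to be known bad, so it is never rebuilt.
    """
    if not commits:
        raise ValueError("need at least one candidate commit")
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy(commits[mid]):
            lo = mid + 1   # breakage was introduced after mid
        else:
            hi = mid       # mid is broken, so the culprit is mid or earlier
    return commits[lo]


if __name__ == "__main__":
    # Candidates 7, 8, 9; in this scenario commit 8 introduced the regression.
    culprit = first_bad_commit([7, 8, 9], is_healthy=lambda c: c < 8)
    print(culprit)  # 8
```

Each probe costs a full build, so halving the candidate range matters: the number of builds grows with the logarithm of the number of commits landed since the last healthy stable. If the monotonicity assumption does not hold (for example with flaky tests), the search can point at the wrong commit.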
A blockage may keep stable behind master for up to 24 hours. Some companies have auto-unland tooling, but it may fail to find the correct commit to unland in some cases.
When the stable branch remains blocked for more than 24 hours, it’s usually a critical situation. In such cases, engineering teams often initiate:
- A high-priority incident response or cross-functional (XFN) effort to diagnose and resolve the issue
- Escalation to senior engineering or a chief sheriff of some kind
- A post-mortem or review to ensure measures are in place to prevent similar issues from happening in the future
The goal is always to unblock the stable branch as quickly as possible so that engineers can work off stable and enjoy fast build times while correctness is taken care of.