
Delivering software faster – How to build a scalable build system for a large monorepo


Part 5 of 5
Dan Cohn, Sabre Labs

The previous posts in this series explored ways to improve developer productivity by using: 

  • a (Git-based) monorepo, 
  • a common build tool (Bazel), and 
  • a containerized development environment.

The pervading theme is consistency. When everyone has the same set of tools and follows roughly the same software development process, it becomes feasible to roll out best practices across a large organization. You can execute wide-scale refactoring and apply centralized optimizations. You gain economies of scale. Consistency also facilitates internal mobility across projects and teams. 

Use of a shared monorepo and common tools sets the stage for perhaps the most significant benefit of all: a unified build system. Every software artifact from small to large requires a build system or, more precisely, a Continuous Integration (CI) system. Historically, each Sabre dev team managed its own unique CI process. Over time, teams converged on a common platform (Jenkins), with many projects leveraging a common set of pipeline scripts and libraries. Nonetheless, every product (and microservice) persisted in its own DevOps microcosm, with dedicated systems and engineers maintaining those systems. Projects that started out using the same CI pipeline still ended up diverging from one another because each was a separate copy of the original.

I want to make an important distinction between using a common blueprint and sharing an identical set of tools and code. It’s the difference between having one large, centralized factory churning out new versions of software and many smaller factories trying to do the same thing. At first glance, the distributed model may seem more flexible and agile. That is, until you want to modernize, streamline, or add new capabilities to your factories. Sure, you can continually update the blueprints, but that will only benefit the new factories. How do you fix a pervasive issue or improve the efficiency of your existing factories? The only answer is one by one, and this tends to be prohibitively expensive. As a result, you apply a patchwork of improvements over time, causing the factories (i.e., build systems) to diverge and become more and more difficult to maintain.

CI as a service 

At Sabre we call the single-factory model “CI as a Service.” Every application and microservice that resides in the Sabre monorepo relies on a common build system. Every line of code and documentation that supports CI as a Service lives in the same monorepo. When we add a new CI feature, it rolls out to everyone at once. When there’s a defect or vulnerability, we can resolve it in one place.

Of course, creating and supporting a unified CI system has many challenges. One of the first and most complex is the matter of deciding which apps/services to build. This is straightforward in a many-repo environment. Whenever a new commit comes along, you simply rebuild every app in the repo. This is typically just one app or a small set of interrelated services — not so in a monorepo. Sabre’s nascent monorepo already houses over 600 buildable units of software. On the main branch alone, we have 50-100 new commits each weekday. If we ran every CI job for each new commit, we would have roughly 45,000 builds and deployments per day, the vast majority of which would be unnecessary. (And remember, this is just the main branch. Updates occur on thousands of topic branches.) 

Build dispatcher 

Determining which builds to run is the job of the Build Dispatcher. This is a special pipeline job that examines each new commit on a specific branch and decides which CI jobs to dispatch based on the set of changes in each commit. It also creates new CI jobs as new services show up in the repo. (In this context, a “service” is a buildable unit of code and may be an application, microservice, infrastructure, or even documentation.) 

You may now be wondering: how do we know what constitutes a service? I’m glad you asked! Each service is specified by a custom Bazel rule that defines various pieces of information that CI requires. This includes, among other things, lists of software targets to build, unit tests to run, and cloud environments in which to deploy. The Build Dispatcher is most interested in the build targets, since it can translate them into a set of related source files.

Here is a simplified example: 

service(
    name = "my-java-service",
    build_targets = [":springboot_service"],
    deployment_paths = { … },
    deployment_strategy = "…",
    deployment_type = "container",
    display_name = "My Java Service",
    group_id = "examples",
    service_owner = "developer.name@sabre.com",
    unit_test_tags = ["unit"],
)

One of the reasons we use Bazel to define units of software is that it already knows what sources are required to compile any given executable or other type of output. This is called the build graph. With Bazel you can simply query the build graph to find out all the source code and dependencies for a given service. You can also run a reverse dependency query to find out which build targets are affected by a particular file in the repo. This includes both direct and indirect dependencies. 
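For illustration, here is roughly what those two kinds of queries look like on the command line (the target and file labels are hypothetical, not actual paths in our repo):

# Forward query: every target (including source files) that the service's
# build target depends on, directly or transitively.
bazel query 'deps(//examples/my-java-service:springboot_service)'

# Reverse query: every target in the repo affected by a change to one
# source file, again including indirect dependents.
bazel query 'rdeps(//..., //shared/logging:Logger.java)'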

In essence what the Build Dispatcher must do is figure out which service or services rely on the files contained in a change set (one or more commits). For various reasons, we use a forward dependency query to produce lists of files required by each service. This includes not only the source files associated with the service itself but also any dependent files in the repo such as shared libraries. Once the Dispatcher knows which files each service requires, it matches this up with the list of added, changed, and deleted files. Where there’s a match, there’s a CI job to run. 
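Here is a minimal sketch of that matching step (the service label, file names, and the OLD_SHA/NEW_SHA variables are hypothetical):

# Source files the service's build depends on, converted from Bazel labels
# (//pkg/path:file.java) to repo-relative paths (pkg/path/file.java).
bazel query 'kind("source file", deps(//examples/my-java-service:springboot_service))' \
    --output=label \
    | sed -e 's|^//||' -e 's|:|/|' | sort > required_files.txt

# Files added, changed, or deleted in the change set.
git diff --name-only "$OLD_SHA" "$NEW_SHA" | sort > changed_files.txt

# Any overlap means this service needs a CI run.
if comm -12 required_files.txt changed_files.txt | grep -q .; then
    echo "Dispatch CI job for my-java-service"
fi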

It seems simple on paper, but the process is a fair bit more complicated. For example, many services require a Dockerfile that CI uses to build a deployable container image. Whenever someone updates this file, we want to rebuild the service. But Bazel doesn’t recognize the Dockerfile as a dependency because it isn’t a source file for the application. Therefore, our “service” must either list the Dockerfile as an explicit dependency, or the Build Dispatcher must be smart enough to recognize that certain types of services need a Dockerfile. 

Then you have files like “WORKSPACE,” which holds the coordinates of various Bazel rules and shared dependencies, and “service.bzl,” which defines our custom “service” rule. What do you do when one of these changes? Do you trigger CI jobs for every service in the monorepo? It’s ultimately a trade-off between risk and cost. You can err on the side of caution and rebuild everything under the sun, thereby ensuring that the change has no unexpected side effects (i.e., broken builds or functionality). Or you can rely on manual testing and save on cloud compute and related costs.

Scaling considerations – part 1 

As you might imagine, the Build Dispatcher is a potential bottleneck for the entire CI and CD (Continuous Delivery) pipeline. The longer it takes for the dispatcher to complete its job, the more time elapses between code commit and deployment. Delays are particularly irritating when a CI job fails, requiring someone to debug the issue and submit a fix. Waiting time can become costly.

We don’t have the perfect solution (yet!) but have found numerous ways to streamline the process. For starters, there need not be only one Dispatcher. Every branch and/or pull request can have its own Dispatcher job, and these can easily run in parallel. Furthermore, when there are back-to-back commits on a branch – from a single push or several PR merges within a short span of time – the Dispatcher analyzes multiple commits as a batch rather than one at a time. Once it identifies the CI jobs that need to run, it isn’t difficult to reverse engineer which services were affected by which commit. (This is important for traceability and status notifications.) 

The analysis phase itself is more of a challenge. Multi-service analysis seems like a highly parallelizable task, and it could be, but there are limitations. Bazel sequences all client requests through a single server process. Although builds are multi-threaded, Bazel handles build graph queries one by one. Fortunately, Bazel has a powerful query language with “union” operators that enable many queries to be joined into one larger request. Bazel has no difficulty processing large queries, and it does so in much less time than it takes to process multiple smaller queries. We leverage this to perform broad searches of the dependency graph first and narrow down from there to identify individual services affected by a particular change set. 
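For example, rather than issuing one deps() query per candidate service, the Dispatcher can combine them into a single request (the labels here are hypothetical):

# One broad request covering several services at once...
bazel query 'deps(//examples/svc-a:image) union deps(//examples/svc-b:image) union deps(//examples/svc-c:image)'

# ...is noticeably faster than paying the per-query overhead three times:
bazel query 'deps(//examples/svc-a:image)'
bazel query 'deps(//examples/svc-b:image)'
bazel query 'deps(//examples/svc-c:image)'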

Bazel also has a “warm up” period during which it loads libraries, analyzes files, and constructs the build graph. Once the first query is done, subsequent operations are significantly faster. For this reason, it’s best to keep the Bazel server running between jobs rather than starting fresh every time a new commit shows up. Of course, this has a cost/benefit tradeoff as well. 
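One knob for this is Bazel’s “--max_idle_secs” startup option, which controls how long the server stays resident after its last request (the value below is arbitrary, and the target label is hypothetical):

# Keep the Bazel server alive for an hour of idle time so back-to-back
# Dispatcher runs skip the warm-up phase.
bazel --max_idle_secs=3600 query 'deps(//examples/my-java-service:springboot_service)'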

Scaling considerations – part 2 

So far, we’ve only talked about the Build Dispatcher. Most of the work of CI happens in the “build jobs” themselves. How do you distribute these jobs to make the best use of whatever compute infrastructure you may have? The question contains a big part of the answer: the more you distribute the work, the better.

There are many approaches to CI scaling. We’ve found that the combination of Jenkins and Kubernetes is surprisingly scalable without breaking the bank. Jenkins has its limitations (and detractors), but it’s a good fit for us due to its widespread adoption within Sabre and an extensive array of freely available plug-ins. We’ve gone so far as to enhance some of these plug-ins to work better with a large monorepo. 

Each Jenkins job has its own dedicated agent (or agents). An agent is composed of multiple containers that run different portions of the CI pipeline in sequence or even in parallel. Through the magic of Kubernetes, these agent pods execute on a scalable cluster of compute nodes. The same cluster may be used for other workloads as well, such as deployed applications and automated tests.

A drawback of Jenkins is that it allows for only a single controller node per instance (with the non-enterprise edition). This presents both a reliability and a scalability challenge. Nevertheless, we are successfully running thousands of jobs on a single instance, with plans to distribute the load over multiple instances in the not-too-distant future. We’ve learned how to wrestle Jenkins into submission with a variety of tricks such as:

  • Automated housekeeping jobs (e.g., to remove obsolete or outdated jobs) 
  • Discarding (or archiving) old build artifacts and history 
  • Web hooks for triggering Build Dispatcher jobs based on Git events 
  • Disabling health metrics in the UI 
  • Weekly maintenance 

Scaling considerations – part 3 

If you’re familiar with monorepos or read my earlier post about coaxing Git into working great with a monorepo, you know that cloning a large Git repository can be cumbersome if not done “correctly.” The same applies to CI. Imagine hundreds of CI jobs downloading the entire content of a multi-gigabyte repo multiple times per hour. Not only would this be extremely slow and consume a lot of disk space, but it would also put quite a strain on the network and Git servers. 

Thankfully, there are smarter ways to clone and check out files from a Git repo. The goal is to minimize the number of objects downloaded and files extracted from those objects. As mentioned in the previous post, you can accomplish this with the following (a combined example appears after the list):

  • partial clones (also known as filtered clones), 
  • sparse checkouts, 
  • shallow clones (minimal commit depth), and 
  • minimal branch references. 
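Put together, a minimal CI clone along those lines might look like this (the URL, branch name, and directory paths are placeholders):

# Partial (blob-filtered), shallow, single-branch clone with no checkout yet.
git clone --filter=blob:none --depth=1 --single-branch --branch "$BRANCH" \
    --no-checkout https://git.example.com/monorepo.git
cd monorepo

# Restrict the working tree to the directories CI actually needs,
# then materialize the files.
git sparse-checkout init --cone
git sparse-checkout set ci tools/bazel shared/libs
git checkout "$BRANCH"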

For CI jobs this takes a somewhat different form from everyday code checkouts. For instance, how do you know which directory paths to include in the sparse checkout? How do you determine the appropriate commit depth? Which branch references are required? 

Our approach to sparse checkouts is to begin with a limited set of directories such as those containing scripts required by the CI pipeline itself, Bazel rules and macros, and shared libraries. We then use a familiar tool – the Bazel query – to incrementally expand the sparse checkout to include all directories required to build a particular service and its dependencies. 
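Here is a rough sketch of that expansion, assuming the BUILD and .bzl files are already visible to Bazel (for example, via the pattern-based checkout described below) and using a hypothetical target label:

# Ask Bazel which packages the service's build reaches (ignoring external
# repositories), then add those directories to the sparse checkout.
bazel query 'deps(//examples/my-java-service:springboot_service)' --output=package \
    | grep -v '^@' \
    | xargs git sparse-checkout add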

The Build Dispatcher works differently since it needs a view of all services in the repo. To accomplish this, it uses sparse checkout patterns like “*.bzl” and “*.bazel” in addition to directory paths. But what about the rest of the files? How can we figure out which source files belong to each service if they aren’t visible to the Dispatcher? For this we use a clever trick. Rather than checking out the entire repo, we create placeholders (with the command touch “path/to/file”) for the files included in a particular change set (i.e., one or more commits). After all, no other files are important to the analysis. This approach has the bonus of handling deleted files appropriately. For example, suppose someone accidentally deletes a source file required by one or more services. We want the associated CI job or jobs to flag the problem.
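A sketch of the Dispatcher’s checkout and placeholder step (the patterns, paths, and SHAs are illustrative, and non-cone patterns require a reasonably recent version of Git):

# Pattern-based sparse checkout: CI scripts plus every Bazel file in the repo.
git sparse-checkout set --no-cone '/ci/' '*.bzl' '*.bazel' 'BUILD' 'WORKSPACE'

# Create empty placeholders for the files in the change set so Bazel queries
# can resolve them without downloading their contents. Deleted files get a
# placeholder too, so the services that depend on them are still dispatched
# (and their builds will flag the missing file).
git diff --name-only "$OLD_SHA" "$NEW_SHA" | while read -r path; do
    mkdir -p "$(dirname "$path")"
    touch "$path"
done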

Shallow clones are another interesting challenge for CI. The Dispatcher needs enough commit depth to analyze any new commits pushed since the last time it ran. A depth of 100 is generally sufficient in this case. Regular build jobs can get by with a depth of 1 because they only need to see the “current” code. But what about pull requests? This is an oft-overlooked aspect of CI. When running CI against a PR, it’s a good idea to merge the source and destination branches to simulate how a set of changes will behave once integrated with the latest code. This is a key facet of continuous “integration.” To do so, we need visibility back to the commit from which the branch was created. This could be 5, 15, 150, or 5000 commits depending on the age of the source branch and activity on the destination branch. Our solution is to start with a modest commit depth (for PRs) and increase it until we have sufficient history to perform the merge.
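One way to implement that incremental deepening for a PR build (the branch names and depths are placeholders):

# Fetch the destination branch shallowly, then deepen both branches until a
# common ancestor (merge base) becomes visible. Cap the loop so a badly
# diverged branch falls back to a full fetch instead of looping forever.
git fetch --depth=50 origin "$TARGET_BRANCH"
tries=0
until git merge-base HEAD "origin/$TARGET_BRANCH" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -gt 10 ]; then
        git fetch --unshallow origin
        break
    fi
    git fetch --deepen=100 origin "$TARGET_BRANCH" "$SOURCE_BRANCH"
done

# Trial merge: CI builds and tests the result of integrating the PR.
git merge --no-edit "origin/$TARGET_BRANCH"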

Minimizing branch refs is more straightforward. For branches without a PR, all you need is a single branch ref (for the source branch). For pull request branches, you need a second ref for the destination. Both cases are more efficient than downloading thousands of superfluous branch refs. 

Tying it all together 

In the end, I would say that the effort of building and maintaining CI as a Service is a small price to pay for the value it brings in terms of efficiency, usability, and even observability and security (which I didn’t have time to touch on). Combine a monorepo with a standardized development environment, common build tool, and unified CI/CD, and you have a potent platform for fast, reliable software delivery. 
