Fixing the build queue bottleneck at JetBrains

At JetBrains, we use a single TeamCity installation to build all of our products – including giants like IntelliJ IDEA, PyCharm, and Kotlin, as well as hundreds of plugins and a number of internal services, such as our website.

In total, this amounts to 7,500 projects, 50,000+ build configurations (jobs), and around 65,000 builds per day on average. This is all handled by 2,000+ build agents (though this number isn't static, as many of them are launched on demand and run builds in hosted environments such as AWS).

All of that is handled by our internal CI/CD server called buildserver, which runs as a multi-node setup orchestrated by TeamCity. It's all standard TeamCity functionality, except that it receives new features and updates daily, as we use it for internal dogfooding.

At some point, this scale started posing difficulties for us: we noticed that during peak hours, newly triggered builds were sitting in the queue for half an hour or more and basically didn't start at all unless you manually moved them to the top.

Meanwhile, messages from JetBrains developers were piling up in the teamcity-buildserver Slack channel:

Folks, my build has been in the queue for more than 2 hours already. I see that there are similar issues above with reaching limits, but I don't see such messages, just No estimate yet.

Builds seem to keep piling up in the queue and taking longer and longer, are we heading towards a reboot?

Why does TeamCity start new builds so slowly? 20 minutes already, agents are idle.

Hi! I've been waiting for my build to start for 20 minutes, is this expected? During this time I can see the agent summary changing, but the configurations don't start. What's wrong?

We felt this pain. It would not be an exaggeration to say that this issue was driving some of us crazy for quite a while.

The build queue bottleneck was so annoying that we decided to dig deep into the core code of the product. Eventually, we solved it. Here is a rundown of how we approached the problem.



The wait reasons

For TeamCity to start a build, a number of conditions have to be satisfied. For instance, cloud agents have to be launched, and fresh commits and settings have to be fetched from VCS repositories. If the build depends on shared resources, those resources have to become available. If there are snapshot dependencies, they have to finish too. Finally, when all the conditions are met, the queue optimization kicks in and checks whether the build should be started at all, or whether an already finished build with the same revision could be reused instead.
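As a rough illustration, these checks can be thought of as a chain that maps a queued build's state to a wait reason. The sketch below is not TeamCity's actual code; all types and field names are made up for illustration.

```kotlin
// A minimal sketch, not TeamCity's real API: each unmet condition maps to a wait reason.
enum class WaitReason {
    CHECKING_FOR_CHANGES,
    WAITING_FOR_SNAPSHOT_DEPENDENCIES,
    WAITING_FOR_SHARED_RESOURCES,
    WAITING_FOR_AGENT,
    NONE // all conditions are met; the queue optimizer decides whether to start or reuse a build
}

// Hypothetical snapshot of a queued build's state (field names are illustrative only).
data class QueuedBuildState(
    val changesCollected: Boolean,
    val snapshotDependenciesFinished: Boolean,
    val sharedResourcesAvailable: Boolean,
    val idleCompatibleAgentExists: Boolean
)

fun waitReasonFor(state: QueuedBuildState): WaitReason = when {
    !state.changesCollected -> WaitReason.CHECKING_FOR_CHANGES
    !state.snapshotDependenciesFinished -> WaitReason.WAITING_FOR_SNAPSHOT_DEPENDENCIES
    !state.sharedResourcesAvailable -> WaitReason.WAITING_FOR_SHARED_RESOURCES
    !state.idleCompatibleAgentExists -> WaitReason.WAITING_FOR_AGENT
    else -> WaitReason.NONE
}
```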

Experience from analyzing issues reported by our users shows that their configurations can be quite tricky and sometimes rely on nuanced behavior of certain TeamCity features. So when we get a complaint that a build doesn't start, it's not immediately clear where the problem is. It can be caused by a misconfiguration, or there can be a performance problem in TeamCity itself. Either way, we have to check a lot of different things to find out why the build isn't starting.

How much time was spent checking for changes? Well, one can go to the build log and see this information there. How much time was spent waiting for a shared resource? Unfortunately, it's not shown anywhere in the UI, but one can go to the server logs and find it there. How much time was spent on queue processing? Again, only one of the server logs has this information.

Overall, this was a major time waster. Although internally TeamCity had all the data about the duration of the different stages of a queued build, this information was shown in the user interface only after the build had already finished, and at that point it's too late to look for a bottleneck.

Once we realized that we were wasting time, we made one relatively simple improvement that helped us a lot. We started showing the reasons why a queued build doesn't start on the queued build page:

Waiting for build queue distribution

After that, we finally started to see that with a large number of builds in the build queue, the reason "Waiting for the build queue distribution process" was taking an unusually long time.



Build queue processing

In TeamCity, all the queued builds are processed by a single thread in a loop. So the wait reason "Waiting for the build queue distribution process" basically shows how much time TeamCity spent processing other builds in the queue before it reached the current build.

In a slightly simplified form, the process is as follows:

  1. Run an optimization algorithm and remove all the obsolete build chains (a build chain is obsolete if a newer build chain with a newer set of changes is also in the queue).
  2. Enter the loop over the queued builds.
  3. For each queued build:

3.1. Schedule a checking-for-changes operation or obtain a fresh copy of settings from version control.
3.2. Check the preconditions: whether all dependencies are finished, whether resources have become available, etc.
3.3. Find idle compatible agents and schedule the start of new cloud agents if necessary.
3.4. If an idle agent exists, check whether we really need to start the build, or whether we can reuse an already running or finished one.
3.5. If the build should be started: remove it from the queue and schedule its start in a separate thread.

  4. Go to Step 1.
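To make the loop easier to picture, here is a simplified sketch of what such a single-threaded distribution cycle could look like. All class and method names below are assumptions for illustration only; the real TeamCity implementation is considerably more involved.

```kotlin
// Illustrative interfaces only; these are not TeamCity's real APIs.
interface QueuedBuild {
    fun scheduleCheckingForChanges()          // step 3.1
    fun preconditionsSatisfied(): Boolean     // step 3.2
}
interface Agent
interface BuildQueue {
    fun removeObsoleteBuildChains()           // step 1
    fun snapshot(): List<QueuedBuild>
    fun canReuseExistingBuild(build: QueuedBuild): Boolean
    fun remove(build: QueuedBuild)
}
interface AgentManager {
    fun findIdleCompatibleAgents(build: QueuedBuild): List<Agent>
    fun scheduleCloudAgentStart(build: QueuedBuild)
}
interface BuildStarter {
    fun scheduleStart(build: QueuedBuild, agent: Agent) // runs on a separate thread pool
}

class BuildQueueProcessor(
    private val queue: BuildQueue,
    private val agents: AgentManager,
    private val starter: BuildStarter
) {
    fun runDistributionLoop() {
        while (true) {
            queue.removeObsoleteBuildChains()                           // step 1
            for (build in queue.snapshot()) {                           // steps 2–3
                build.scheduleCheckingForChanges()                      // step 3.1
                if (!build.preconditionsSatisfied()) continue           // step 3.2
                val idleAgents = agents.findIdleCompatibleAgents(build) // step 3.3
                if (idleAgents.isEmpty()) {
                    agents.scheduleCloudAgentStart(build)
                    continue
                }
                if (queue.canReuseExistingBuild(build)) {               // step 3.4
                    queue.remove(build)
                    continue
                }
                queue.remove(build)                                     // step 3.5
                starter.scheduleStart(build, idleAgents.first())
            }
        }                                                               // step 4: repeat
    }
}
```

Everything inside this loop runs on one thread, which is exactly why a single slow stage, or a single project flooding the queue, ends up delaying everyone else.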

Not all parts of this process were visible as wait reasons. Some parts were too low-level to be shown in the user interface. But it made sense to add metrics for each of these operations so that we could show them on a Grafana dashboard.
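Conceptually, the instrumentation boils down to wrapping each stage in a timer and publishing the measured duration. The sketch below is a rough idea of such a helper, with a generic record callback standing in for whatever metrics registry feeds the Grafana dashboard; it is not the actual TeamCity metrics code.

```kotlin
// A minimal timing helper: measures each stage and reports it via a caller-supplied callback.
class QueueStageMetrics(private val record: (stage: String, nanos: Long) -> Unit) {
    fun <T> timed(stage: String, block: () -> T): T {
        val start = System.nanoTime()
        try {
            return block()
        } finally {
            record(stage, System.nanoTime() - start)
        }
    }
}

fun main() {
    // Example: print stage durations; in production the callback would feed a metrics registry.
    val metrics = QueueStageMetrics { stage, nanos -> println("$stage took ${nanos / 1_000_000} ms") }
    metrics.timed("removeObsoleteBuildChains") { Thread.sleep(5) }
    metrics.timed("findIdleCompatibleAgents") { Thread.sleep(12) }
}
```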

It took some trial and error until we started to see how much time the different stages were taking.

TeamCity Grafana dashboard with different stages of the build queue processing



The "Move to top" issue

We already knew that the queue processing time was usually much smaller than the delays observed by users. For instance, when they claimed that their build had been in the queue for 40 minutes, the entire queue could be processed by the server in about 5 minutes (which is not fast at all, but it's not 40 minutes either). The newly added metrics confirmed it once again: the delays were significantly smaller than those reported by users. So why were the observed delays so high?

But then we discovered that there are many users who constantly use the "move to top" action in the build queue. Most likely there are also automated scripts which put builds at the top of the queue. So the builds of those who ended up complaining about huge delays were probably constantly overtaken by the builds of their colleagues. These "moved to the top" builds were taking all the agents, too. No wonder other builds couldn't start for much longer.



Fairness

Further analysis of the queue behavior showed that some builds could take a lot of time to process, while others were able to start immediately. It was not yet clear why, but given that the queue is processed by a single thread, this raised a valid concern about the fairness of the build queue. Why should some build wait for minutes just because another project constantly triggers a lot of builds?

We weren't yet able to pinpoint the root cause of the general slowness, but we still needed to provide proper service to our users. As a temporary measure, we decided to add a per-project limit on the number of queued builds processed in a single iteration.

For instance, let's say the server processes no more than 100 queued builds for a top-level project such as IntelliJ IDEA per single iteration of the loop. This would allow us to limit the resources spent on a single project and make the process more fair.
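A sketch of how such a cap might be applied when selecting builds for one iteration of the loop (the function and default limit below are illustrative, not the actual implementation):

```kotlin
// Keep at most `perProjectLimit` queued builds per top-level project for this iteration.
fun <B> selectBuildsForIteration(
    queuedBuilds: List<B>,
    topLevelProjectOf: (B) -> String,
    perProjectLimit: Int = 100
): List<B> {
    val takenPerProject = mutableMapOf<String, Int>()
    return queuedBuilds.filter { build ->
        val project = topLevelProjectOf(build)
        val taken = takenPerProject.getOrDefault(project, 0)
        if (taken < perProjectLimit) {
            takenPerProject[project] = taken + 1
            true
        } else {
            false // skipped builds get the "Reached the limit ..." wait reason shown below
        }
    }
}
```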

Once this was implemented, some of the projects with many triggered builds started showing a new wait reason:

Reached the limit for the number of concurrently starting builds in this project: 50

With this workaround in place, we bought ourselves some time to think about a proper fix, since the delays were now at least bearable.



Build queue parallelization

There were intensive discussions about the possibility of parallelizing the processing of the queue.

To clarify: processing of the queue is everything besides the actual start of the build on an agent. The start on an agent is where network communication happens, and fortunately this part is already out of the main loop and is done in a separate pool of threads. However, it's not so easy to parallelize the main loop, because the builds need to be processed in order. If a build was moved higher in the build queue, then it should start before the builds placed below it. Otherwise it's not a queue but rather an unordered collection.

But what if we divided the build queue by agent pools? Within the same set of agents, the queued builds could be handled in order, while groups of builds using different sets of agents could be handled in parallel. Well, it turns out many projects live in several pools, and some of these pools are shared with other independent projects. So it was not clear whether this division by agents could actually work.

It seemed that the division by agent pools could still be implemented relatively easily. However, the "parallel processing" part of the task was not so easy. In TeamCity, the build queue is a shared resource with optimized concurrent reads, but concurrent modifications are not fast. So there was a chance that with parallelized processing there would be a new bottleneck: modifying and persisting the build queue itself. Eliminating this bottleneck looked like a task for months, but we needed a more short-term solution.

A short-term solution would be to make the main loop go through the queue and put the builds that are ready to start into another queue, which could be handled by a separate thread working in parallel. In this case, the actual modification and persisting of the queue would happen in a single thread only. This seemed like a more incremental approach, which would also allow division by agent pools at some later point.
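In other words, the main loop becomes a producer that only marks builds as ready, while a separate consumer thread performs the actual start. A minimal sketch of this hand-off, with made-up names, might look like this:

```kotlin
import java.util.concurrent.LinkedBlockingQueue
import kotlin.concurrent.thread

// The main loop calls markReadyForStart(); a dedicated starter thread drains the queue.
// All modification and persisting of the real build queue stays on a single thread.
class ReadyToStartQueue<B> {
    private val ready = LinkedBlockingQueue<B>()

    // Called from the single distribution-loop thread: a cheap, non-blocking hand-off.
    fun markReadyForStart(build: B) {
        ready.offer(build)
    }

    // Runs in its own thread and performs the slow part (agent communication, etc.).
    fun startConsuming(startBuild: (B) -> Unit) =
        thread(isDaemon = true, name = "build-starter") {
            while (true) {
                startBuild(ready.take())
            }
        }
}
```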

With this in mind, we did a number of refactorings towards this approach. We also had to improve the build queue optimization algorithm to ensure a clean separation between the main loop thread and the thread which starts the builds marked for starting. And while changing the code, we found something interesting.



Compatible agents

With the newly published metrics, we noticed that the calculation of compatible agents was taking a lot of time in some cases. To the point that this calculation was the dominating factor of the whole queue processing.

This seemed weird. First of all, compatibility with an agent is cached inside a queued build, so it shouldn't be a constant issue. Secondly, we only need to compute compatibility with the currently idle agents, and their number is supposed to be small.

But the second part was apparently not true:

TeamCity compatible agents screen

With the mass adoption of AWS cloud agents, it became possible to have several hundred idle agents in the same agent pool, all of them compatible with a single queued build.

But why would there be dozens or sometimes hundreds of "JpsBootstrap Compile" builds in the queue? Well, because it participates in many build chains: Safe push, the Fleet project, Kotlin, and so on. And these build chains are triggered very often. Some of them are even triggered on every VCS commit.

Now it became clear why the compatibility calculation is a major bottleneck in projects with many hundreds of agents, such as IntelliJ IDEA, and why queued builds of this project delay the start of other builds.



Back to the queue processing algorithm

If we look at the queue processing algorithm again, we'll see something odd there:

3.3. Compute compatible agents and schedule the start of new cloud agents if necessary.
3.4. If an idle agent exists, check whether we really need to start the build, or whether we can reuse an existing one.

So we first go through hundreds of idle agents to find the compatible ones among them, and only then decide whether we really need to run this build? That's odd. Why don't we first decide whether the queued build can be replaced with some other build, and only compute the compatible agents if there is no suitable build and we really do need to start a new one? The fix looked like low-hanging fruit.
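The essence of the fix is just a reordering of the two steps: run the cheap queue-optimization check first and pay for the compatible-agents computation only if the build really has to start. A hedged sketch with illustrative names:

```kotlin
// Decide whether the build is needed at all before computing compatible agents.
fun <B, A> processQueuedBuild(
    build: B,
    canBeReplacedByExistingBuild: (B) -> Boolean, // relatively cheap queue optimization
    findIdleCompatibleAgents: (B) -> List<A>,     // expensive with hundreds of idle agents
    dropFromQueue: (B) -> Unit,
    scheduleStart: (B, A) -> Unit
) {
    // Previously the expensive agent computation ran first, even for builds that were
    // then immediately replaced by an existing build and never started at all.
    if (canBeReplacedByExistingBuild(build)) {
        dropFromQueue(build)
        return
    }
    val idleAgents = findIdleCompatibleAgents(build)
    if (idleAgents.isNotEmpty()) {
        dropFromQueue(build)
        scheduleStart(build, idleAgents.first())
    }
}
```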

Once implemented and deployed, this simple optimization had probably the most drastic effect on the build queue processing time. It turned out that it's much cheaper to run the build queue optimization process than to compute compatible agents when the number of agents is quite high.



Conclusion

It took us several months to finally resolve the issue. The path to the solution was not easy, and a number of small and not-so-small optimizations were implemented along the way.

The takeaway for us here is not to be shy about digging into the code and trying something new. We found the problem with the compatible agents because we were changing the code, trying to prepare it for parallelization. Without this refactoring, our chances of spotting the real cause would have been quite low.

This is how the build queue processing looks now in our metrics:

TeamCity build processing queue metrics

The delays are below 10 seconds most of the time, which seems acceptable for our buildserver, and as far as we can see, there are no more complaints in the Slack channel either.

All of the new optimizations became part of the TeamCity builds that we distribute to our customers. Hopefully, they felt these improvements too!
