Short version of my talk on how to keep CI/CD pipelines as fast as needed. This presentation delves into why fast build pipelines are important and explores different approaches to achieve and measure this.
Build Time (BT): time an individual build takes to run
Change Rate (CR): percentage of all commits in the system that land on an individual build
Useful Metrics
Weighted Impact Time (WIT): impact time of a build weighted according to its change rate
WIT(A) = IT(A) * CR(A)
Average Impact Time (AIT): total time needed, on average, to execute all necessary builds after any given commit anywhere in the system
AIT = WIT(A) + WIT(B) + ... + WIT(Z)
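Taken together, these definitions are straightforward to compute. A minimal sketch, with all build names and numbers invented:

```python
# Hypothetical per-build stats: impact time (minutes) and change rate
# (fraction of all commits that trigger this build), as defined above.
builds = {
    "A": {"impact_time": 30.0, "change_rate": 0.50},
    "B": {"impact_time": 12.0, "change_rate": 0.30},
    "C": {"impact_time": 8.0,  "change_rate": 0.20},
}

def wit(b):
    """Weighted Impact Time: WIT = IT * CR."""
    return b["impact_time"] * b["change_rate"]

# Average Impact Time: sum of all weighted impact times.
ait = sum(wit(b) for b in builds.values())
print(ait)  # 30*0.5 + 12*0.3 + 8*0.2 ≈ 20.2
```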
Average Impact Time
Average Impact Time is what indicates how well you have scaled your system
Sample Thresholds
Maximum Impact Time
In a worst-case scenario, a build won’t take longer than this.
Maximum Impact Time for Critical Components
The same, but only for your most sensitive modules (log-in, payment gateway, etc.)
Beware of dependencies!
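Thresholds like these lend themselves to an automated check; a sketch, with all threshold values, build names and times invented:

```python
# Invented thresholds (minutes): a global ceiling and a stricter
# one for the most sensitive modules.
MAX_IMPACT_TIME = 45.0
MAX_IMPACT_TIME_CRITICAL = 15.0
CRITICAL = {"login", "payments"}

def violations(impact_times):
    """Return the builds whose impact time exceeds the applicable threshold."""
    out = []
    for name, it in impact_times.items():
        limit = MAX_IMPACT_TIME_CRITICAL if name in CRITICAL else MAX_IMPACT_TIME
        if it > limit:
            out.append(name)
    return out

impact_times = {"login": 12.0, "payments": 18.0, "reports": 40.0}
print(violations(impact_times))  # ['payments']
```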
As you can see, when your pipeline isn’t building fast enough, the trick is to reshape the architecture so that red builds end up towards the right. That is, however, only if you actually need to make your pipeline more efficient. But do you?
PERFORMANCE: ESTABLISH A THRESHOLD, MEASURE, CHANGE IF ABOVE
CALCULATE IN DIFFERENT WAYS, DEPENDING ON OUR PARALLEL EXECUTION CAPABILITIES
If we don’t have the ability to run builds in parallel, then we’ll run A and then B and C (or C and B). In any case, the impact time will be the sum of all of them.
CLICK
If we allow parallel execution, then both B and C will be triggered at the same time after A, which means we’ll only have to wait for the slowest of the two.
CLICK
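In code, the two scenarios differ only in whether the dependent build times are summed or taken as a maximum (all build times invented):

```python
# A triggers B and C; build times in minutes (invented).
bt = {"A": 10, "B": 6, "C": 4}

# Sequential executor: wait for A, then B, then C.
it_sequential = bt["A"] + bt["B"] + bt["C"]    # 20

# Parallel executor: B and C start at the same time after A,
# so we only wait for the slower of the two.
it_parallel = bt["A"] + max(bt["B"], bt["C"])  # 16

print(it_sequential, it_parallel)
```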
Bear in mind these are only approximations. In real life your ability to run things in parallel may be limited by the total number of slaves (maybe you can only run up to 5 builds in parallel) or by other shared resources (maybe you only have one staging database and two builds cannot get hold of it at the same time). But, despite being approximations, they are a good way to establish a baseline to track and compare.
CLICK
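A limited pool of slaves can be approximated with a simple list-scheduling simulation; a sketch assuming identical agents and no other shared resources (all numbers invented):

```python
import heapq

def makespan(build_times, agents):
    """Approximate total time to run independent builds on a limited
    pool of agents, scheduling greedily (longest build first)."""
    finish = [0.0] * agents  # min-heap of agent finish times
    heapq.heapify(finish)
    for t in sorted(build_times, reverse=True):
        earliest = heapq.heappop(finish)  # first agent to free up
        heapq.heappush(finish, earliest + t)
    return max(finish)

times = [8, 7, 6, 5, 4]           # invented build durations (minutes)
print(makespan(times, agents=5))  # fully parallel: 8
print(makespan(times, agents=2))  # capacity-limited: 17
```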
There is something interesting to note about Impact Time, which is that it grows as you go up in the hierarchy. This graph shows Build Time as the size of the bubbles, but the Impact Time of each bubble includes, directly or indirectly, that of its dependants. This means the Parent POM will be the build with the highest Impact Time, since whenever we change it we have to rebuild absolutely everything. Now, is that a problem? Maybe not, because it’s also the least modified build (hence its colour). This leads us to conclude that we need to assess the relationship between Impact Time and Change Rate, which brings us to the next metric.
CLICK
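The way Impact Time grows up the hierarchy can be sketched as a traversal of the dependency graph (graph and build times invented; this is the pessimistic sequential sum):

```python
# downstream[x] = builds triggered when x changes (invented graph).
downstream = {
    "parent-pom": ["lib-a", "lib-b"],
    "lib-a": ["app"],
    "lib-b": ["app"],
    "app": [],
}
build_time = {"parent-pom": 2, "lib-a": 5, "lib-b": 4, "app": 10}

def impact_time(root):
    """Sequential (worst-case) Impact Time: sum of the build times of
    root plus everything it transitively triggers, each counted once."""
    seen, stack, total = set(), [root], 0
    while stack:
        b = stack.pop()
        if b in seen:
            continue
        seen.add(b)
        total += build_time[b]
        stack.extend(downstream[b])
    return total

print(impact_time("parent-pom"))  # 2+5+4+10 = 21, highest in the graph
print(impact_time("app"))         # 10, a leaf only rebuilds itself
```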
This value allows us to compare which builds cause the highest impact over a period of time, and it tells us when an impactful build changes infrequently enough not to be a problem. And then, by combining all the weighted impact times…
CLICK
We get to the Average Impact Time, which tells us how long, on average, it takes our build system to rebuild all the necessary modules after a commit anywhere in the system. Now we’re really onto something, because with all these metrics in hand we have a way to define (CLICK) thresholds that are useful to us.
Now, let’s take a moment to reflect on all this. We’re defining metrics based on build duration, but also on change rate. And we are considering architectural changes, restructuring of modules, based on these data. But let’s take a closer look at this temperature graph. It is driven by dependencies among builds, but also by where I am making changes. That means that some of the attributes of this graph will change over time as developers focus on different parts of the system to develop different features. So the optimal shape of the system will change according to the data of our builds, and what was a good idea yesterday may not be one today.
Let’s also note that all these graphs were created manually, and so was the analysis. I had to do it by hand because there aren’t any tools (that I know of) that can provide this information for you. And, useful as this is, you can’t do it too often, because CLICK manual processing takes time.
This is an F1 steering wheel. The wheel alone costs $30,000 to produce. Do you see all those buttons, dials and displays? An F1 car has tons of sensors all over the place, measuring everything from the temperature of the tires, to the weight of the car (it becomes lighter as it consumes fuel), to the aerodynamic forces experienced by different surfaces. All that data is analysed and presented to the driver, so the driver can adjust different parameters to adapt the performance of the car to the circumstances. This is the kind of approach we need to take if we want a fast build pipeline: we need to measure, we need to analyse, and we need to spend time, effort and money building the tools to manage the pipeline. It can’t be a pet project. Now, if you’re interested in building this kind of tool, let’s have a chat.
CLICK
CI/CD can be your worst bottleneck
Keeping your CI/CD fast is a performance tuning activity, approach it as such
No proper tools available, help me build them