GoFaster: deeper data analysis
September 6, 2011
For the GoFaster project, releng and the A-team have been working on various tasks which we hope will bring the total commit-to-all-tests-done time down to 2 hours for the main branches (try excluded). This total turnaround time was 6-8 hours a couple of months ago when we began this project.
We’ve recently made some improvements that seriously reduce the total machine time required to run all tests for a given commit. These include hiding the mochitest results table, removing packed.js from mochitest, and streamlining individual slow tests (see bug 674738, bug 676412, and bug 670229). Together, these have reduced the total machine time for tests from about 40 hours to around 25 hours per commit, a big win.
However, the total turnaround times are still much longer than our goal.
We already knew that PGO builds are slow, and jhford is working on turning on-demand builds into non-PGO builds and making PGO builds only every four hours (bug 658313). However, we needed a way to dig deeper into the data to see what our other pain points are.
Will Lachance made some awesome build charts which help us visualize what’s going on in these buildbot jobs. Clicking any commit will show a chart that displays all the relevant buildbot jobs in relative clock time; this makes it easier to see where the bottlenecks are.
Display the build chart for just about any commit (e58e98a89827 for instance), and you’ll see the problem right away: just about every commit includes builds that far exceed 2 hours. These aren’t always opt builds, and they sometimes occur even on our ‘fast’ OS: linux. Check out 5d9989c3bff6, which has a linux64 opt build that takes 214 minutes, compared to the linux32 opt build that takes 61 minutes. 198c7de0699d has an OSX 10.5 debug build that takes 171 minutes, while the 10.6 debug build takes only 82 minutes. Clearly, we can’t hit our 2-hour goal when the builds alone take 2+ hours. What’s going on?
It’s necessary to spend a little time digging through build logs to find out. It turns out there are multiple factors.
- We already know that PGO builds are slow, particularly on Windows. Once bug 658313 lands, we expect the overall situation to improve dramatically.
- On some builds, the ‘update’ step includes a full ‘hg clone’ of mozilla-central, while others use ‘hg pull -u’. Below is a graph of update times; the average time for an update that includes ‘hg clone’ is 12.9 min, while for those that use ‘hg pull’ the average is 0.6 min. Each full clone is costing us an average of about 12 minutes.
- On some build slaves, we do a full build (with no obj dir from a previous build); on others we do an incremental build. Below is a graph showing incremental vs full compile times for opt and debug builds. On average, full compiles are taking 17 minutes longer than incremental ones.
- We have a mix of slow and fast slaves. This can easily be seen in the graph of linux compile times below. On linux and linux64 builds, full compiles with moz2-linux(64)-* slaves are slow (those > 75 min), while those made with linux(64)-ix-* slaves are fast (those < 75 min). 32-bit mac builds show a similar split, with those on moz2-darwin9* slaves slow, and those on bm-xserve* slaves fast. Hardware doesn’t appear to make a significant difference for windows and 64-bit mac builds.
- On macosx64 machines, the ‘alive test’ step takes an average of 6 min (vs 1 min on other OSes).
- The ‘checking clobber times’ step often takes just a couple of seconds; however, when this step actually results in some clobbering being done, it can take up to 21 minutes (average: 6 min).
When all these factors coincide, we can get builds (which include compile, update, and other steps) that exceed 4 hours. This suggests doing away with on-demand PGO builds may not in itself get us to our 2-hour goal.
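To get a rough feel for how these factors stack up, here is a back-of-the-envelope sketch in Python that simply adds the average penalties quoted in the bullets above. The dictionary keys and the function name are made up for illustration, and treating the averages as additive is an assumption: they come from different samples of builds, so the total is only indicative. Slow hardware and PGO are left out because no single average penalty is given for them above.

```python
# Average per-step penalties in minutes, taken from the bullets above.
# Adding them together is an illustrative assumption, not a measurement.
PENALTIES_MIN = {
    "full_hg_clone": 12.9 - 0.6,        # 'hg clone' update vs 'hg pull -u'
    "full_compile": 17.0,               # full vs incremental compile
    "clobber": 6.0,                     # average when clobbering actually happens
    "macosx64_alive_test": 6.0 - 1.0,   # macosx64 vs ~1 min on other OSes
}


def worst_case_overhead(penalties=PENALTIES_MIN):
    """Sum the average penalties for a build that hits every factor."""
    return sum(penalties.values())


if __name__ == "__main__":
    print(f"{worst_case_overhead():.1f} min of avoidable overhead")
```

Under these assumptions the avoidable overhead comes to around 40 minutes for an unlucky macosx64 build, before slow slaves or PGO are taken into account.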
From this data, two of the more obvious ways to improve our build times might be:
- Investigate retiring slow linux and 32-bit mac build slaves.
- Investigate ways to reduce clobbering. Clobbering itself takes time (see the ‘checking clobber times’ bullet above), but it also costs time indirectly through increased update and compile times. Currently, about 51% of our builds are operating on clobbered slaves, requiring full hg clones and full compiles. If this number could be reduced, we might see a significant reduction in our average turnaround times.
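As a rough illustration of what reducing clobbering might buy us, the sketch below estimates the average per-build cost of clobbering at a given clobber rate. It assumes every clobbered build pays the average clobber, full-clone and full-compile penalties quoted earlier, and that these simply add up; the function name and the 25% comparison rate are hypothetical, chosen only to show the shape of the calculation.

```python
# Average penalties (minutes) paid by a build on a clobbered slave,
# using the averages quoted earlier in this post. Treating them as
# additive and platform-independent is an assumption.
CLOBBER_MIN = 6.0              # the clobber itself, when it happens
CLONE_EXTRA_MIN = 12.9 - 0.6   # full 'hg clone' vs 'hg pull -u'
COMPILE_EXTRA_MIN = 17.0       # full vs incremental compile


def expected_clobber_cost(clobber_rate):
    """Expected extra minutes per build, averaged over all builds."""
    if not 0.0 <= clobber_rate <= 1.0:
        raise ValueError("clobber_rate must be between 0 and 1")
    per_clobbered_build = CLOBBER_MIN + CLONE_EXTRA_MIN + COMPILE_EXTRA_MIN
    return clobber_rate * per_clobbered_build


if __name__ == "__main__":
    current = expected_clobber_cost(0.51)   # rate reported above
    reduced = expected_clobber_cost(0.25)   # hypothetical target rate
    print(f"current: {current:.1f} min/build, reduced: {reduced:.1f} min/build")
```

With the current 51% rate this works out to roughly 18 minutes per build on average; the real figure depends on how the penalties are distributed across platforms and slaves.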
According to Will’s build charts, the E2E time for tests is often within our 30-minute target range. The exception is mochitest-other on debug builds, which often takes from 60 to 90 minutes. We could improve this situation somewhat by splitting mochitest-browser-chrome (the longest-running chunk of mochitest-other) into its own test job.
Additionally, wait times for test slaves running android and win 7 tests are sometimes non-trivial; see e.g. the details for commit 97216ae0fc04. We should try to understand why this happens; the graph of test wait times doesn’t show a clear trend, other than highlighting the fact that wait times for windows and android are usually worse than for the other OSes.