JGriffin's Blog

Is this thing on?

Monthly Archives: August 2015

Engineering Productivity Update, August 26, 2015

It’s PTO season and many people have taken a few days or week off.  While they’re away, the team continues making progress on a variety of fronts.  Planning also continues for GoFaster and addon-signing, which will both likely be significant projects for the team in Q4.

Highlights

Treeherder: camd rolled out a change which collapses chunked jobs on Treeherder, reducing visual noise.  In the future, we plan on significantly increasing the number of chunks of many jobs in order to reduce runtimes, so this change makes that work more practical.  See camd’s blog post.  emorley has landed a change which allows TaskCluster job errors that occur outside of mozharness to be properly handled by Treeherder.

Automatic Starring: jgraham has developed a basic backend which supports recognizing simple intermittent failures, and is working on integrating that into Treeherder; mdoglio is landing some related database changes. ekyle has received sheriff training from RyanVM, and plans to use this to help improve the automated failure recognition algorithm.

Perfherder and Performance Testing: Datazilla has finally been decommissioned (R.I.P.), in favor of our newest performance analysis tool, Perfherder.  A lot of Talos documentation updates have been made at https://wiki.mozilla.org/Buildbot/Talos, including details about how we perform calculations on data produced by Talos.  wlach performed a useful post-mortem of Eideticker, with several takeaways which should be applicable to many other projects.

MozReview and Autoland: There’s a MozReview meetup underway, so expect some cool updates next time!

TaskCluster Support: ted has made a successful cross-compiled OSX build using TaskCluster!  Take it for a spin.  More work is needed before we can move OSX builds from the mac mini builders to the cloud.

Mobile Automation: gbrown continues to make improvements on the new |mach emulator| command which makes running Android tests locally on emulator very simple.

General Automation: run-by-dir is live on opt mochitest-plain; debug and ASAN coming soon.  This reduces test “bleed-through” and makes it easier to change chunking.  adusca, our Outreachy intern, is working to integrate the try extender into Treeherder.  And ahal has merged the mozharness “in-tree” configs with the regular mozharness config files, now that mozharness lives in the tree.

Firefox Automation: YouTube ad detection has been improved for firefox-media-tests by maja, which fixes the source of the top intermittent failure in this suite.

Bughunter: bc has got asan-opt builds running in production, and is working on gtk3 support.

hg.mozilla.org: gps has enabled syntax highlighting in hgweb, and has added a new JSON API as well.  See gps’ blog post.

The Details

bugzilla.mozilla.org
Treeherder
Perfherder/Performance Testing
  • talos cleanup and preparation to move in-tree
  • perfherder database cleanup in progress for simpler and more optimized queries. This is mainly preparatory work for making perfherder capable of managing/starring performance alerts, but as a bonus perfherder compare view should load virtually instantly once this is finished. 
  • most talos wiki docs are updated: https://wiki.mozilla.org/Buildbot/Talos
TaskCluster Support
Mobile Automation
  •  [gbrown] Working on “mach emulator” support: wip can download and run 2.3, 4.3, or x86 emulator images. Integrating with other mach commands like “install” and “mochitest”.
  •  [gbrown] Updated mochitest manifests to run most dom/media mochitests on Android 4.3 (under review, bug 1189784)
Firefox and Media Automation
  • [maja_zf] Improved ad detection on YouTube for firefox-media-tests, which fixes our top intermittent failure for long-running playback tests.
General Automation
  •  run-by-dir is live for mochitest-plain (opt only); debug is coming soon, followed by ASAN.
  • Mozilla CI tools is moving from using BuildAPI as the scheduling entry point to use TaskCluster’ scheduling. This work will allow us to schedule a graph of buildbot jobs and their dependencies in one shot. https://bugzil.la/1194264
  • adusca is integrating into treeherder the ability to extend the jobs run for any push. This is based on the http://try-extender.herokuapp.com prototype. Follow along in https://bugzil.la/1194830
  • Git was deployed to the test machines. This is necessary to make the Firefox UI update tests work on them.
  • [ahal] merge mozharness in-tree configs with the main mozharness configs
ActiveData
  • Bug fixes to the ETL – fix bad lookups on hg repo, mostly l10n builds 
  • More error reporting on ETL – Structured logging has changed a little, handle the new variations, and be more elegant when it comes to unknowns, an complain when there is non-conformance.
  • Some work on adding hg `repo` table – acts as a cache for ETL, but can be used to calculate ‘per push’ statistics on OrangeFactor data.
  • Added Talos to the `perf` table – used the old Datazilla ETL code to fill the ES cluster.  This may speed up extracting the replicates, for exploring the behaviour of a test.
  • Enable deep queries – Effectively performing SQL join on Elasticsearch – first attempt did too much refactoring.  Second attempt is simpler, but still slogging through all the resulting test breakage
hg.mozilla.org
WebDriver
  • Updated 
Marionette
  • [ahal] helped review and finish contributor patch for switching marionette_client from optparse to argparse
  • Corrected UUID used for session ID and element IDs
  • Updated dispatching of various marionette calls in Gecko
bughunter
  • [bc] Have asan-opt builds running in production. Finalizing patch. Still need to build gtk3 for rhel6 32bit in order to stop using custom builds and support opt in addition to debug.
charts.mozilla.org
  • Updated the hierarchical burndowns to EPM’s important metabugs that track features 
  • More config changes

Engineering Productivity Update, August 13, 2015

From Automation and Tools to Engineering Productivity

Automation and Tools” has been our name for a long time, but it is a catch-all name which can mean anything, everything, or nothing, depending on the context. Furthermore, it’s often unclear to others which “Automation” we should own or help with.

For these reasons, we are adopting the name “Engineering Productivity”. This name embodies the diverse range of work we do, reinforces our mission (https://wiki.mozilla.org/Auto-tools#Our_Mission), promotes immediate recognition of the value we provide to the organization, and encourages a re-commitment to the reason this team was originally created—to help developers move faster and be more effective through automation.

The “A-Team” nickname will very much still live on, even though our official name no longer begins with an “A”; the “get it done” spirit associated with that nickname remains a core part of our identity and culture, so you’ll still find us in #ateam, brainstorming and implementing ways to make the lives of Mozilla’s developers better.

Highlights

Treeherder: Most of the backend work to support automatic starring of intermittent failures has been done. On the front end, several features were added to make it easier for sheriffs and others to retrigger jobs to assist with bisection: the ability to fill in all missing jobs for a particular push, the ability to trigger Talos jobs N times, the ability to backfill all the coalesced jobs of a specific type, and the ability to retrigger all pinned jobs. These changes should make bug hunting much easier.  Several improvements were made to the Logviewer as well, which should increase its usefulness.

Perfherder and performance testing: Lots of Perfherder improvements have landed in the last couple of weeks. See details at wlach’s blog post.  Meanwhile, lots of Talos cleanup is underway in preparation for moving it into the tree.

MozReview: Some upcoming auth changes are explained in mcote’s blog post.

Mobile automation: gbrown has converted a set of robocop tests to the newly enabled mochitest-chrome on Android. This is a much more efficient harness and converting just 20 tests has resulted in a reduction of 30 minutes of machine time per push.

Developer workflow: chmanchester is working on building annotations into moz.build files that will automatically select or prioritize tests based on files changed in a commit. See his blog post for more details. Meanwhile, armenzg and adusca have implemented an initial version of a Try Extender app, which allows people to add more jobs on an existing try push. Additional improvements for this are planned.

Firefox automation: whimboo has written a Q2 Firefox Automation Report detailing recent work on Firefox Update and UI tests. Maja has improved the integration of Firefox media tests with Treeherder so that they now officially support all the Tier 2 job requirements.

WebDriver and Marionette: WebDriver is now officially a living standard. Congratulations to David Burns, Andreas Tolfsen, and James Graham who have contributed to this standard. dburns has created some documentation which describes which WebDriver endpoints are implemented in Marionette.

Version control: The ability to read and extra metadata from moz.build files has been added to hg.mozilla.org. This opens the door to cool future features, like the ability auto file bugs in the proper component and automatically selecting appropriate reviewers when pushing to MozReview. gps has also blogged about some operational changes to hg.mozilla.org which enables easier end-to-end testing of new features, among other things.

The Details

bugzilla.mozilla.org
Treeherder/Automatic Starring
  • almost finished the required changes to the backend (both db schema and data ingestion)
Treeherder/Front End
  • Several retrigger features were added to Treeherder to make merging and bisections easier:  auto fill all missing/coalesced jobs in a push; trigger all Talos jobs N times; backfill a specific job by triggering it on all skipped commits between this commit and the commit that previously ran the job, retrigger all pinned jobs in treeherder.  This should improve bug hunting for sheriffs and developers alike.
  • [jfrench] Logviewer ‘action buttons’ are now centralized in a Treeherder style navbar https://bugzil.la/1183872
  • [jfrench] Logviewer skipped steps are now recognized as non-failures and presented as blue info steps https://bugzil.la/1192195, https://bugzil.la/1192198
  • [jfrench] Middle-mouse-clicking on a job in treeherder now launches the Logviewer https://bugzil.la/1077338
  • [vaibhav] Added the ability to retrigger all pinned jobs (bug https://bugzil.la/1121998)
  • Camd’s job chunking management will likely land next week https://bugzil.la/1163064
Perfherder/Performance Testing
  • [wlach] / [jmaher] Lots of perfherder updates, details here: http://wrla.ch/blog/2015/08/more-perfherder-updates/ Highlights below
  • [wlach] The compare pushes view in Perfherder has been improved to highlight the most important information.
  • [wlach] If your try push contains Talos jobs, you’ll get a url for the Perfherder comparison view when pushing (https://bugzil.la/1185676).
  • [jmaher/wlach] Talos generates suite and test level metrics and perfherder now ingests those data points. This fixes results from internal benchmarks which do their own summarization to report proper numbers.
  • [jmaher/parkouss] Big talos updates (thanks to :parkouss), major refactoring, cleanup, and preparation to move talos in tree.
MozReview/Autoland
Mobile Automation
  •  [gbrown] Demonstrated that some all-javascript robocop tests can run more efficiently as mochitest-chrome; about 20 such tests converted to mochitest-chrome, saving about 30 minutes per push.
  •  [gbrown] Working on “mach emulator” support: wip can download and run 2.3, 4.3, or x86 emulator images. Sorting out cache management and cross-platform issues.
  •  [jmaher/bc] landed code for tp4m/tsvgx on autophone- getting closer to running on autophone soon.
Dev Workflow
  • [ahal] Created patch to clobber compiled python files in srcdir
  • [ahal] More progress on mach/mozlog patch https://bugzil.la/1027665)
  • [chmanchester] Fix to allow ‘mach try’ to work without test arguments (bug 1192484)
Media Automation
  • [maja_zf] firefox-media-tests ‘log steps’ and ‘failure summaries’ are now compatible with Treeherder’s log viewer, making them much easier to browse. This means the jobs now satisfy all Tier-2 Treeherder requirements.
  • [sydpolk] Refactoring of tests after fixing stall detection is complete. I can now take my network bandwidth prototype and merge it in.
Firefox Automation
General Automation
  • Finished adapting mozregression (https://bugzil.la/1132151) and mozdownload (https://bugzil.la/1136822) to S3.
  • (Henrik) Isn’t archive.mozilla.org only a temporary solution, before we move to TC?
  • (armenzg) I believe so but for the time being I believe we’re out of the woods
  • The manifestparser dependency was removed from mozprofile (bug 1189858)
  • [ahal] Fix for https://bugzil.la/1185761
  • [sydpolk] Platform Jenkins migration to the SCL data center has not yet begun in earnest due to PTO. Hope to start making that transition this week.
  • [chmanchester] work in progress to build annotations in to moz.build files to automatically select or prioritize tests based on what changed in a commit. Strawman implementation posted in https://bugzil.la/1184405 , blog post about this work at http://chmanchester.github.io/blog/2015/08/06/defining-semi-automatic-test-prioritization/
  • [adusca/armenzg] Try Extender (http://try-extender.herokuapp.com ) is open for business, however, a new plan will soon be released to make a better version that integrates well with Treeherder and solves some technichal difficulties we’re facing
  • [armenzg] Code has landed on mozci to allow re-triggering tasks on TaskCluster. This allows re-triggering TaskCluster tasks on the try server when they fail.
  • [armenzg] Work to move Firefox UI tests to the test machines instead of build machines is solving some of the crash issues we were facing
ActiveData
  • [ahal] re-implemented test-informant to use ActiveData: http://people.mozilla.org/~ahalberstadt/test-informant/
  • [ekyle] Work on stability: Monitoring added to the rest of the ActiveData machines.  
  • [ekyle] Problem:  ES was not balancing the import workload on the cluster; probably because ES assumes symmetric nodes, and we do not have that.  The architecture was changed to prefer a better distribution of work (and query load) – There now appears to be less OutOfMemoryExceptions, despite test-informant’s queries.
  • [ekyle] More Problems:  Two servers in the ActiveData complex failed: The first was the ActiveData web server; which became unresponsive, even to SSH.  The machine was terminated.  The second server was the ‘master’ node of the ES cluster: This resulted in total data loss, but it was expected to happen eventually given the cheap configuration we have.   Contingency was in place:  The master was rebooted, the  configuration was verified, and data re-indexed from S3.   More nodes would help with this, but given the rarity of the event, the contingency plan in place, and the low number of users, it is not yet worth paying for. 
WebDriver (highlights)
  • [ato] WebDriver is now officially a living standard (https://sny.no/2015/08/living)
  • [ato] Rewrote chapter on extending the WebDriver protocol with vendor-specific commands
  • [ato] Defined the Get Element CSS Value command in specification
  • [ato] Get Element Attribute no longer conflates DOM attributes and properties; introduces new command Get Element Property
  • [ato] Several significant infrastructural issues with the specification was fixed
Marionette
hg.mozilla.org
charts.mozilla.org
  • Project managers for FxOS have a renewed interest in the project tracking, and overall status dashboards.   Talk only, no coding yet.