2014: Engineering Operations Year in Review

On the first day of Mozlandia, Johnny Stenback and Doug Turner presented a list of key accomplishments in Platform Engineering/Engineering Operations in 2014.

I have been told a few times recently that people don’t know what my teams do, so in the interest of addressing that, I thought I’d share our part of the list. It was a pretty damn good year for us, all things considered, and especially given the level of organizational churn and other distractions.

We had a bit of organizational churn ourselves. I started the year managing Web Engineering, and between March and September ended up also managing the Release Engineering teams, Release Operations, SUMO and Input Development, and Developer Services. It’s been a challenging but very productive year.

Here’s the list of what we got done.

Web Engineering

  • Migrated crash-stats storage off HBase and into S3
  • Launched the crash-stats “hacker” API, giving access to search, raw data, and reports (see the sketch after this list)
  • Shipped fully-localized Firefox Health Report on Android
  • Many new crash-stats reports, including GC-related crashes, JS crashes, graphics adapter summary, and modern correlation reports
  • Crash-stats reporting for B2G
  • Pluggable processing architecture for crash-stats, and alternate crash classifiers
  • Symbol upload system for partners
  • Migrated l10n.mozilla.org to a modern, flexible backend
  • Prototyped services for checking the health of the browser and a support API
  • Solved scaling problems in Moztrap to reduce pain for QA
  • New admin UI for Balrog (the new update server)
  • Bouncer: correctness testing, continuous integration, a staging environment, and multi-homing for high availability
  • Grew Air Mozilla community contributions from 0 to 6 non-staff committers
  • Many new features for Air Mozilla, including: direct download for offline viewing of public events, a tear-out video player, a WebRTC self-publishing prototype, a Roku channel, multi-rate HLS streams for auto-switching to the optimal bitrate, search over transcripts, integration with Mozilla Popcorn functionality, and access control based on Mozillians groups (e.g. “nda”)
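A quick illustration of what the “hacker” API enables: the sketch below queries a crash-stats search endpoint for recent Firefox crashes and prints their signatures. It is a minimal example rather than authoritative documentation; the endpoint and parameter names here are from memory, so check the API docs on crash-stats before building on them.

```python
# Minimal sketch of the crash-stats "hacker" API in use.
# Endpoint and parameter names are illustrative; consult the
# crash-stats API documentation for the authoritative interface.
import requests

API = "https://crash-stats.mozilla.com/api/SuperSearch/"

resp = requests.get(API, params={
    "product": "Firefox",
    "date": ">=2014-12-01",    # crashes reported since December 1st
    "_results_number": 5,      # fetch just a handful of hits
})
resp.raise_for_status()
data = resp.json()

print("total matching crashes:", data["total"])
for hit in data["hits"]:
    print(hit["signature"])
```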

DXR

  • Modeless, explorable UI with all-new JS
  • Case-insensitive searching
  • Proof-of-concept Rust analysis
  • Improved C++ analysis, with lots of new search types
  • Multi-tree support
  • Multi-line selection (linkable!)
  • HTTP API for search
  • Line-based searching
  • Multi-language support (Python already implemented, Rust and JS in progress)
  • Elasticsearch backend, bringing speed and features
  • Completely new plugin API, enabling binary file support and request-time analysis

SUMO

  • Offline SUMO app in Marketplace
  • SUMO Community Hub
  • Improved SUMO search with synonyms
  • Instant search for SUMO
  • Redesigned and improved SUMO support forums
  • Improved support for more products in SUMO (Thunderbird, Webmaker, Open Badges, etc.)
  • BuddyUP app (live support for FirefoxOS) (in progress, TBC Q1 2015)

Input

  • “Dashboards for everyone” infrastructure, allowing anyone to build charts and dashboards using Input data (see the sketch after this list)
  • Backend for heartbeat v1 and v2
  • Overhauled the feedback form to support multiple products, streamline the user experience, and prepare for future changes
  • Support for Loop/Hello, Firefox Developer Edition, Firefox 64-bit for Windows
  • Infrastructure for automated machine and human translations
  • Massive infrastructure overhaul to improve overall quality
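Since the point of the dashboards work is that anyone can build on Input data, here is a hedged sketch of what that looks like in practice: fetch public feedback from the Input API and run the kind of aggregation a dashboard might start from. The endpoint and parameter names are assumptions from memory rather than a spec, so check the Input API documentation before relying on them.

```python
# Hedged sketch: pulling public Input feedback to seed a chart or
# dashboard. The endpoint and parameter names are assumptions;
# treat them as illustrative rather than authoritative.
import collections

import requests

resp = requests.get(
    "https://input.mozilla.org/api/v1/feedback/",
    params={"products": "Firefox", "max": 100},  # assumed parameters
)
resp.raise_for_status()

# Tally happy vs. sad responses, the simplest aggregation a
# dashboard built on this data might start from.
counts = collections.Counter(
    "happy" if item["happy"] else "sad"
    for item in resp.json()["results"]
)
print(counts)
```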

Release Engineering

  • Cut AWS costs by over 70% during 2014 by switching builds to spot instances and using intelligent bidding algorithms (see the sketch after this list)
  • Migrated all hardware out of SCL1 and closed the datacenter, saving $1 million per year (with RelOps)
  • Optimized network transfers for build/test automation between datacenters, decreasing bandwidth usage by 50%
  • Halved build time on b2g-inbound
  • Parallelized verification steps in release automation, saving over an hour off the end-to-end time required for each release
  • Decommissioned legacy systems (e.g. Tegras, Tinderbox) (with RelOps)
  • Enabled build slave reboots via API
  • Self-serve arbitrary builds via API
  • B2G FOTA updates
  • Builds for OpenH264
  • Built a flexible new update service (Balrog) to replace the legacy system (will ship the first week of January)
  • Support for 64-bit Windows as a first-class platform
  • Supported FX10 builds and releases
  • Release support for the switch to Yahoo! search
  • Update server support for OpenH264 plugins and Adobe’s CDM
  • Implemented signing of the EME sandbox
  • Per-checkin and nightly Flame builds
  • Moved desktop Firefox builds to mach+mozharness, improving reproducibility and hackability for devs.
  • Helped mobile team ship different APKs targeted by device capabilities rather than a single, monolithic APK.
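To give a flavour of the spot instance work, here is a minimal sketch of the general technique, written against the boto library that was current in 2014. To be clear, this is not Mozilla's actual bidding algorithm; the instance type, AMI ID, on-demand price, and bidding margin are all hypothetical. The idea is simply to look at recent spot price history, bid a margin above the recent peak, and fall back to on-demand capacity when the spot market gets too expensive.

```python
# Hypothetical spot-instance bidding sketch using 2014-era boto.
# Not Mozilla's actual algorithm; the instance type, AMI, and prices
# are made up for illustration.
from datetime import datetime, timedelta

import boto.ec2

ON_DEMAND_PRICE = 0.35  # assumed on-demand $/hr for this instance type

conn = boto.ec2.connect_to_region("us-east-1")

# Look at the last hour of spot prices for the instance type.
now = datetime.utcnow()
history = conn.get_spot_price_history(
    start_time=(now - timedelta(hours=1)).isoformat(),
    end_time=now.isoformat(),
    instance_type="c3.xlarge",
    product_description="Linux/UNIX",
)

recent_peak = max(point.price for point in history)
bid = recent_peak * 1.2  # bid 20% over the recent peak

if bid < ON_DEMAND_PRICE:
    conn.request_spot_instances(
        price=str(bid),
        image_id="ami-12345678",  # hypothetical build-slave AMI
        count=1,
        instance_type="c3.xlarge",
    )
else:
    # Spot is currently more expensive than on-demand; a real system
    # would start an on-demand instance here instead.
    pass
```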

Release Operations

  • Decreased operating costs by $1 million per year by consolidating infrastructure from one datacenter into another (with RelEng)
  • Decreased operating costs and improved reliability by decommissioning legacy systems (KVM, Redis, r3 Mac minis, Tegras) (with RelEng)
  • Decreased operating costs for physical Android test infrastructure by reducing hardware by 30%
  • Decreased MTTR by developing a simplified RelEng self-serve reimaging process for each supported build and test hardware platform
  • Increased security for all RelEng infrastructure
  • Increased stability and reliability by consolidating single-point-of-failure RelEng web tools onto a highly available cluster
  • Increased network reliability by developing a tool for continuous validation of firewall flows
  • Increased developer productivity by updating Windows platform developer tools
  • Increased fault and anomaly detection by auditing and augmenting RelEng monitoring and metrics gathering
  • Simplified the build/test architecture by creating a unified RelEng API service for new tools
  • Developed a disaster recovery and business continuity plan for 2015 (with RelEng)
  • Researched bare-metal private cloud deployment and produced a POC

Developer Services

  • Shipped MozReview, a new review architecture integrated with Bugzilla (with the A-Team)
  • Massive improvements in hg stability and performance
  • Analytics and dashboards for version control systems
  • New architecture for the try server, making it stable and fast
  • Deployed Treeherder (the TBPL replacement) to production
  • Assisted the A-Team with Bugzilla performance improvements

I’d like to thank the team for their hard work. You are amazing, and I look forward to working with you next year.

At the start of 2015, I’ll share our vision for the coming year. Watch this space!

Try server performance

In Q4, Greg Szorc and the rest of the Developer Services team have been working on a headless try implementation to solve try performance issues: every try push creates a new head in the repository, and many hg operations slow down as heads accumulate. In addition, he has made a number of performance improvements to hg, independent of the headless implementation.

I think these graphs speak for themselves, so I’m just going to leave them here.

Try queue depth (people waiting on an in-flight push to complete before their push can go through):

Try push median wait times:

(Thanks, too, to Hal Wine for setting up a bunch of analytics so we can see the effects of the work represented in such shiny graphs.)