2014: Engineering Operations Year in Review

On the first day of Mozlandia, Johnny Stenback and Doug Turner presented a list of key accomplishments in Platform Engineering/Engineering Operations in 2014.

I have been told a few times recently that people don’t know what my teams do, so in the interest of addressing that, I thought I’d share our part of the list. It was a pretty damn good year for us, all things considered, and especially given the level of organizational churn and other distractions.

We had a bit of organizational churn ourselves. I started the year managing Web Engineering, and between March and September ended up also managing the Release Engineering teams, Release Operations, SUMO and Input Development, and Developer Services. It’s been a challenging but very productive year.

Here’s the list of what we got done.

Web Engineering

  • Migrate crash-stats storage off HBase and into S3
  • Launch Crash-stats “hacker” API (access to search, raw data, reports)
  • Ship fully-localized Firefox Health Report on Android
  • Many new crash-stats reports including GC-related crashes, JS crashes, graphics adapter summary, and modern correlation reports
  • Crash-stats reporting for B2G
  • Pluggable processing architecture for crash-stats, and alternate crash classifiers
  • Symbol upload system for partners
  • Migrate l10n.mozilla.org to modern, flexible backend
  • Prototype services for checking health of the browser and a support API
  • Solve scaling problems in MozTrap to reduce pain for QA
  • New admin UI for Balrog (new update server)
  • Bouncer: correctness testing, continuous integration, a staging environment, and multi-homing for high availability
  • Grew Air Mozilla community contributions from 0 to 6 non-staff committers
  • Many new features for Air Mozilla including: direct download for offline viewing of public events, tear-out video player, WebRTC self-publishing prototype, Roku channel, multi-rate HLS streams for automatic switching to the optimal bitrate, search over transcripts, integration with Mozilla Popcorn functionality, and access control based on Mozillians groups (e.g. “nda”)

DXR

  • Modeless, explorable UI with all-new JS
  • Case-insensitive searching
  • Proof-of-concept Rust analysis
  • Improved C++ analysis, with lots of new search types
  • Multi-tree support
  • Multi-line selection (linkable!)
  • HTTP API for search
  • Line-based searching
  • Multi-language support (Python already implemented, Rust and JS in progress)
  • Elasticsearch backend, bringing speed and features
  • Completely new plugin API, enabling binary file support and request-time analysis

SUMO

  • Offline SUMO app in Marketplace
  • SUMO Community Hub
  • Improved SUMO search with Synonyms
  • Instant search for SUMO
  • Redesigned and improved SUMO support forums
  • Improved support for more products in SUMO (Thunderbird, Webmaker, Open Badges, etc.)
  • BuddyUP app (live support for Firefox OS) (in progress, TBC Q1 2015)

Input

  • “Dashboards for everyone” infrastructure, allowing anyone to build charts and dashboards using Input data
  • Backend for heartbeat v1 and v2
  • Overhauled the feedback form to support multiple products, streamline user experience and prepare for future changes
  • Support for Loop/Hello, Firefox Developer Edition, Firefox 64-bit for Windows
  • Infrastructure for automated machine and human translations
  • Massive infrastructure overhaul to improve overall quality

Release Engineering

  • Cut AWS costs by over 70% during 2014 by switching builds to spot instances and using intelligent bidding algorithms
  • Migrated all hardware out of SCL1 and closed the datacenter to save $1 million per year (with Relops)
  • Optimized network transfers for build/test automation between datacenters, decreasing bandwidth usage by 50%
  • Halved build time on b2g-inbound
  • Parallelized verification steps in release automation, cutting over an hour from the end-to-end time required for each release
  • Decommissioned legacy systems (e.g. tegras, tinderbox) (with Relops)
  • Enabled build slave reboots via API
  • Self-serve arbitrary builds via API
  • b2g FOTA updates
  • Builds for open H.264
  • Built flexible new update service (Balrog) to replace legacy system (will ship first week of January)
  • Support for Windows 64 as a first class platform
  • Supported FX10 builds and releases
  • Release support for switch to Yahoo! search
  • Update server support for OpenH264 plugins and Adobe’s CDM
  • Implement signing of EME sandbox
  • Per-checkin and nightly Flame builds
  • Moved desktop Firefox builds to mach+mozharness, improving reproducibility and hackability for devs
  • Helped mobile team ship different APKs targeted by device capabilities rather than a single, monolithic APK

Release Operations

  • Decreased operating costs by $1 million per year by consolidating infrastructure from one datacenter into another (with Releng)
  • Decreased operating costs and improved reliability by decommissioning legacy systems (kvm, redis, r3 mac minis, tegras) (with Releng)
  • Decreased operating costs for physical Android test infrastructure through a 30% reduction in hardware
  • Decreased MTTR by developing a simplified releng self-serve reimaging process for each supported build and test hardware platform
  • Increased security for all releng infrastructure
  • Increased stability and reliability by consolidating single-point-of-failure releng web tools onto a highly available cluster
  • Increased network reliability by developing a tool for continuous validation of firewall flows
  • Increased developer productivity by updating Windows platform developer tools
  • Increased fault and anomaly detection by auditing and augmenting releng monitoring and metrics gathering
  • Simplified the build/test architecture by creating a unified releng API service for new tools
  • Developed a disaster recovery and business continuity plan for 2015 (with Releng)
  • Researched bare-metal private cloud deployment and produced a POC

Developer Services

  • Ship MozReview, a new review architecture integrated with Bugzilla (with A-team)
  • Massive improvements in hg stability and performance
  • Analytics and dashboards for version control systems
  • New architecture for try to make it stable and fast
  • Deployed Treeherder (TBPL replacement) to production
  • Assisted A-team with Bugzilla performance improvements

I’d like to thank the team for their hard work. You are amazing, and I look forward to working with you next year.

At the start of 2015, I’ll share our vision for the coming year. Watch this space!
