Hackbox

Scalable Backend Performance: Microsoft's Hackathon App Handles 60k Requests Per Hour
Background

When Microsoft needed a reliable, scalable internal web application built quickly, they turned to Formidable. The app, called Hackbox, is a web application for easily running hackathons within Microsoft. Hackbox is used to manage the annual company-wide hackathon, //oneweek, and also supports custom hackathons both big and small at any time.


Challenge
  • Build a reliable hackathon app that handles intense periods of traffic
  • Unexpected executive attention drew more traffic than anticipated
Solution
  • An Express front-end, running as an Azure App Service
  • A Node.js/hapi.js API layer, running as an Azure App Service
  • MySQL running on a Bitnami-provided Ubuntu Azure VM
  • Azure load balancing for the API layer
Results
  • 6.5 million API requests in a 2.5 week period
  • Up to 60,000 requests per hour
  • 99.9% API availability
  • API requests had a TP90 of <300ms
Benefits
  • Rates of at least 300k reqs/hour (5k reqs/min) can be handled gracefully
Challenge

When Formidable joined the project, there was a prototype as a starting point. However, the prototype would not handle the scale of a Microsoft-wide event, so Formidable took on the backend to ensure minimal load times in periods of extremely high usage.

Shortly before launch, Hackbox caught the attention of Microsoft CEO Satya Nadella, which meant that traffic would be higher than anticipated. We needed the best reliability we could manage on short notice.

Solution

Specs

Hackbox is built entirely in Azure for flexibility and scalability. It consists of:

  • An Express front-end, running as an Azure App Service
  • A Node.js/hapi.js API layer, running as an Azure App Service
  • MySQL running on a Bitnami-provided Ubuntu Azure VM

API Infrastructure

During development, the API lived on a single small Azure instance. For peak traffic, the API layer was fluidly scaled to 5 Azure S3 instances (Standard - Large; 4 cores, 7GB RAM, 50GB storage). The API layer uses Azure load balancing to distribute traffic, and can be scaled to more or fewer instances using the Scale Out feature, as well as moved to larger or smaller instances using the Scale Up feature. To minimize variables, we disabled automatic Scale Out and instead left 5 instances running at all times. Each instance is configured to use a maximum of 100 database connections.

The MySQL VM is a single DS12_V2 Standard Azure VM (28GB RAM, 200GB local SSD). MySQL is configured to allow 505 database connections: 100 per instance, and 5 for direct access for admin or debugging work.
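
The connection budget described above is simple arithmetic, and a minimal sketch makes the dependency on instance count explicit. The function and constant names below are illustrative assumptions, not taken from the Hackbox codebase; only the numbers come from the text.

```python
# Sketch: derive the MySQL connection limit from the size of the API fleet.
# All names here are hypothetical; the numbers are the ones quoted in the text.
API_INSTANCES = 5               # instances left running at all times
POOL_LIMIT_PER_INSTANCE = 100   # maximum DB connections per API instance
ADMIN_RESERVE = 5               # direct connections for admin or debugging work


def required_max_connections(instances: int, pool_limit: int, admin_reserve: int) -> int:
    """Connections MySQL must allow so that every instance can fill its pool."""
    return instances * pool_limit + admin_reserve


max_connections = required_max_connections(
    API_INSTANCES, POOL_LIMIT_PER_INSTANCE, ADMIN_RESERVE
)
print(max_connections)  # 505
```

Under the same assumptions, a sixth instance would need the limit raised to 605 or the per-instance pools reduced, which is one reason a fixed instance count keeps the database configuration predictable.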

Results

Traffic Served

Over the course of the peak period, the API saw the following traffic served:

  • Unique Users: 47,912
  • User Sessions: 201,490
  • API Requests* (24 July 1800 - 30 July 1700): 3.4m
  • API Requests* (24 July 1800 - 12 August 1700): 6.59m

* An API request is logged whenever the application queries the API, i.e. whenever data is required for a new page, or a user takes an action, such as performing a search or editing data.

Traffic was not evenly distributed during this time period. At different times of day, traffic ranged from fewer than 1,000 reqs/hour to more than 60,000 reqs/hour, with per-minute peaks at times exceeding 1.2k reqs/minute.

The graph below of July 25, 2016 traffic illustrates daily traffic patterns.

[Figure: daily traffic patterns]

Errors & Performance

Of the 6.59 million requests served over the period in question, approximately 99.1% returned successfully (HTTP 2**). Of the requests that generated an error code, approximately 90% were HTTP 4** (i.e. requests that the server denied for being unauthorized or incorrect). The remaining errors were HTTP 503, indicating a problem with the API layer, for a final API reliability of 99.9%.
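
The step from a 99.1% success rate to 99.9% reliability can be checked with a short calculation. This is a sketch that uses only the approximate percentages quoted above, so the derived request count is a rough estimate and not a measured figure; the variable names are illustrative.

```python
# Sketch: reproduce the reliability figure from the quoted (approximate) rates.
TOTAL_REQUESTS = 6_590_000   # requests served over the period
SUCCESS_RATE = 0.991         # share of requests returning HTTP 2xx
CLIENT_ERROR_SHARE = 0.90    # share of error responses that were HTTP 4xx

error_rate = 1 - SUCCESS_RATE                         # about 0.9% of all requests
client_error_rate = error_rate * CLIENT_ERROR_SHARE   # about 0.81%, not an API fault
server_error_rate = error_rate - client_error_rate    # about 0.09%, HTTP 503
reliability = 1 - server_error_rate                   # only 503s count against the API

# Rough estimate only: the inputs are rounded percentages.
print(f"estimated 503 responses: {server_error_rate * TOTAL_REQUESTS:,.0f}")
print(f"reliability: {reliability:.2%}")
```

Counting only server-side failures against the API is the design choice here: a 4** response means the API correctly rejected a bad or unauthorized request.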

Response times to API requests had a TP90 of <300ms, and a TP99 of <1s, indicating that the vast majority of users enjoyed a fast and responsive experience on the Hackbox webapp.
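
For readers unfamiliar with the notation, TP90 is the response time that 90% of requests met or beat. The sketch below uses the nearest-rank method, which is an assumption, since the text does not say how the monitoring tool computed its percentiles. The sample latencies are invented for illustration and are not Hackbox measurements.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that at least pct% of samples do not exceed."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, with one slow outlier.
latencies = [120, 95, 180, 210, 150, 3000, 130, 170, 250, 140]

tp90 = percentile(latencies, 90)
tp99 = percentile(latencies, 99)
mean = sum(latencies) / len(latencies)
print(tp90, tp99, mean)
```

In this made-up sample the single slow request pulls the mean well above TP90, which is why percentile figures tend to describe the typical user experience better than averages do.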

Further digging indicates that site performance was even better for normal users than these data would suggest. A particular script used by administrators of the site was written in such a way that it queried many pages of results simultaneously, before MySQL had returned its initial pagination query, forcing the database to run many full-table-traversal queries in parallel. An example of this script’s impact on site performance is below:

[Figure: Hackbox admin script]

On the top graph, we see a moderately busy hour of traffic. The bottom graph shows average response time over the same period. At approximately 9:09am, the average response time jumps to well over 10s, with no corresponding increase in traffic. (In fact, traffic in the 9:08 - 9:10 interval is below the hourly average.) This peak in response time is directly attributable to the admin script being run.
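
The text does not say whether or how the admin script was changed. One gentler access pattern, sketched here under that caveat, is to wait for the first page before requesting the rest, and then to walk the remaining pages one at a time. `fetch_page`, `ROWS`, and `PAGE_SIZE` are hypothetical stand-ins for the real API call and data.

```python
from typing import Iterator

# Stand-in data set; in practice this would be a table behind the API.
ROWS = [f"project-{i}" for i in range(1, 26)]
PAGE_SIZE = 10


def fetch_page(page: int) -> dict:
    """Hypothetical API call: returns one page of rows plus the total page count."""
    start = (page - 1) * PAGE_SIZE
    total_pages = -(-len(ROWS) // PAGE_SIZE)  # ceiling division
    return {"rows": ROWS[start:start + PAGE_SIZE], "total_pages": total_pages}


def iter_all_rows() -> Iterator[str]:
    """Fetch page 1 first, then request the remaining pages sequentially."""
    first = fetch_page(1)  # wait for the initial pagination query to return
    yield from first["rows"]
    for page in range(2, first["total_pages"] + 1):
        # Only one query is in flight at any moment.
        yield from fetch_page(page)["rows"]


all_rows = list(iter_all_rows())
print(len(all_rows))
```

Sequential fetching takes longer for the script itself, but it avoids stacking several expensive queries on the database at once.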

Benefits

Capacity vs. Traffic

Viewing statistics from the busiest hour during the peak period (~64k requests), we find that key metrics suggest the current setup has ample headroom for additional traffic.

  • API CPU usage: 8%
  • API TP90: <300ms
  • API Memory Usage: 7-12%, 11% average
  • Average DB connections: 83/500 (16%)
  • Database IOPS: 7% of max
  • Database CPU usage: 9%

Conservatively, we estimate that peak loads of five times the event peak traffic can be handled with no or minimal change to the current setup. In other words, we expect that rates of at least 300k reqs/hour (5k reqs/min) can be handled gracefully.
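
The text does not explain how the five-times estimate was derived. One rough sanity check, sketched below, is to find the most heavily used resource in the list above and ask how far traffic could grow before it saturates. This assumes utilization scales linearly with request rate, which is a simplification; the names are illustrative.

```python
# Peak-hour utilization as a fraction of capacity (figures from the list above).
PEAK_UTILIZATION = {
    "API CPU": 0.08,
    "API memory (high end)": 0.12,
    "DB connections": 83 / 500,
    "Database IOPS": 0.07,
    "Database CPU": 0.09,
}
PEAK_REQS_PER_HOUR = 64_000  # busiest hour, approximate


def headroom_factor(utilization: dict[str, float]) -> tuple[str, float]:
    """Return the tightest resource and the growth multiple it allows,
    ASSUMING utilization grows linearly with request rate."""
    bottleneck = max(utilization, key=utilization.get)
    return bottleneck, 1 / utilization[bottleneck]


bottleneck, factor = headroom_factor(PEAK_UTILIZATION)
print(f"tightest resource: {bottleneck}, roughly {factor:.1f}x headroom")
print(f"5x peak traffic: {5 * PEAK_REQS_PER_HOUR:,} reqs/hour")
print(f"300k reqs/hour is {300_000 // 60:,} reqs/min")
```

Under that linear assumption the tightest resource is the database connection count at roughly six times peak, which is consistent with a conservative estimate of five times.
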
Conclusion

Formidable has ensured that Microsoft hackathon participants have a reliable experience using Hackbox, even under heavy load conditions. Microsoft can be confident that Hackbox will scale with its hackathons.