How I Nearly Ruined Our Launch

I’m a co-founder of a startup called MinoMonsters; we make an iOS game of the same name.  We went through YCombinator in the Winter 2011 batch.  On December 18th, the eve of our big launch, I nearly fucked everything up.  This is the story of what happened.

MinoMonsters is a monster battling game.  Think of it as Pokémon for the iPhone.  As is the case with many apps, our game is internet connected.  During the course of play, the app talks to a backend server.  The server diligently performs tasks like finding your friends, letting you purchase in-game items, and letting you play with others.  When planning the server, we needed a solution that was quick to build, fast, and light over the wire.  We chose to build it in Ruby with EventMachine and Synchrony, speaking protocol buffers over TCP.
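
For a rough sense of what that looks like, here is a minimal sketch of such a handler, not our actual code.  The 4-byte length-prefix framing is an assumption, and BattleRequest / process are hypothetical names standing in for whatever your protobuf schema and game logic provide:

require 'eventmachine'

# Sketch only: a raw TCP handler that re-assembles length-prefixed
# protocol buffer messages out of the byte stream.
class SketchHandler < EM::Connection
  def post_init
    @buffer = "".force_encoding("BINARY")
  end

  def receive_data(data)
    @buffer << data
    # Each message is framed with a 4-byte big-endian length prefix.
    while @buffer.bytesize >= 4
      length = @buffer[0, 4].unpack("N").first
      break if @buffer.bytesize < 4 + length
      frame = @buffer.slice!(0, 4 + length)
      handle_message(frame[4, length])
    end
  end

  def handle_message(bytes)
    # request  = BattleRequest.parse(bytes)               # hypothetical protobuf class
    # response = process(request).serialize_to_string     # hypothetical game logic
    # send_data([response.bytesize].pack("N") + response)
  end
end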

Since our soft launch on December 6th, the server had been chugging away without fault.  But then problems began to emerge.  It all started with intermittent crashes.  I would wake up to find that the EventMachine process had simply stopped running.  The approach to understanding a problem like this is two-fold: recreate the crash to understand the “how” and capture a log to understand the “why”.  Since the crashes were intermittent, I had no way of predictably crashing the server.  So instead I started with logging.

First, by adding logging to my init script:

#!/usr/bin/env /home/ubuntu/.rvm/bin/ruby-1.9.3-p0
require 'daemons'

# :backtrace => true tells the daemons gem to write a backtrace to its log
# when the daemonized process exits due to an uncaught exception.
Daemons.run_proc(..., :backtrace => true) do
  ...
end

Next, by logging errors in EventMachine’s error_handler:

EM.synchrony do
  ...
  # error_handler receives exceptions raised inside reactor callbacks,
  # so they get logged instead of propagating out of the event loop
  EM.error_handler do |e|
    MinoServer.log_error "Error raised during event loop: #{e.message}"
    MinoServer.log_error(e.backtrace.join "\n")
  end
  ...
end

Between these two, I figured I was covered.  The next step was to wait and watch it crash again.  As I monitored the server, I started to see a variety of errors across my application:

#Errno::EMFILE: Too many open files - content/627/inventory.plist
...
Error in .../new_relic/data_serialization.rb:27:in `read_and_write_to_file'
...
#LoadError: no such file to load -- strscan
...
HTTP request to https://graph.facebook.com/me returned HTTP 0, ""

I hadn’t found the source of the crashes, but I had found some issues.  My first gut reaction was that I must be running out of file descriptors.

$ ulimit -n
1024

Bingo.  Let’s fix that before doing anything else.

$ sudo sh -c '(echo "ubuntu soft nofile 40000" && echo "ubuntu hard nofile 40000") > /etc/security/limits.conf'
$ ulimit -n
40000
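
Changes to limits.conf only take effect for new login sessions, so it’s also worth confirming what the server process itself sees rather than trusting a fresh shell.  A quick sanity check from inside Ruby (the /proc path assumes Linux):

# Soft/hard file descriptor limits as seen by this process.
soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "fd limit: soft=#{soft} hard=#{hard}"

# Number of descriptors currently open (Linux only).
puts "fds in use: #{Dir.glob("/proc/#{Process.pid}/fd/*").size}"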

Then I methodically went through each error and devised a fix.  None of these fixes would solve the underlying problem, but each one narrowed down the options.  Over two weeks, I casually “fixed” more crashes as they cropped up, and eventually they stopped.

Then the day came.  We had planned to set the price of our game to free.  As soon as we did, users started flooding in.  And then the problems started anew.

As before, the server was crashing.  I couldn’t recreate the crash, but I didn’t need to.  At first, it was crashing every hour.  Then it was every 30 minutes.  Soon enough, the server crashed within seconds of starting up.  Logs became increasingly useless, as the places in code that threw exceptions were random at best.  Something was falling apart and I couldn’t figure out what it was.  Without a backend, we couldn’t verify virtual goods purchases.  And without virtual goods purchases, we would make no money off the crazy initial traction we received.  This was bad news.

I spent four hours trying everything I knew to solve the problem.  I scoured the web for obscure bugs.  I completely ripped out Synchrony.  I triple-checked OS resources like file descriptors and disk space.  I was out of ideas.  That’s when I sent out a distress beacon on Hacker News.

Within minutes I had dozens of people helping me get connected with the right Rubyists.  In the end, two people helped me out:  Ryan Stout and Aman Gupta.  With Ryan, we immediately started with generic diagnosis steps.  What does the server do?  What systems does it touch?  We systematically went through each touchpoint and observed its potential contribution to the problem.  This helped us eliminate external factors (OS, Ruby, libraries) as the culprit.  Then we got Aman involved.  As one of the maintainers of EventMachine, he was able to quickly start looking at my configuration.  Within minutes, he noticed something odd in my setup code:

EM.synchrony do
  EM.epoll
  EM.start_server "0.0.0.0", 8090, MinoServer
  EM.error_handler do |e|
    MinoServer.log_error "Error raised during event loop: #{e.message}"
    MinoServer.log_error(e.backtrace.join "\n")
  end
  
  MinoServer.initialize_server
end

This code was taken near-verbatim from an Ilya Grigorik post on EventMachine from 2008.  The code from that post originally looked like this:

EventMachine::run {
  EventMachine.epoll
  EventMachine::start_server("0.0.0.0", 8080, Handler)
  puts "Listening..."
}

Aman saw the issue right away.  EventMachine.epoll is a configuration method that must be called before the run loop.  From the documentation:

Call the method EventMachine#epoll anytime before you call EventMachine#run, and your program will automatically use epoll, if available

It turns out that this basic setup code, copied from a 3-year-old blog post, was implemented incorrectly.  If I had been more careful about reading the documentation, I might have caught the issue myself.  The fix is easy:

EM.epoll  # epoll must be selected before the reactor starts
EM.synchrony do
  EM.start_server "0.0.0.0", 8090, MinoServer
  EM.error_handler do |e|
    MinoServer.log_error "Error raised during event loop: #{e.message}"
    MinoServer.log_error(e.backtrace.join "\n")
  end
  
  MinoServer.initialize_server
end

I made that change, deployed, and crossed my fingers.  Within minutes we knew that it worked.  Users were connecting en masse.  Thousands of monster battles started to ensue.  Virtual goods transactions began flowing.  All was right with the universe once more.

Let’s look at what happened:

  • EventMachine uses select by default for network I/O.  This is a sane default.
  • select’s performance degrades as the number of persistent connections grows, and it is limited to 1024 file descriptors.
  • I was aware of both EventMachine’s default behavior and the limitations of select.
  • Before the game launched, I looked at the code to make sure I was making the configuration call to switch to epoll.  It was there.
  • I did not confirm that the configuration call was working.  I never inspected the running state of the app to double-check that it was using epoll instead of select (one way to do that is sketched below).
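
That last check isn’t hard to do.  On Linux, an epoll instance shows up as an anon_inode:[eventpoll] entry among a process’s open file descriptors, so a rough runtime check (Linux-only, reading /proc) could look like this:

# Run after the reactor has started.  If no epoll descriptor exists,
# the reactor has fallen back to select.
fds = Dir.glob("/proc/#{Process.pid}/fd/*")
using_epoll = fds.any? { |fd| (File.readlink(fd) rescue "").include?("eventpoll") }
warn "reactor is NOT using epoll" unless using_epoll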

Ultimately, I failed to ensure that a critical scaling component was activated before going live.  Since the fix, the rest of the system has run smoothly, handling thousands of concurrent users and millions of monster battles.

Huge thanks to Aman Gupta, Ryan Stout, and the supportive Hacker News community for helping me come to a solution quickly.  With you all, I was able to take a multi-day problem and get it fixed in a matter of hours.  Thanks.

You can follow me on Twitter here
