Tech Talk: Scaling Instagram with Mike Krieger

From the YouTube video Scaling Instagram with Mike Krieger

  • Founded 2010. 2012: acquired, 4 engineers, 30MM MAU. 2015: 95 engineers, 300MM MAU.
  • DO THE SIMPLE THING FIRST! Don’t build things you don’t actually need right now.
  • Use “boring” technology that’s operationally quiet.
  • Tech stack:
    • Originally: nginx, redis, memcached, postgres, gearman, django.
    • Currently: nginx, cassandra, memcached, postgres, rabbitmq, django. Also unicorn, proxygen, thrift, scribe (Facebook tech)
  • Async Tasks (site scale)
    • Gearman (async task broker). Started with a single host. Chose Gearman because it was easy to set up. Lasted for 1.5 years.
    • Scaled to 8 Gearman brokers and 400 app servers. Web requests became slow. No failover. Couldn’t enable persistence because of crashes.
    • Do the simple thing next: chose celery + rabbitmq as a replacement for Gearman. Mean response time dropped from 60 ms to 10 ms.
  • Code Deployment (team scale)
    • Initially: git pull + fabric (remote scripting tool)
    • Fabric parallel mode came out, helped when number of machines grew to 10+
    • Then they wrote a ‘rollout’ command to upload tarball to S3, pull it down on each machine, and restart the service. Useful for 1.5 years.
    • Then wrote Sauron: could lock a resource and deploy. Helped coordinate deploys. Lasted 1.5 years.
    • But: a lot of cargo-cult knowledge had built up around how to deploy.
    • So: encoded that knowledge in the Sauron scripts themselves.
    • Next problem: people waiting on locks. So they extended Sauron for Jenkins integration.
    • LESSON: take a human procedure and at each stage, figure out what to automate. Also, do not automate things you don’t need yet!
  • Search (product scale)
    • MySQL isn’t good for regex / wildcard searching
    • V2: used Solr (Lucene-based)
    • V3: used ElasticSearch. Easy to set up, easier to scale out. But: ops problems surfaced later.
    • V4: moved to Unicorn (Facebook graph DB tech). They were then able to tweak their search algo to show the results people wanted.
    • Kept iterating on search algo logic to improve results.
  • Lessons:
    • Do the simple thing first, until your scale / team / product changes
    • Then do the simple thing next
    • Ground your evolution in problem-solving
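
To make the "lock a resource and deploy" idea from the Code Deployment notes concrete, here is a minimal sketch in Python. It is not how Sauron was built (the talk notes don't describe its internals). It assumes a lock file on a shared filesystem, and the names deploy_lock, LockHeld, and deploy are hypothetical, chosen only for illustration. The point is the "simple thing first" shape: one atomic lock, then the deploy steps run in order while it is held.

```python
import os
import tempfile
from contextlib import contextmanager


class LockHeld(Exception):
    """Raised when someone else already holds the deploy lock (hypothetical name)."""


@contextmanager
def deploy_lock(resource, owner, lock_dir):
    """Hold an exclusive lock on `resource` for the duration of the with-block.

    Assumption: lock_dir is visible to everyone who deploys.
    """
    path = os.path.join(lock_dir, resource + ".lock")
    try:
        # O_CREAT | O_EXCL fails atomically if the lock file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        try:
            with open(path) as f:
                holder = f.read().strip() or "unknown"
        except OSError:
            holder = "unknown"
        raise LockHeld(f"{resource} is locked by {holder}")
    try:
        # Record who holds the lock so a blocked deployer knows whom to ask.
        with os.fdopen(fd, "w") as f:
            f.write(owner)
        yield path
    finally:
        # Release the lock whether the deploy succeeded or failed.
        os.remove(path)


def deploy(resource, owner, steps, lock_dir):
    """Run each deploy step in order while holding the lock; return their results."""
    results = []
    with deploy_lock(resource, owner, lock_dir):
        for step in steps:
            results.append(step())
    return results


if __name__ == "__main__":
    lock_dir = tempfile.mkdtemp()
    # Placeholder steps standing in for "upload tarball", "pull", "restart".
    steps = [lambda: "uploaded", lambda: "pulled", lambda: "restarted"]
    print(deploy("web", "alice", steps, lock_dir))
```

A second deployer who calls deploy() for the same resource while the lock is held gets a LockHeld error naming the current holder, instead of silently deploying on top of someone else. That is also where the "people waiting on locks" problem from the notes shows up: a lock this simple has no queueing, which is the kind of gap you only fill once it actually hurts.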