Tech Talk: GOTO 2014 - Scaling Pinterest - Marty Weiner

From the YouTube video: GOTO 2014 • Scaling Pinterest • Marty Weiner
Slides are available on SlideShare

  • Evolution:
    • Started March 2010. Rackspace. 1 web engine, 1 MySQL DB, 1 engineer + 2 founders
    • Jan 2011: AWS, 1 nginx + 4 app servers, 1 MySQL + 1 read slave, 1 task queue + 2 workers (emails, etc.), 1 MongoDB (counters), 2 engineers + 2 founders
    • Sept 2011: 2x growth every 45 days! More EC2, sharded DBs, 4 Cassandra nodes, 15 Membase nodes, 8 Memcache, 10 Redis, 3 task routers + 4 workers, 4 Elasticsearch, 3 MongoDB, 3 engineers, 8 total employees
    • LESSONS:
      • Everything will fail so keep it simple!
      • If you’re the biggest user of a tech, the challenges are greatly amplified (because you end up fixing the tech yourself)!
    • April 2012: decided to rebuild everything and simplify the tech stack. DBs were a pain.
      • Lots of EC2: 135 web engines, 75 API engines, 80 MySQL DBs + 1 slave each, 110 Redis, 60 Memcache. 12 engineers, 10 non-eng. The simplified stack could now scale out.
  • Next problem: scaling people
    • April 2013: split up into individual teams. Data pipeline, search, biz and platform, spam, growth, infra + ops.
    • Then moved to San Francisco: many more people and more offices made communication harder.
  • Technologies:
    • ELB => software router => python web layer
    • Every user-generated image lives on S3, fronted by a CDN like Akamai
    • Use ZooKeeper to pair a web server (or whatever) with a backend service like search (see the discovery sketch after this list)
    • Each DB tech (MySQL, Memcache, Redis, HBase) is fronted by a service
    • Data Pipeline: everything goes into Kafka; consumers read from Kafka and process the data (producer sketch after the Architecture bullets)
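
A minimal sketch of the ZooKeeper pairing described above, using the kazoo client; the /discovery/search path, ports, and payloads are hypothetical, not Pinterest's actual layout:

```python
# Minimal sketch of ZooKeeper-based pairing, using the kazoo client.
# The /discovery/search path and the host:port payload are hypothetical.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# A search node registers itself as an ephemeral node, so it drops
# out of the registry automatically if the process dies.
zk.ensure_path("/discovery/search")
zk.create("/discovery/search/node-", b"10.0.0.12:9200",
          ephemeral=True, sequence=True)

# A web server watches the registry and refreshes its backend list
# whenever search nodes come or go.
@zk.ChildrenWatch("/discovery/search")
def update_backends(children):
    backends = [zk.get("/discovery/search/" + c)[0].decode()
                for c in children]
    print("search backends:", backends)
```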
  • Architecture
    • Pinterest Architecture (diagram in the slides)
    • Pinterest Data Pipeline (diagram in the slides)
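
A minimal producer sketch for the "everything goes into Kafka" pattern, using kafka-python; broker addresses, the topic name, and the event shape are assumptions:

```python
# Producer side of "everything goes into Kafka", using kafka-python.
# Broker addresses, the topic name, and the event shape are assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each request/event is published once; independent consumers (S3
# archival, MapReduce jobs, real-time processors) read it back out.
producer.send("events", {"type": "pin_created", "user_id": 42,
                         "pin_id": 1234567})
producer.flush()
```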
  • Choosing your tech
    • Does it meet your needs?
    • How mature is it? (Maturity = blood and sweat / complexity)
    • Is it commonly used? Can you hire people that know it?
    • Is the community active?
    • How robust is it to failure?
    • How well does it scale? Will you be the biggest user?
    • Does it have good debugging tools? Profiler? Backup software?
    • Is the cost justified?
    • Is it simple?
  • Why AWS?
    • Variety of servers running Linux
    • Very good peripherals: load balancing, DNS, MapReduce, security, etc.
    • Reliable
    • Active dev community
    • Not cheap, but new instances ready in seconds
    • Route 53 for DNS, ELB as 1st-tier LB, EC2 Ubuntu Linux, S3 for images and logs
  • Why Python?
    • Mature, well known and liked, solid community, good libraries, rapid prototyping, open source.
    • Some Java and Go for anything CPU-heavy: faster, with lower variance in response times.
  • Why MySQL and Memcached?
    • Very mature, well known and liked, rarely fails, response time increases linearly with load, good support, solid community, open source
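
The MySQL + memcached combination is typically used in a cache-aside pattern; a sketch of that pattern, with a hypothetical users table and key format (not Pinterest's schema):

```python
# Cache-aside sketch: try memcached first, fall back to MySQL, then
# populate the cache. The users table and key format are assumptions.
import json
import pymysql
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
db = pymysql.connect(host="localhost", user="app", password="secret",
                     database="app", cursorclass=pymysql.cursors.DictCursor)

def get_user(user_id, ttl=300):
    key = "user:%d" % user_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit
    with db.cursor() as cur:                 # cache miss: read MySQL
        cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()
    if row is not None:
        cache.set(key, json.dumps(row), expire=ttl)
    return row
```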
  • Why Redis?
    • Well known, good community, good performance, variety of data structures, persistence, open source.
    • Used for follower data, configurations, public feed pin IDs, caching of mappings (follower sketch below)
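
A sketch of the follower-data use case with Redis sets; key names are hypothetical:

```python
# Follower data as Redis sets; key names are hypothetical.
import redis

r = redis.Redis(host="localhost", port=6379)

def follow(follower_id, followee_id):
    # Two sets per user keep both directions of the edge cheap to read.
    r.sadd("followers:%d" % followee_id, follower_id)
    r.sadd("following:%d" % follower_id, followee_id)

def followers(user_id):
    return {int(m) for m in r.smembers("followers:%d" % user_id)}

follow(1, 2)
print(followers(2))   # -> {1}
```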
  • Why HBase?
    • Small but growing community. Chosen because it’s extremely fast non-volatile storage. Works well. Open source. Scalable.
    • BUT: Hard to hire for.
    • What happened to Cassandra, Mongo, ES, Membase?
      • Didn’t pass the set of questions asked under ‘choosing your tech’
      • Seems like they weren’t mature enough, were buggy, or were hard to operate
  • What would I have done differently?
    • Logging on day 1 (StatsD, Kafka, MapReduce). Log every request, event, signup. Basic analytics. Recovery from corruption / failure. Kafka -> S3 -> MapReduce.
    • Alerting on day 1
    • Shard much earlier. Read slaves are a time bomb (ID-sharding sketch after this list).
    • Don’t rely on NoSQL in the early days.
    • Pyres for background tasks on day 1
    • Hire technical ops eng earlier
    • Chef / Puppet earlier
    • Unit test earlier (Jenkins for builds)
    • A/B testing earlier: decider on top of ZooKeeper; progressive rollout; kill switch (decider sketch below)
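
A sketch of "shard much earlier" in the ID-based style Pinterest has described publicly, packing the shard ID into every object ID; treat the exact bit widths as illustrative:

```python
# ID-based sharding sketch: pack the shard ID into every object ID so
# lookups route straight to the right MySQL host with no mapping table.
# The (shard | type | local) layout follows what Pinterest has described
# publicly; treat the exact bit widths as illustrative.
SHARD_BITS, TYPE_BITS, LOCAL_BITS = 16, 10, 36

def make_id(shard_id, type_id, local_id):
    return (shard_id << (TYPE_BITS + LOCAL_BITS)) \
         | (type_id << LOCAL_BITS) | local_id

def parse_id(obj_id):
    local_id = obj_id & ((1 << LOCAL_BITS) - 1)
    type_id = (obj_id >> LOCAL_BITS) & ((1 << TYPE_BITS) - 1)
    shard_id = obj_id >> (TYPE_BITS + LOCAL_BITS)
    return shard_id, type_id, local_id

pin_id = make_id(shard_id=4097, type_id=1, local_id=7)
print(parse_id(pin_id))   # -> (4097, 1, 7)
```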
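And a sketch of the ZooKeeper-backed decider with progressive rollout and a kill switch; the znode layout and percentage semantics are assumptions about the pattern, not Pinterest's implementation:

```python
# Decider sketch: a per-feature rollout percentage stored in ZooKeeper.
# The znode layout and bucketing are assumptions about the pattern,
# not Pinterest's implementation.
import zlib
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181")
zk.start()

def decide(feature, user_id):
    # The feature's znode holds a rollout percentage 0..100:
    # 0 is a kill switch, 100 is fully launched.
    data, _ = zk.get("/decider/%s" % feature)
    percent = int(data)
    bucket = zlib.crc32(("%s:%d" % (feature, user_id)).encode()) % 100
    return bucket < percent

if decide("new_feed", user_id=42):
    pass  # serve the experimental code path
```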
  • Looking Forward
    • More than 400 people
    • Continually improve the pinner experience and collaboration
  • Have fun, build a good culture, and make sure employees are happy