Notes on 5 Years of Scaling Rails - Simon Eskildsen

September 20, 2017   

These are pre-interview notes I took while watching a talk by Simon Eskildsen before I joined Shopify. Originally this was a draft, but re-reading it turned out to be an interesting history of Shopify, so I decided to publish it.

This stays under the “blog” topic because it’s interesting for me to have a log of my pre-interview notes, not because it’s technically useful information.

Notes on Shopify in 2017

  • 377,500 Shops
  • 1900 Employees
  • 2 Data Centers
  • 40 Daily Deploys
  • 80,000 requests per second at peak; 20,000-40,000 steady state
  • $29 Billion in transactions
  • Ruby since 2006

Doubling every year… 377,500 shops today means Shopify needs to handle roughly 750,000 shops in 2018.

Traffic breakdown:

  • 80% of traffic is browsing storefronts
  • Checkout - write-heavy, processes transactions
  • Admin - change billing, update inventory
  • API - fast admin operations

Flash sales are a big challenge: a single customer can drive as much traffic as the rest of the platform combined. Shopify needs to be ready for these!

Shopify built a team focused on fixing the “flash sale” problem about 5 years ago.

Being able to handle flash sales means Shopify will be ready to scale to next year’s transaction volumes.

Major Infra Projects:

2012 - Optimization for flash sales

  • Did basic application logic optimization.
  • Implemented background checkouts - technical debt.
  • Inventory optimizations.
  • Cleaned up hot spots across hundreds of queries.
  • Wrote a load-testing tool - full integration testing against production - which gives a feedback loop.
  • Created a library called IdentityCache.
  • Problem: one MySQL instance was hosting thousands of shops. Tried read slaves and failed.
  • Used IdentityCache to read data from Memcached rather than the DB (see the sketch after this list). Problem: cache invalidation is hard.
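
A rough sketch of how IdentityCache is used, based on the gem’s public README - the model, fields, and associations here are illustrative, not Shopify’s actual schema. Models opt in explicitly, and reads go through Memcached via generated fetch methods instead of the normal ActiveRecord finders (which is also why invalidation is the hard part: it happens in after-commit hooks and only covers the opted-in paths).

```ruby
class Product < ActiveRecord::Base
  include IdentityCache

  cache_index :handle, unique: true     # generates Product.fetch_by_handle
  cache_has_many :variants, embed: true # caches the association alongside the record
end

# Reads hit Memcached first and fall back to MySQL on a miss; the cache is
# filled lazily and invalidated in after_commit hooks.
product  = Product.fetch(product_id)
product  = Product.fetch_by_handle("red-t-shirt")
variants = product.fetch_variants
```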

2013 - Database Sharding

  • Too much optimization made the application hard to work with and required experts - they couldn’t optimize any further.
  • Needed to shard the database to scale writes anyway - you have to take writes to process transactions.
  • Created a wrapper that put developers in the correct shard context (see the sketch after this list).
  • Drawbacks: can’t join across shops or run ad-hoc queries across all shops.
  • If you can avoid it - don’t SHARD!!! It took a full year to do.
  • You could shard at the database layer so you don’t have to do it in the application, but staying on a plain relational database is a real benefit.
  • They didn’t have the experience to write a proxy and couldn’t find a magic database that would do it for them.

2014 - Investing in Resiliency

  • At scale you will have more failures and more interactions between components.
  • Tightly coupled components break availability.
  • Slow components will break the system!!!
  • The mantra: a single component failure should not be able to compromise the performance or availability of the entire system.
  • Created a resiliency matrix mapping which features degrade when each component fails.
  • Learned they were highly exposed to failures in various systems.
  • Wrote a tool to simulate network problems (Toxiproxy) and observed the app’s behaviour (see the sketch after this list).
  • Created Shopify/Semian - a library of circuit breakers and bulkheads that helps your application be more resilient.
  • Lots of resiliency debt - it hadn’t been paid attention to for 10 years!
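
Toxiproxy sits between the application and its dependencies and lets a test inject latency or sever a connection on demand. A small sketch of its Ruby client in a test - the proxy names (:redis, :mysql_master) are whatever you registered with Toxiproxy, and the assertions are illustrative:

```ruby
require "toxiproxy"

# Add one second of latency on the Redis connection and assert the
# storefront still renders (Redis being slow should degrade, not break, it).
Toxiproxy[:redis].downstream(:latency, latency: 1000).apply do
  get "/products/red-t-shirt"
  assert_response :success
end

# Sever the MySQL connection entirely and check the request fails fast
# instead of hanging; the exact error class depends on the driver.
Toxiproxy[:mysql_master].down do
  assert_raises(Mysql2::Error) { Shop.connection.execute("SELECT 1") }
end
```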

2015 - Multi-DC

  • Need to be ready to move to the other data center should one fail.
  • Shopify failover between DCs is push-button.
  • Workflow (sketched in code below):
  1. Fail over traffic - start directing new traffic to the new DC.
  2. Read-only - Shopify traffic is going to the new DC, but it is read-only.
  3. Fail over the database - move the writer for all shards to the new primary DC.
  4. Transfer background jobs.
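
A purely hypothetical sketch of what the push-button failover could orchestrate - none of these class or method names are Shopify’s real tooling; they just mirror the four steps above, with each helper standing in for real infrastructure calls (load balancers, MySQL topology changes, job queues).

```ruby
# Hypothetical orchestration of the four-step DC failover described above.
class DatacenterFailover
  def initialize(from:, to:)
    @from = from
    @to   = to
  end

  def run!
    route_traffic_to(@to)            # 1. new requests land in the target DC
    enable_read_only_mode            # 2. serve reads while writes are paused
    promote_shard_writers_in(@to)    # 3. each shard's MySQL writer moves to the new DC
    move_background_jobs(@from, @to) # 4. drain and re-enqueue jobs in the new DC
  ensure
    disable_read_only_mode
  end

  private

  # Each stub stands in for a real infrastructure call.
  def route_traffic_to(dc)
    puts "pointing load balancers at #{dc}"
  end

  def enable_read_only_mode
    puts "flipping the app into read-only mode"
  end

  def promote_shard_writers_in(dc)
    puts "promoting MySQL writers in #{dc}"
  end

  def move_background_jobs(from, to)
    puts "moving job queues from #{from} to #{to}"
  end

  def disable_read_only_mode
    puts "re-enabling writes"
  end
end

DatacenterFailover.new(from: "dc-east", to: "dc-west").run!
```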

2016 - Active-Active DC

  • MySQL was sharded, but Redis, Memcached, workers, and load balancers were shared.
  • Broke Shopify into many “pods” (AKA cells), each carrying a subset of shops.
  • Pods could be split across DCs.
  • Needed to add a service called Sorting Hat to decide where to send requests.
  • Two rules for multi-DC: any request must be annotated with a pod or shop, and any request can only touch one pod.
  • Lots of code violated these rules (e.g. searching all shops for PayPal plugins).
  • Started using shitlist-driven development: a wrapper that raised an exception when the rules were broken, unless the caller was already on the shitlist (see the sketch after this list).
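
A hypothetical sketch of the shitlist pattern - the module name, the CurrentPod helper, and the offender paths are all made up for illustration. Existing violators are grandfathered onto a list that is only allowed to shrink; any new code path that runs a query without a pod/shop context raises immediately.

```ruby
# Hypothetical enforcement of the "every request touches exactly one pod" rule.
module CrossPodShitlist
  Violation = Class.new(StandardError)

  # Call sites that predate the rule; this list may only ever shrink.
  KNOWN_OFFENDERS = %w[
    app/jobs/paypal_plugin_audit_job.rb
    app/models/global_shop_search.rb
  ].freeze

  def self.check!
    offender = caller_locations.map(&:path).find { |path| path.include?("/app/") }
    return if offender && KNOWN_OFFENDERS.any? { |known| offender.end_with?(known) }

    raise Violation, "query is not scoped to a single pod/shop (called from #{offender})"
  end
end

# Minimal stand-in for whatever tracks the active pod for the current request.
module CurrentPod
  def self.set?
    !Thread.current[:pod_id].nil?
  end
end

# Wired into the (hypothetical) query layer: anything running without a pod
# context must either be on the shitlist or raise loudly in development/CI.
begin
  CrossPodShitlist.check! unless CurrentPod.set?
rescue CrossPodShitlist::Violation => e
  puts "blocked: #{e.message}"
end
```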

Shopify has a single master metadata DB.

  • Read slaves in each data center
  • Lower SLO

A single store can use 60-80% of a DC’s capacity.