Notes on 5 Years of Scaling Rails - Simon Eskildsen
September 20, 2017
These are pre-interview notes I took while watching a talk by Simon Eskildsen before I joined Shopify. Originally this was a draft, but rereading it made for an interesting history of Shopify, so I switched it to published.
This stays in the “blog” topic because it’s interesting for me to have a log of some pre-interview notes; it’s not actually technically useful information.
Notes on Shopify in 2017
- 377,500 Shops
- 1900 Employees
- 2 Data Centers
- 40 Daily Deploys
- 80,000 Peak requests per second, 20K-40K steady
- $29 Billion in transactions
- Ruby since 2006
Doubling every year… needs to be at ~750,000 shops in 2018?
Workloads:
- 80% of Traffic is Browsing storefront
- Checkout - Heavy writes doing transactions
- Admin - Change Billing, Update inventory
- API - Fast ADMIN operations
Flash sales are a big challenge - a single customer can drive as much traffic as the rest of the platform. Need to be ready for these!
Built a team focused on fixing the “flash sale” problem - done 5 years ago.
Being able to handle flash sales means Shopify will be ready to scale to meet transaction volumes next year.
Major Infra Projects:
2012 - Optimization for flash sales
- Do basic application logic optimization.
- Implement background checkouts - technical debt
- Inventory Optimizations
- Clean up hot spots across 100’s of queries
- Wrote a load testing tool - full production integration testing that gives a feedback loop
- Created a library called IdentityCache
- Problem: one MySQL was hosting 1000’s of shops. Tried read slaves and failed.
- Used the idea of IdentityCache to get data from Memcached rather than the DB. Problem - cache invalidation is hard (see the sketch after this list)
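A minimal sketch of what the IdentityCache idea looks like from application code, assuming a Rails app with the identity_cache gem and a Memcached-backed cache configured. The Product model and its fields are made up; the `include IdentityCache` / `fetch` / `cache_index` API is my reading of the public gem, so check the README for current details.

```ruby
class Product < ActiveRecord::Base
  include IdentityCache

  # Cache lookups by primary key and by a secondary index.
  cache_index :handle, unique: true
end

# Reads go through Memcached and fall back to MySQL on a miss;
# the cache entry is filled in on the way back.
product   = Product.fetch(42)
by_handle = Product.fetch_by_handle("red-t-shirt")

# Writes still hit MySQL; IdentityCache expires the cached blob in an
# after_commit hook - which is exactly where invalidation gets hard.
product.update!(title: "Red T-Shirt")
```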
2013 - Database Sharding
- Too much optimization made the application hard to work with. Required experts. Couldn’t optimize anymore.
- Needed to shard the database to scale writes anyway - you need to take writes to do transactions
- Created a wrapper that puts developers in the correct shard context (rough sketch after this list)
- Drawbacks: can’t do things like join across shops, or run ad-hoc queries across all shops
- If you can avoid it - Don’t SHARD!!! It took a full year to do.
- You could shard at the DB level so you don’t need to do it at the application level, but keeping a plain relational database is a big benefit.
- Didn’t have the experience to write a proxy and were unable to find a magic database.
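As a rough sketch of the shard-context wrapper idea (the class, method names, and modulo routing below are my guesses, not Shopify’s actual implementation - in reality the wrapper would switch the ActiveRecord connection and use a routing table):

```ruby
# Hypothetical shard-context wrapper: every shop lives on exactly one
# shard, and application code must opt into a shard before querying.
class Shard
  THREAD_KEY = :current_shard
  NUM_SHARDS = 16 # assumed value for the sketch

  def self.current
    Thread.current[THREAD_KEY] or
      raise "No shard selected - wrap the call in Shard.with_shop"
  end

  # Pick the shard that owns a shop, run the block against it, then
  # restore whatever shard (if any) was active before.
  def self.with_shop(shop_id)
    previous = Thread.current[THREAD_KEY]
    Thread.current[THREAD_KEY] = shard_for(shop_id)
    yield
  ensure
    Thread.current[THREAD_KEY] = previous
  end

  def self.shard_for(shop_id)
    "shard_#{shop_id % NUM_SHARDS}" # a real lookup would use a routing table
  end
end

# Everything inside the block runs against shop 1234's shard; code that
# forgets the context fails loudly instead of querying across shops.
Shard.with_shop(1234) do
  puts Shard.current # => "shard_2"
end
```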
2014 - Investing in Resiliency
- At scale you will have more failures, more interactions between components
- Tightly knit components will break availability
- Slow components will break the system!!!
- Need to take this mantra: a single component failure should not be able to compromise the performance or availability of the entire system.
- Create a Resiliency Matrix
- Learned they were highly exposed to various systems
- Wrote a tool to simulate network problems (Toxiproxy) and observed the impact
- Created Shopify/Semian - a library which helps your application be more resilient (rough sketch after this list)
- Lots of Debt - not paid attention to for 10 years!
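A rough sketch of the Semian idea - a bulkhead plus circuit breaker around a slow dependency - based on my reading of the README. The resource name, thresholds, and InventoryClient are made up, and option names may differ between versions:

```ruby
require "semian"

# Register a protected resource: a bulkhead (limited tickets) plus a
# circuit breaker that opens after repeated errors, so a slow dependency
# fails fast instead of tying up every worker.
Semian.register(
  :inventory_service,
  tickets: 3,           # at most 3 workers talk to it concurrently
  error_threshold: 5,   # open the circuit after 5 errors...
  error_timeout: 10,    # ...and keep it open for 10 seconds
  success_threshold: 2  # require 2 successes to close it again
)

def fetch_inventory(sku)
  Semian[:inventory_service].acquire do
    InventoryClient.get(sku) # hypothetical client for the slow dependency
  end
rescue Semian::BaseError
  # Circuit open or no ticket available - degrade gracefully instead of
  # letting one slow component take the whole request down.
  nil
end
```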
2015 - Multi-DC
- Need to be ready to move to the other datacenter should one fail
- Shopify failover between DCs is push-button
- Workflow (sketched in code after this list):
- Failover traffic - start directing new traffic to the new DC
- Read-only - Shopify traffic is going to the new DC but is read-only
- Failover database - move the writer for all shards to the new primary DC
- Transfer background jobs
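As a sketch only, the push-button workflow above could be an orchestration script along these lines; every name here is hypothetical, it is just the notes’ order of operations in code form:

```ruby
# Hypothetical push-button DC failover - the steps from the notes, in order.
# traffic, shards, and jobs stand in for real load-balancer, MySQL, and
# job-queue tooling.
def failover!(from:, to:, traffic:, shards:, jobs:)
  # 1. Start directing new traffic to the new DC; until the writers move,
  #    that traffic is served read-only.
  traffic.route_to(to)

  # 2. Move the MySQL writer for every shard to the new primary DC,
  #    which ends the read-only window.
  shards.each { |shard| shard.promote_writer(datacenter: to) }

  # 3. Transfer queued background jobs so they drain in the new DC.
  jobs.transfer(from: from, to: to)
end
```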
2016 - Active-Active DC
- MySQL was sharded but Redis, Memcache, Workers, Load Balancers were shared
- Break Shopify into many self-contained “pods” (AKA cells)
- Could split pods across DC’s
- Needed to add a service called Sorting Hat to decide where to send requests
- Two rules for multi-DC: any request must be annotated with a pod or shop, and any request can only touch one pod
- Lots of code violated these rules (search all the shops for PayPal plugins)
- Started using shitlist-driven development - a wrapper which raises an exception when the rules are broken (rough sketch after this list)
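A rough sketch of shitlist-driven enforcement of the pod rules; the module, class, and the way violations are keyed are my invention - the idea is just “new violations raise, known old ones are allow-listed until they’re burned down”:

```ruby
# Hypothetical enforcement of the multi-DC rules: every request must be
# annotated with a pod/shop, and code that queries outside a pod context
# raises - except for the grandfathered offenders on the shitlist.
module Pod
  def self.current
    Thread.current[:pod] # set by the routing layer ("Sorting Hat")
  end
end

class PodEnforcement
  # Existing violations keep working while they get cleaned up;
  # anything not on this list raises immediately.
  SHITLIST = [
    "LegacyPaypalAudit#scan_all_shops",
    "Admin::GlobalSearchController#index",
  ].freeze

  def self.check!(caller_label)
    return if Pod.current                     # request is inside a pod
    return if SHITLIST.include?(caller_label) # known debt, allowed for now

    raise "#{caller_label} ran outside a pod context - " \
          "annotate the request with a shop or pod"
  end
end

# e.g. called from a query wrapper before any shop-spanning work:
Thread.current[:pod] = "pod-7"
PodEnforcement.check!("ReportsController#index") # passes inside a pod
```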
Shopify has a single master metadata DB.
- Read Slaves at each Datacenter
- Lower SLO
A single store can use 60-80% of DC capacity