Notes on 5 Years of Scaling Rails - Simon Eskildsen

September 20, 2017   

These are pre-interview notes I took while watching a talk by Simon Eskildsen before I joined Shopify. Originally this was a draft, but re-reading it turned out to be an interesting history of Shopify, so I decided to publish it.

This stays under the “blog” topic because it’s interesting for me to have a log of my pre-interview notes, not because it’s technically useful information.

Notes on Shopify in 2017

  • 377,500 Shops
  • 1900 Employees
  • 2 Data Centers
  • 40 Daily Deploys
  • 80,000 requests per second at peak; 20,000-40,000 steady state
  • $29 Billion in transactions
  • Ruby since 2006

Doubling every year… 377,500 shops today means Shopify needs to handle roughly 750,000 shops in 2018.

Traffic breakdown:

  • 80% of traffic is browsing storefronts
  • Checkout - write-heavy, processes transactions
  • Admin - change billing, update inventory
  • API - fast admin operations

Flash sales are a big challenge: a single customer can drive as much traffic as the rest of the platform combined. Shopify needs to be ready for these!

Shopify built a team focused on fixing the “flash sale” problem about 5 years ago.

Being able to handle flash sales means Shopify will be ready to scale to next year’s transaction volumes.

Major Infra Projects:

2012 - Optimization for flash sales

  • Did basic application logic optimization.
  • Implemented background checkouts - technical debt.
  • Inventory optimizations.
  • Cleaned up hot spots across hundreds of queries.
  • Wrote a load-testing tool - full integration testing against production - which gives a feedback loop.
  • Created a library called IdentityCache.
  • Problem: one MySQL instance was hosting thousands of shops. Tried read slaves and failed.
  • Used IdentityCache to read data from Memcached rather than the DB (see the sketch after this list). Problem: cache invalidation is hard.
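
A rough sketch of how IdentityCache is used, based on the gem’s public README - the model, fields, and associations here are illustrative, not Shopify’s actual schema. Models opt in explicitly, and reads go through Memcached via generated fetch methods instead of the normal ActiveRecord finders (which is also why invalidation is the hard part: it happens in after-commit hooks and only covers the opted-in paths).

```ruby
class Product < ActiveRecord::Base
  include IdentityCache

  cache_index :handle, unique: true     # generates Product.fetch_by_handle
  cache_has_many :variants, embed: true # caches the association alongside the record
end

# Reads hit Memcached first and fall back to MySQL on a miss; the cache is
# filled lazily and invalidated in after_commit hooks.
product  = Product.fetch(product_id)
product  = Product.fetch_by_handle("red-t-shirt")
variants = product.fetch_variants
```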

2013 - Database Sharding

  • Too much optimization made the application hard to work with and required experts - they couldn’t optimize any further.
  • Needed to shard the database to scale writes anyway - you have to take writes to process transactions.
  • Created a wrapper that put developers in the correct shard context (see the sketch after this list).
  • Drawbacks: can’t join across shops or run ad-hoc queries across all shops.
  • If you can avoid it - don’t SHARD!!! It took a full year to do.
  • You could shard at the database layer so you don’t have to do it in the application, but staying on a plain relational database is a real benefit.
  • They didn’t have the experience to write a proxy and couldn’t find a magic database that would do it for them.

2014 - Investing in Resiliency

  • At scale you will have more failures and more interactions between components.
  • Tightly coupled components break availability.
  • Slow components will break the system!!!
  • The mantra: a single component failure should not be able to compromise the performance or availability of the entire system.
  • Created a resiliency matrix mapping which features degrade when each component fails.
  • Learned they were highly exposed to failures in various systems.
  • Wrote a tool to simulate network problems (Toxiproxy) and observed the app’s behaviour (see the sketch after this list).
  • Created Shopify/Semian - a library of circuit breakers and bulkheads that helps your application be more resilient.
  • Lots of resiliency debt - it hadn’t been paid attention to for 10 years!
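
Toxiproxy sits between the application and its dependencies and lets a test inject latency or sever a connection on demand. A small sketch of its Ruby client in a test - the proxy names (:redis, :mysql_master) are whatever you registered with Toxiproxy, and the assertions are illustrative:

```ruby
require "toxiproxy"

# Add one second of latency on the Redis connection and assert the
# storefront still renders (Redis being slow should degrade, not break, it).
Toxiproxy[:redis].downstream(:latency, latency: 1000).apply do
  get "/products/red-t-shirt"
  assert_response :success
end

# Sever the MySQL connection entirely and check the request fails fast
# instead of hanging; the exact error class depends on the driver.
Toxiproxy[:mysql_master].down do
  assert_raises(Mysql2::Error) { Shop.connection.execute("SELECT 1") }
end
```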

2015 - Multi-DC

  • Need to be ready to move to the other data center should one fail.
  • Shopify failover between DCs is push-button.
  • Workflow (sketched in code below):
  1. Fail over traffic - start directing new traffic to the new DC.
  2. Read-only - Shopify traffic is going to the new DC, but it is read-only.
  3. Fail over the database - move the writer for all shards to the new primary DC.
  4. Transfer background jobs.
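
A purely hypothetical sketch of what the push-button failover could orchestrate - none of these class or method names are Shopify’s real tooling; they just mirror the four steps above, with each helper standing in for real infrastructure calls (load balancers, MySQL topology changes, job queues).

```ruby
# Hypothetical orchestration of the four-step DC failover described above.
class DatacenterFailover
  def initialize(from:, to:)
    @from = from
    @to   = to
  end

  def run!
    route_traffic_to(@to)            # 1. new requests land in the target DC
    enable_read_only_mode            # 2. serve reads while writes are paused
    promote_shard_writers_in(@to)    # 3. each shard's MySQL writer moves to the new DC
    move_background_jobs(@from, @to) # 4. drain and re-enqueue jobs in the new DC
  ensure
    disable_read_only_mode
  end

  private

  # Each stub stands in for a real infrastructure call.
  def route_traffic_to(dc)
    puts "pointing load balancers at #{dc}"
  end

  def enable_read_only_mode
    puts "flipping the app into read-only mode"
  end

  def promote_shard_writers_in(dc)
    puts "promoting MySQL writers in #{dc}"
  end

  def move_background_jobs(from, to)
    puts "moving job queues from #{from} to #{to}"
  end

  def disable_read_only_mode
    puts "re-enabling writes"
  end
end

DatacenterFailover.new(from: "dc-east", to: "dc-west").run!
```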

2016 - Active-Active DC

  • MySQL was sharded, but Redis, Memcached, workers, and load balancers were shared.
  • Broke Shopify into many “pods” (AKA cells), each carrying a subset of shops.
  • Pods could be split across DCs.
  • Needed to add a service called Sorting Hat to decide where to send requests.
  • Two rules for multi-DC: any request must be annotated with a pod or shop, and any request can only touch one pod.
  • Lots of code violated these rules (e.g. searching all shops for PayPal plugins).
  • Started using shitlist-driven development: a wrapper that raised an exception when the rules were broken, unless the caller was already on the shitlist (see the sketch after this list).
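
A hypothetical sketch of the shitlist pattern - the module name, the CurrentPod helper, and the offender paths are all made up for illustration. Existing violators are grandfathered onto a list that is only allowed to shrink; any new code path that runs a query without a pod/shop context raises immediately.

```ruby
# Hypothetical enforcement of the "every request touches exactly one pod" rule.
module CrossPodShitlist
  Violation = Class.new(StandardError)

  # Call sites that predate the rule; this list may only ever shrink.
  KNOWN_OFFENDERS = %w[
    app/jobs/paypal_plugin_audit_job.rb
    app/models/global_shop_search.rb
  ].freeze

  def self.check!
    offender = caller_locations.map(&:path).find { |path| path.include?("/app/") }
    return if offender && KNOWN_OFFENDERS.any? { |known| offender.end_with?(known) }

    raise Violation, "query is not scoped to a single pod/shop (called from #{offender})"
  end
end

# Minimal stand-in for whatever tracks the active pod for the current request.
module CurrentPod
  def self.set?
    !Thread.current[:pod_id].nil?
  end
end

# Wired into the (hypothetical) query layer: anything running without a pod
# context must either be on the shitlist or raise loudly in development/CI.
begin
  CrossPodShitlist.check! unless CurrentPod.set?
rescue CrossPodShitlist::Violation => e
  puts "blocked: #{e.message}"
end
```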

Shopify has a single master metadata DB.

  • Read slaves in each data center
  • Lower SLO

A single store can use 60-80% of a DC’s capacity.