YugabyteDB: Bringing Together the Best of Amazon Aurora and Google Spanner (Karthik Ranganathan)

February 3, 2021   

Notes from YugabyteDB: Bringing Together the Best of Amazon Aurora and Google Spanner (Karthik Ranganathan), one of the Quarantine Database Tech Talks (2020).

This talk is given by Karthik Ranganathan, one of the authors of Cassandra at Facebook and now a co-founder and CTO at Yugabyte.

Yugabyte intends to be a distributed SQL store, under the Apache 2.0 license.

Logical Layers

  • Yugabyte SQL (YSQL): a Postgres-compatible distributed SQL API. The query layer is pluggable, supporting both YSQL and the YCQL API (Cassandra-compatible).
  • DocDB: a Spanner-inspired distributed document store.

Single-node PostgreSQL provides:

  • a PostgreSQL query layer
  • a PostgreSQL storage layer

Let's extend this: keep the query layer and distribute the storage layer.

Yugabyte started in 2016.

  • RDS and Spanner had been released
  • Enterprises were not yet ready to take their databases to the cloud?

Decision #1

  • Should we build an open-source Aurora or an open-source Spanner?

Aurora is much more widely used than Spanner.

Comparing Aurora and Spanner

  • Aurora inherits the MySQL/PostgreSQL query engine, and with it the full benefits of those engines.
  • Spanner supports a much smaller slice of SQL.
  • Aurora is pretty much a drop-in replacement for PostgreSQL or MySQL.
  • Spanner is scale-out with synchronous replication; Aurora is not.

Yugabyte supports Async Replication between clusters

Decision: basically, use the best of both.

Decision #2 Reuse or rewrite PostgreSQL?

They tried and initially failed to reuse PostgreSQL, but investing in a full rewrite was too much work; timely SQL feature support would be impossible without reuse.

The blog post on pushdowns is worth looking up: search for “Five pushdowns Yugabyte”.
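To make the pushdown idea concrete, here is a toy sketch (not Yugabyte code; the data and function names are made up) of the difference between filtering in the query layer and pushing the predicate down into the storage layer:

```python
# Toy illustration of a predicate pushdown. See the "Five pushdowns"
# Yugabyte blog post for the real techniques.

rows = [{"id": i, "city": "NYC" if i % 2 else "SF"} for i in range(10)]

# Without pushdown: storage returns every row; the query layer filters.
def scan_all():
    return rows  # all 10 rows cross the layer boundary

# With pushdown: the predicate travels down to storage, and only
# matching rows cross the layer boundary (and, in a distributed
# system, the network).
def scan_with_pushdown(pred):
    return [r for r in rows if pred(r)]

print(len(scan_all()))                                        # 10
print(len(scan_with_pushdown(lambda r: r["city"] == "NYC")))  # 5
```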

They reused the Postgres query layer and replaced the storage engine.

Decision #3: Raft for single-row linearizability

  • Raft is easier to implement
  • Raft formalizes membership changes

Karthik is a little flippant about the Raft implementation. Testing was hard.

NOTE: Learn Raft.
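Since the note says to learn Raft, here is a minimal sketch of the heart of it: the AppendEntries consistency check a follower runs during log replication. This is illustrative Python, not YugabyteDB's implementation, and it omits elections, persistence, and snapshots:

```python
class RaftFollower:
    """Minimal sketch of a Raft follower's log-replication logic."""

    def __init__(self):
        self.current_term = 0
        self.log = []  # list of (term, command) pairs, 0-indexed here

    def append_entries(self, term, prev_index, prev_term, entries):
        """Handle an AppendEntries RPC from the leader; return success."""
        # Reject leaders from stale terms.
        if term < self.current_term:
            return False
        self.current_term = term

        # Consistency check: our log must have an entry at prev_index
        # with a matching term, otherwise the leader must back up and
        # retry with an earlier prev_index.
        if prev_index >= 0:
            if prev_index >= len(self.log) or self.log[prev_index][0] != prev_term:
                return False

        # Drop any conflicting suffix, then append the leader's entries.
        self.log = self.log[:prev_index + 1] + list(entries)
        return True
```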

Decision #4: Reuse or modify RocksDB for per-node storage

Background: the team had been using RocksDB (from Facebook).

YugabyteDB uses a heavily modified RocksDB:

  • The key-value store is extended into a document store.
  • The WAL log of RocksDB is not used (the Raft log serves that role).
  • MVCC is performed at a higher layer.

Q: Why store as a document rather than a tuple?

A:

  • We can support short range scans within a document.
  • It allows deleting a lot of KVs in a single operation.
  • DocDB does not know anything about the schema.
  • This is a loose notion of a document; it's more like a specialized tuple.

“There is a relationship between a sequence of these tuples”
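A rough sketch of how a document-style key layout gives you those properties. The encoding below is illustrative (string-based for readability); the real DocDB uses an order-preserving binary encoding with hybrid timestamps:

```python
# Sketch: (doc_key, subkey, timestamp) -> value, laid out so that all of
# one document's entries are adjacent in the sorted keyspace.

def encode_key(doc_key, subkey, ts):
    # Hypothetical order-preserving encoding, for illustration only.
    return f"{doc_key}|{subkey}|ts={ts:06d}"

store = {
    encode_key("user:42", "email", 103): "a@example.com",
    encode_key("user:42", "email", 101): "old@example.com",  # older MVCC version
    encode_key("user:42", "name", 105): "alice",
    encode_key("user:77", "name", 104): "bob",
}

# Short range scan: one document's keys are contiguous, so reading the
# whole "document" is a prefix scan, and deleting it can be a single
# operation covering the same prefix.
prefix = "user:42|"
doc = {k: v for k, v in sorted(store.items()) if k.startswith(prefix)}
print(doc)  # only user:42 entries, all versions, in key order
```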

For denormalization…

Distributed (aka multi-shard) transactions

Anything that performs operations across multiple tablets.

There is a fast path for single row operations.

BEGIN TXN
  UPDATE k1
  UPDATE k2
COMMIT

k1 and k2 may belong to different shards, which belong to different Raft groups on completely different nodes.

The updates should get written at the same “time”, so the nodes need to agree on time.

Spanner uses atomic clocks (TrueTime), synchronized to within ~7ms.

NTP is ~150-200ms

Solution: Hybrid Logical Clock

Combine coarsely-synchronized physical clocks with Lamport clocks to track causal relationships: (physical component, logical component)

Nodes update their HLC on each message they exchange.

Based on the same clock used by CockroachDB.
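A minimal sketch of the HLC update rules (following the Kulkarni et al. hybrid logical clocks paper; this is not YugabyteDB's actual code):

```python
import time

class HLC:
    """Hybrid logical clock: (physical component, logical component)."""

    def __init__(self):
        self.pt = 0  # highest physical time seen so far
        self.lc = 0  # logical counter for breaking ties

    def _wall(self):
        return int(time.time() * 1_000_000)  # microseconds

    def now(self):
        """Tick for a local event or an outgoing message."""
        wall = self._wall()
        if wall > self.pt:
            self.pt, self.lc = wall, 0
        else:
            self.lc += 1  # physical clock hasn't advanced past pt
        return (self.pt, self.lc)

    def update(self, msg_pt, msg_lc):
        """Merge a timestamp received from another node."""
        wall = self._wall()
        if wall > self.pt and wall > msg_pt:
            self.pt, self.lc = wall, 0             # wall clock dominates
        elif msg_pt > self.pt:
            self.pt, self.lc = msg_pt, msg_lc + 1  # sender's clock dominates
        elif msg_pt == self.pt:
            self.lc = max(self.lc, msg_lc) + 1     # tie: bump the counter
        else:
            self.lc += 1                           # our clock dominates
        return (self.pt, self.lc)
```

The key property: timestamps are causally ordered like Lamport clocks but stay close to real time, so they can double as MVCC read points.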

Decision #6: Distributed Transactions Algorithm?

Used a Google Spanner-style design, based on 2PC and using the HLC:

  • More scalability
  • Better for multi-region deployments
  • Does not require atomic clocks

-> Clouds are adding time-sync services (TrueTime for Google).

Fully Decentralized Transactions

  • No single point of failure or bottleneck
  • Transaction status table distributed across multiple nodes
  • Transactions have 3 states (committed, aborted, pending)
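A minimal sketch of the decentralized commit idea: each shard writes provisional records ("intents") pointing at a transaction, and the commit point is a single atomic flip of the transaction's status record. Names here are illustrative, not YugabyteDB's actual code:

```python
import enum
import uuid

class TxnStatus(enum.Enum):
    PENDING = "pending"
    COMMITTED = "committed"
    ABORTED = "aborted"

# In the real system this table is itself sharded and Raft-replicated
# across nodes, so there is no single point of failure or bottleneck.
status_table = {}

def begin_txn():
    txn_id = uuid.uuid4()
    status_table[txn_id] = TxnStatus.PENDING
    return txn_id

def write_intent(shard, key, value, txn_id):
    # Each shard stores a provisional record referencing the transaction
    # instead of the committed value itself.
    shard[key] = {"value": value, "txn": txn_id}

def commit(txn_id):
    # The commit point is one replicated write. A reader that finds an
    # intent resolves it by looking up the transaction's status here.
    status_table[txn_id] = TxnStatus.COMMITTED

# Usage: a transaction spanning two shards.
shard_a, shard_b = {}, {}
txn = begin_txn()
write_intent(shard_a, "k1", "v1", txn)
write_intent(shard_b, "k2", "v2", txn)
commit(txn)  # both writes become visible atomically
```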

Isolation levels supported by Yugabyte:

  • Serializable
  • Snapshot Isolation (Repeatable Read, Read Committed & Read Uncommitted)

Decision #7: Async “xCluster” Replication

xCluster Replication: allow 2 or more YugaByte clusters to consume changed data from each other (assuming the participating tables have the same schema).

The architecture also enables change data capture (CDC): external systems can subscribe to and consume changed data from YugaByte DB tables.

They decided against exposing conflict resolution to users.
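The talk doesn't spell out the conflict policy, but a common approach for async cross-cluster replication (and a reasonable reading of "not exposed to users") is last-writer-wins on the HLC timestamp carried with each change. A hedged sketch:

```python
# Hypothetical changed-data record: (key, value, hlc_ts). Illustrative
# only; not Yugabyte's actual xCluster wire format or code.

def apply_change(local_store, change):
    """Apply one change from a remote cluster, last-writer-wins."""
    key, value, hlc_ts = change
    current = local_store.get(key)
    # Keep whichever write carries the larger hybrid timestamp, so both
    # clusters converge to the same value without user involvement.
    if current is None or hlc_ts > current[1]:
        local_store[key] = (value, hlc_ts)

store = {}
apply_change(store, ("k1", "from-cluster-B", (105, 0)))
apply_change(store, ("k1", "stale-write",    (100, 3)))  # loses, ignored
print(store)  # {'k1': ('from-cluster-B', (105, 0))}
```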

Testing YugaByte:

  • Long CI/CD pipeline.
  • Otherwise a typical CI/CD setup.

  • Spark-based parallel testing: the unit tests took many hours to run and needed to run in parallel.
  • YugaByte is written in C++ and even supports macOS.
  • They run end-to-end black-box tests as well.
  • Performance regression testing is hard to do on cloud machines; they allow drops of up to 10%.

They run test suites for PostgreSQL, YugabyteDB, Raft, and RocksDB, which is a bit crazy.

They have ported a little over half of the PostgreSQL test framework.

Products

  • A platform product that you install into your own project
  • A managed cloud offering with a free tier (beta)

Citus vs. Yugabyte?

  • In functional spirit, Yugabyte is closer to CockroachDB than to Citus.
  • Citus uses the lower half of Postgres but replaces the top end.

How stupid are your users?

  • Don’t go around asking people what they want; people don’t understand the limits of physics. Never build a different API than what the world already knows, or you get bogged down communicating it. They had 3 APIs originally, one of which was a Redis API, and that confused users a lot.