History of Databases

February 3, 2021

Notes from: History of Databases (CMU Databases / Spring 2020) a lecture given by Andy Pavlo at CMU.

This a condensation of two papers

What Goes Around Comes Around (Stone Breaker 2006)
What’s Really New with NewSQL (Pavlo 2016)

Old Database issues are still relevant today, what’s changed is the hardware

SQL vs. NoSQL database is reminiscent of Relational vs CODASYL (netowrk model) in the 1970s. Spoiler: The relational model won.

We’re not addressing new ideas. Just new implementations enabled by new systems hardware that can support these ideas.

The First Database - IDS 1960s

The first database system was created in the 1960s by GE It was called IDS and it was based on the Network Data Model and implemented “Tuple-at-a-time” queiries. This is basically a bunch for loops.

IDS was created by Charles Bachman. Who then left and created IDMS at Culliane Database Systems.

CODASYL was created as a standard for COBOL programs to access the database.

The Network Data Model

You have collections of data You also have to define these membership sets which show the relations between collections.

You write nested for loops that interates over the sets and sets of relations.

problems

This leads to very complex queries with lots of traversal code
These database where easy to corrupt. Diffcult to repair the membership sets.

IBMs IMS

IMS - Information Management System
Built to support the Apollo purchase orders
Hierarchical data model
Programmer defined physical storage format (btree, heap, hashtable, whateve)
Lots of forloops doing single tuples.

Problems:

Lots of data duplication.
No Independance between data and the data format

Relational Model (1970s)

Invented by Ted Codd Database abstraction to avoid this maintence
Store database in simple data structures (relations aka tables)
Access data through high-level language (relational algebra - later became SQL)
Physical Storage left up the implemenation: Queries will be imdependant of data layout on disk.

Adds the idea of JOIN to relate Supplier, Supply and Part tables Don’t need to iterate through tuples anymore!

Early Implementations

System R IBM research - most famously Jim Gray
Ingres - UC Berkeley - Stonebraker later creates Postgres (post-ingres)
Oracle - Larry Ellison

Jim Gray and Stonebraker are both turning award winners… Larry is super rich!!!

The relatinal model wins and many products are created (DB2, informix, Oracle, Sybase, et al.)

All these implement SQL

Postgres today is dervived from the Stonebraker postgres code.

Object Oriented Databases (1980s)

Avoid “relational-object impedance mismatch” by tightly coupling objects and database.

Examples: Versant, ObjectStore, MarkLogic

Still exist today but there was no standard way to program these systems (no SQL)

Developed technologies to store data exmample JSON, XML fields.

Downsides

Agregations are hard - leads to complex queries.
No standard API

1990s Boring

No new inventions in database technolgies

New database products were created:

SQLServer (a fork of sybase)
MySQL writen as a replacement for mSQL
Postgres gets SQL support
SQLLite is created by someone in NC

2000s Internet boom

Big databases (Oracle, MsSQL) are too expensive to run at internet scale

OpenSource systems MySQL and Postgres did not support key features like Transactions

Companies wrote their own code to shard out accross database systems.

To bring the shards together many data wharehouses were created examples: Netezza, ParaaCcel, Greenplum, Vertica

Many were postgres forks.

These were

scale out shared nothing
closed source
based on column stores.

2010s NoSQL Systems

Transiation to focus more on availablity rather the correctness

What is lost:

Transactions
SQL

Most famous: MongoDB, Cassandra,

A counter movement called NewSQL was created. Lets not give up the relational model, SQL or transactions. (Spanner class databases)

2010s Hybird Systems

HTAP - Hybrid Transaction Analytical Processing

Execute fast OLTP like a NewSQL but support OLAP like a Data wharehouse.

Examples: MEMSQL, Sap Hana, HyPer (tablue), Splice Machine, Peloton (dead)

2010s Cloud Systems

Database sytsems which are designed to run in the Cloud.

There are an evolution of the DBaaS concept and assume a “cloud” environment

Slow Disks
Unreliable Systems

Examples: Snowflake, Fauna, Spanner, DynamoDB

2010s Shared Disk Engines

Use a shared filesystem to store data (HDFS, S3, GFS) rather then

These are typically append only storage sytsems.

This is what people talk about when creating a data lake.

Exmaples: Apache Drill, Google Big Table, Redshift

Can scale up the disk and compute independently.

2010s Graph Systems

Andy has never seen a compelling arguement for a Graph Database.

The Graph API is sort of interesting, but why impelement this in the DB?

The research shows its not clear that this is an advantage even for this use case vs relational databases.

Examples: Neo4j

2010s Timeseries Databases

Specialized systems for storing time.

Its a good example of how you can optimize your database for a specific dataset type.

Seems to get you some wins at the storage level.

These are still relational db, its just cheating for a specific type.

Examples: Clickhouse, InfluxDB

Andy is interested in Clickhouse

2010s Rise of speicialized databases

Still relational but target at a specific workload (embedded, Multi-Model, Hardware Accellerated)

Takeways

We want to study the tradeoffs these systems make.

Specialized Database Systems will expand the scope of features they support.

The relational model is a quality system and should not be discarded.