History of Databases
February 3, 2021
Notes from: History of Databases (CMU Databases / Spring 2020) a lecture given by Andy Pavlo at CMU.
This a condensation of two papers
- What Goes Around Comes Around (Stone Breaker 2006)
- What’s Really New with NewSQL (Pavlo 2016)
Old Database issues are still relevant today, what’s changed is the hardware
SQL vs. NoSQL database is reminiscent of Relational vs CODASYL (netowrk model) in the 1970s. Spoiler: The relational model won.
We’re not addressing new ideas. Just new implementations enabled by new systems hardware that can support these ideas.
The First Database - IDS 1960s
The first database system was created in the 1960s by GE It was called IDS and it was based on the Network Data Model and implemented “Tuple-at-a-time” queiries. This is basically a bunch for loops.
IDS was created by Charles Bachman. Who then left and created IDMS at Culliane Database Systems.
CODASYL was created as a standard for COBOL programs to access the database.
The Network Data Model
You have collections of data You also have to define these membership sets which show the relations between collections.
You write nested for loops that interates over the sets and sets of relations.
problems
- This leads to very complex queries with lots of traversal code
- These database where easy to corrupt. Diffcult to repair the membership sets.
IBMs IMS
- IMS - Information Management System
- Built to support the Apollo purchase orders
- Hierarchical data model
- Programmer defined physical storage format (btree, heap, hashtable, whateve)
- Lots of forloops doing single tuples.
Problems:
- Lots of data duplication.
- No Independance between data and the data format
Relational Model (1970s)
- Invented by Ted Codd Database abstraction to avoid this maintence
- Store database in simple data structures (relations aka tables)
- Access data through high-level language (relational algebra - later became SQL)
- Physical Storage left up the implemenation: Queries will be imdependant of data layout on disk.
Adds the idea of JOIN to relate Supplier, Supply and Part tables Don’t need to iterate through tuples anymore!
Early Implementations
- System R IBM research - most famously Jim Gray
- Ingres - UC Berkeley - Stonebraker later creates Postgres (post-ingres)
- Oracle - Larry Ellison
Jim Gray and Stonebraker are both turning award winners… Larry is super rich!!!
The relatinal model wins and many products are created (DB2, informix, Oracle, Sybase, et al.)
All these implement SQL
Postgres today is dervived from the Stonebraker postgres code.
Object Oriented Databases (1980s)
Avoid “relational-object impedance mismatch” by tightly coupling objects and database.
Examples: Versant, ObjectStore, MarkLogic
Still exist today but there was no standard way to program these systems (no SQL)
Developed technologies to store data exmample JSON, XML fields.
Downsides
- Agregations are hard - leads to complex queries.
- No standard API
1990s Boring
No new inventions in database technolgies
New database products were created:
- SQLServer (a fork of sybase)
- MySQL writen as a replacement for mSQL
- Postgres gets SQL support
- SQLLite is created by someone in NC
2000s Internet boom
Big databases (Oracle, MsSQL) are too expensive to run at internet scale
OpenSource systems MySQL and Postgres did not support key features like Transactions
Companies wrote their own code to shard out accross database systems.
To bring the shards together many data wharehouses were created examples: Netezza, ParaaCcel, Greenplum, Vertica
Many were postgres forks.
These were
- scale out shared nothing
- closed source
- based on column stores.
2010s NoSQL Systems
Transiation to focus more on availablity rather the correctness
What is lost:
- Transactions
- SQL
Most famous: MongoDB, Cassandra,
A counter movement called NewSQL was created. Lets not give up the relational model, SQL or transactions. (Spanner class databases)
2010s Hybird Systems
HTAP - Hybrid Transaction Analytical Processing
Execute fast OLTP like a NewSQL but support OLAP like a Data wharehouse.
Examples: MEMSQL, Sap Hana, HyPer (tablue), Splice Machine, Peloton (dead)
2010s Cloud Systems
Database sytsems which are designed to run in the Cloud.
There are an evolution of the DBaaS concept and assume a “cloud” environment
- Slow Disks
- Unreliable Systems
Examples: Snowflake, Fauna, Spanner, DynamoDB
2010s Shared Disk Engines
Use a shared filesystem to store data (HDFS, S3, GFS) rather then
These are typically append only storage sytsems.
This is what people talk about when creating a data lake.
Exmaples: Apache Drill, Google Big Table, Redshift
Can scale up the disk and compute independently.
2010s Graph Systems
Andy has never seen a compelling arguement for a Graph Database.
The Graph API is sort of interesting, but why impelement this in the DB?
The research shows its not clear that this is an advantage even for this use case vs relational databases.
Examples: Neo4j
2010s Timeseries Databases
Specialized systems for storing time.
Its a good example of how you can optimize your database for a specific dataset type.
Seems to get you some wins at the storage level.
These are still relational db, its just cheating for a specific type.
Examples: Clickhouse, InfluxDB
Andy is interested in Clickhouse
2010s Rise of speicialized databases
Still relational but target at a specific workload (embedded, Multi-Model, Hardware Accellerated)
Takeways
We want to study the tradeoffs these systems make.
Specialized Database Systems will expand the scope of features they support.
The relational model is a quality system and should not be discarded.