Cassandra: The Definitive Guide
February 6, 2021
Chapter 1 - Beyond Relational Databases
Details what's difficult about scaling relational databases.
Relational databases are reviewed.
ACID
- Atomic: writes are all or nothing
- Consistent: The database moves from one correct state to another.
- Isolated: Transactions executing concurrently execute in their own “space”; concurrent transactions cannot interfere with each other by modifying the same data
- Durable: Once data is written it cannot be lost.
2 Phase Commit
- A transaction is done in a prepare and then commit phase to make it atomic
- In a distributed system this is complex, as the coordinator node must lock the resources on all hosts which hold the data.
- 2 Phase Commit is also blocking
Recommended reading: “Starbucks Does Not Use Two-Phase Commit” and Designing Data-Intensive Applications.
NoSQL stores succeed by relaxing ACID and working around the absence of two-phase commit
Chapter 2 - Introducing Cassandra
Operational Costs of Master/Replica Page 19
By avoiding master/replica we get a simpler architecture. How much extra engineering cost is Shopify paying to paper over this situation with MySQL and Redis?
Tunable consistency Page 20-21
Think of Cassandra as tunably consistent rather than eventually consistent. You get to choose your consistency/availability tradeoffs.
Strict Consistency: also known as sequential consistency. Every read will return the most recently written value. Requires a globally synchronized clock to determine the order of the writes.
Causal Consistency: Removes the global clock and attempts to use a more semantic approach. “Attempting to determine the cause of events to create some consistency in their order”
Weak (eventual) consistency: Writes will eventually propagate through the system.
Page 22: Cassandra chooses to always be writable, opting to defer the complexity of reconciliation to read operations.
Replication factor determines how many nodes receive the write. The consistency level is how many nodes must ack the read/write.
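A minimal sketch of where each lives (keyspace name made up): replication is set at keyspace creation, consistency per request (here via the cqlsh CONSISTENCY command):
CREATE KEYSPACE hotel
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CONSISTENCY QUORUM; -- cqlsh session setting; applies to subsequent reads/writes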
Page 23: CAP Theorem review
Page 26: “CAP Twelve Years Later” - the “pick two” framing is somewhat misleading, as systems such as Cassandra have made advances in partition recovery such as hinted handoff and read repair.
Page 27: Cassandra is a “Partitioned Row Store”. Data is stored in sparse multidimensional hash tables. “Sparse” means that for any given row you can have one or more columns. “Partitioned” means that each row has a unique partition key used to distribute the rows across multiple data stores. This is sometimes called a “wide column store”
Page 28: Originally Cassandra was schema-free, but later CQL was adopted and the underlying storage changed to closely align with it. You can change schema online, but it's different from the earlier schemaless Thrift interface.
Page 30 - 31: When to use Cassandra
- When you have a problem that will scale beyond a single instance DB
- When you require a lot of writes
- When you want geographic distribution (multiple datacenters)
Chapter 3 - Installing Cassandra
Untar the archive; have Java 8 or 11, either OpenJDK or Oracle JDK
$CASSANDRA_HOME and $JAVA_HOME are needed.
-f will run Cassandra in the foreground
cqlsh can be told what to connect to via its args (cqlsh [host [port]]); it will also use the CQLSH_HOST and CQLSH_PORT environment variables.
cqlsh is case insensitive
Keyspaces should be named with snake_case rather than camelCase
Text and varchar types are synonymous
Page 50: A table is created with first_name as the primary key and a last_name column. Then later a SELECT with a WHERE on last_name is done. This fails with an error because it may have unpredictable performance… Very interesting.
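Roughly what that looks like (sketch; assuming first_name is the primary key and last_name a regular column - the book's exact table may differ):
CREATE TABLE user (first_name text PRIMARY KEY, last_name text);
SELECT * FROM user WHERE last_name = 'Nguyen'; -- error: requires ALLOW FILTERING
SELECT * FROM user WHERE first_name = 'Mary';  -- fine: filters on the primary key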
Page 52: select count(*) from user also returns a warning, because this operation will be very expensive in Cassandra.
Page 53: There are official Docker images: see “cassandra”
Chapter 4 - The Cassandra Query Language
Page 57: A Cassandra row is a nested dictionary, with the primary key providing a pointer to the row.
{ 1: { name: alex, age: 39, weight: 180 },
  2: { name: sarah, age: 39 },
  3: { name: maddie, age: 16, weight: 125, height: 5.5 } }
1, 2, 3 are primary keys pointing to each row, which can have any number of columns
Cassandra lacks NULL; you just don't store that data.
“composite key” (or compound key) to represent groups of related rows, also called “partitions”
The composite key consists of a “partition key” plus an optional set of clustering columns.
The partition key determines which node will store a row, and can itself consist of multiple columns.
Clustering columns determine how data is sorted inside a partition. (see page 58)
A “static column” is for storing data that is not part of the primary key but is shared by every row in the partition.
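A sketch pulling these together (table and column names illustrative):
CREATE TABLE available_rooms_by_hotel_date (
  hotel_id text,          -- partition key: picks the node
  date date,              -- clustering column: sort order within the partition
  room_number smallint,   -- clustering column
  is_available boolean,
  hotel_name text static, -- static column: one value shared by the whole partition
  PRIMARY KEY ((hotel_id), date, room_number)
);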
Page 59:
- column = a name/value pair
- row = a container for columns referenced by a primary key
- partition = a group of related rows stored together on the same nodes
- table = a container for rows organized by partitions
- keyspace = a container for tables
- cluster = a container for keyspaces that spans one or more nodes
Page 61 - Data access requires a primary key
- SELECT, INSERT, UPDATE and DELETE all operate in terms of rows.
- For INSERT and UPDATE, all primary key columns must be specified (for UPDATE, via the WHERE clause) so the statement addresses a specific row.
- SELECT and DELETE can operate in terms of one or more rows within a partition, an entire partition, or even multiple partitions by using the WHERE and IN clauses.
Page 62: Insert, Update and Upsert. Since Cassandra uses an append model, there is no difference between insert and update. If you insert a row with the same primary key as an existing row, the existing row is replaced. This is called an upsert.
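Sketch (same hypothetical user table as above):
INSERT INTO user (first_name, last_name) VALUES ('Mary', 'Smith');
INSERT INTO user (first_name, last_name) VALUES ('Mary', 'Jones'); -- same key: replaces the row (upsert)
UPDATE user SET last_name = 'Lee' WHERE first_name = 'Wanda';      -- no existing row needed: creates one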
Page 63: Columns - timestamps and time to live
Timestamps:
- When you write to Cassandra, a timestamp in microseconds is generated for each column value inserted or updated. Internally Cassandra uses these timestamps for resolving any conflicting changes made to the same value. This is called “last write wins”
- writetime(column) reports when this column was written.
- You cannot use the writetime() function on primary key columns
- You can set timestamps via the USING TIMESTAMP clause (sketch below)
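Sketch (timestamp value made up):
SELECT first_name, last_name, writetime(last_name) FROM user; -- microseconds since epoch
UPDATE user USING TIMESTAMP 1567886623298243
  SET last_name = 'Boateng' WHERE first_name = 'Mary'; -- supply the write timestamp explicitly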
Time to Live (ttl) page 65
- ttl defaults to null (never expire)
- USING TTL will set the TTL on a column
- the ttl(column) function will report it
page 66: TTL is stored per column for non-primary key columns
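Sketch:
UPDATE user USING TTL 3600 SET last_name = 'McDonald' WHERE first_name = 'Mary'; -- expires in an hour
SELECT ttl(last_name) FROM user WHERE first_name = 'Mary'; -- seconds remaining; null means no TTL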
CQL Types page 66
Numeric types are as expected; they map to the native Java equivalents. Includes a decimal type
Textual types:
- text, varchar = the same: a UTF-8 string
- ascii = should only be used for legacy data.
Time and Identity types
- timestamp - encoded as a 64-bit signed int. Supports ISO 8601 dates as input
- date, time - added in 2.1 to support storing just the date or time of day, independent of a timestamp.
- uuid: a 128-bit value; uses a Type 4 (random) UUID. You can create a uuid via the CQL uuid() function
- timeuuid: a Type 1 UUID, which is based on the MAC address of the computer, the system time, and a sequence number to prevent duplicates. This is a “conflict-free” timestamp. There are a number of convenience functions for this type: now(), dateOf(), unixTimestampOf(). These helper functions mean that timeuuid is used more often than uuid (sketch below)
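Sketch (table is hypothetical):
CREATE TABLE user_visits (user_id text, visit_time timeuuid, PRIMARY KEY (user_id, visit_time));
INSERT INTO user_visits (user_id, visit_time) VALUES ('mary', now()); -- now() yields a fresh timeuuid
SELECT dateOf(visit_time) FROM user_visits WHERE user_id = 'mary';    -- extract the embedded timestamp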
Page 69: Primary keys are forever. You cannot change them without rewriting the data into a new table.
Other types:
- boolean
- blob - no validation is done on read/write of the blob contents
- inet - ipv4 or ipv6 address
- counter - a 64-bit signed int which can only be incremented or decremented. This allows for race-free incrementing across datacenters.
Collection Types (page 72)
Rather than storing multiple emails or addresses as, say, “email1”, “email2”, you can place this data in a collection type.
set
- stores a collection of elements, returns in sorted order, you can insert without reading first
- you remove from a set using the “-” operator, and null the set by assigning the empty set.
list
- ordered list stored in order of insertion
- elements are appended or prepended to the list using the “+” operator
- You can address elements in the list by index with the “[ ]” operator (e.g., [1])
- lists tend to be expensive because updating or deleting an item requires traversing the whole list
map
- contains a collection of key/value pairs; the values can be any type besides counter (update sketches for all three collection types below)
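Update syntax sketches, assuming the user table has emails set<text>, phone_numbers list<text>, and logins map<text, text> columns (all made up):
UPDATE user SET emails = emails + {'mary@example.com'} WHERE first_name = 'Mary';        -- add to a set
UPDATE user SET emails = emails - {'old@example.com'} WHERE first_name = 'Mary';         -- remove from a set
UPDATE user SET phone_numbers = phone_numbers + ['555-1234'] WHERE first_name = 'Mary';  -- append to a list
UPDATE user SET phone_numbers[0] = '555-9999' WHERE first_name = 'Mary';                 -- index into a list
UPDATE user SET logins = logins + {'web': '2021-02-06'} WHERE first_name = 'Mary';       -- add a map entry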
Tuples: Page 75
A tuple is a fixed-length record. In practice it is not often used, as the field names are not stored.
example:
address tuple<text, text, text, int> for (street, city, state and zip)
User-Defined Types - page 76
Similar to tuples but you can specify the value names for each position.
example:
CREATE TYPE address (
  street text,
  city text,
  state text,
  zip_code int
);
User-defined types are scoped to their keyspace, and you can see them via the DESCRIBE KEYSPACE command
You cannot nest user-defined types into collections without freezing them, which serializes them into a blob.
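For example (sketch, reusing the address type above):
ALTER TABLE user ADD addresses map<text, frozen<address>>; -- the UDT must be frozen inside a collection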
Chapter 5 - Data Modeling
This is the hotel example from the Cassandra docs done again
Design Differences Between RDBMS and Cassandra (Page 83)
No Joins
- You have to execute joins in the client or create a second denormalized table that represents the join
- It's preferred to create the second table and duplicate the data; joins in the client should be the rare case
No referential Integrity
- Lightweight transactions and batches exist, but Cassandra has no concept of referential integrity across tables.
- You will likely have related ids between tables, but ideas like cascading deletes and foreign keys don't exist
Denormalization: Cassandra will have tables with duplicated data.
Query First design: Model the queries first and let the data be organized around them. Think of the most common query paths your application will use and then create the tables that you need to support them.
Designing for optimal storage
- how the data is stored on disk matters in Cassandra
- minimize the number of partitions that must be searched in order to satisfy a given query.
sorting is a design decision
- The sort order on queries is fixed and determined entirely by the selection for clustering columns you supply in the CREATE TABLE command.
- You can add ORDER BY, but it's really just symbolic: you can only order by the clustering columns, in their declared order or its reverse (sketch below)
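Sketch (names made up) - the clustering order is baked in at CREATE TABLE time:
CREATE TABLE reservations_by_hotel (
  hotel_id text,
  start_date date,
  room_number smallint,
  PRIMARY KEY ((hotel_id), start_date, room_number)
) WITH CLUSTERING ORDER BY (start_date DESC, room_number ASC);
-- a SELECT's ORDER BY may only use this declared order or its exact reverse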
Page 88: Chebotko Diagrams are used to help you understand the design.
Page 90: They need to query over a range of dates so the date needed to be included in the clustering key.
Patterns and Anti-Patterns
“wide row/partition” - group multiple related rows in a partition in order to support fast access to multiple rows within the partition in a single query
“time series” - a series of measurements at specific time intervals is stored in a wide partition, where the measurement time is used as part of the partition key. This is typically used with sensor data. You could use this to store all writes to a customer's bank account and then allow the application to calculate the customer's balance, rather than using update+transactions.
It's a bad idea to use rows as queues. Don't do it, as it will pollute the LSM segments with tombstones, which are terrible for performance.
Cassandra is not good at deletes - keep this in mind
First we design the logical model, then the physical model.
Using UUIDs as references Page 95
The authors indicate it may make sense to use UUIDs to identify things such as hotel_id; for the purposes of simplicity they don't, and rather use a text field with a human-generated hotel code in it. They do recommend you consider using UUIDs in a real-world application to reduce coupling.
Pages 95 and 96 lay out the “physical” model of the hotels and reservations keyspaces.
What's interesting is that they put these in two different keyspaces. Why? The authors note that User Defined Types (UDTs) cannot be shared between keyspaces, so you have to declare them twice.
Calculating Partition Size page 97
You should not size a partition larger than 100,000 cells (a max of 2 billion is supported)
To calculate the size of a partition:
Number of cells = Number of Static Columns + Number of Rows × (Number of Columns − Number of Primary Key Columns − Number of Static Columns)
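Made-up worked example: 9 columns, 3 of them primary key columns, 1 static, 10,000 rows per partition → 1 + 10,000 × (9 − 3 − 1) = 50,001 cells, comfortably under the 100,000 guideline.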
Tables can be altered (growing the number of columns), so constrain the number of rows in a partition to keep this sane
Calculating Size on Disk page 98
They changed the storage engine in 3.0 and the SSTable files take less space.
To break up large partitions, add a column to the partition key. You can “bucket” the data into larger buckets, for example adding another column with a coarser granularity, e.g. replacing a day column with a month column (sketch below)
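Sketch of the bucketing idea (sensor tables made up):
-- one partition per sensor grows without bound:
CREATE TABLE readings_by_sensor (
  sensor_id text,
  reading_time timestamp,
  value double,
  PRIMARY KEY ((sensor_id), reading_time)
);
-- bucketing by month caps partition size:
CREATE TABLE readings_by_sensor_month (
  sensor_id text,
  month text, -- e.g. '2021-02'
  reading_time timestamp,
  value double,
  PRIMARY KEY ((sensor_id, month), reading_time)
);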
Chapter 6: The Cassandra Architecture
Cassandra is datacenter and rack aware.
Gossip protocol allows each node to keep track of state information about the other nodes. The Gossiper runs every second.
How Gossiper works (page 109)
- Once every second the gossiper will contact a random node.
- It will send its chosen friend a DigestSyn message
- The friend will respond with DigestAck
- The initiator after receiving the ack will respond with DigestAck2
If the endpoint does not respond it will be marked as dead (convicted) and placed in a local list.
Cassandra uses accrual failure detection.
A node is scored as to its likelihood of liveness (“suspicion”). The algorithm is adaptive to problematic networks. The “phi convict threshold” determines if the node has failed. This tunable is not linear. In general it will take Cassandra 10 seconds to determine that a node has failed.
Snitches Page 110
Snitches are used to determine the cluster topology for read operations.
The “SimpleSnitch” is topology unaware.
Rings and Tokens
Cassandra represents the data managed by a cluster as a ring. Each node in the ring is assigned one or more ranges of data described by a token, which determines its position in the ring.
A node claims ownership of the range of values less than or equal to each token and greater than the last token of the previous node, known as a token range.
CQL provides a token() function that can be used to request the value of the token corresponding to a partition key.
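e.g. (using the hypothetical user table from chapter 3):
SELECT first_name, token(first_name) FROM user; -- the token the partitioner derived from each partition key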
Virtual nodes (vnodes) break the token range up into smaller ranges, and each node is assigned multiple tokens. You can allocate more tokens to larger machines via the num_tokens config option.
Partitioners page 113
The partitioner determines how data is distributed to nodes.
The partitioner is a hash function for computing the token of a partition key.
In practice the partitioner is not often changed and cannot be changed after the cluster has been initialized.
Replication Strategies page 114
A node serves as a replica for different ranges of data. Cassandra replicates data across nodes in a manner transparent to the user. The replication factor is the number of nodes in your cluster that will receive copies of the same data.
Consistency Levels Page 115
You specify the consistency level on each read and write query. It determines how many replicas must respond to the request for it to succeed.
To achieve strong consistency in Cassandra: Read Replica Count + Write Replica Count > Replication Factor (e.g., RF = 3 with QUORUM reads and writes: 2 + 2 > 3)
Note: Replication Factor and Consistency Level are not the same. The replication factor is set on the keyspace; the consistency level is set on the request.
Coordinator Node - the node which receives and services the read or write request.
Hinted Handoff page 118
When a node is offline and the coordinator node receives a write, the coordinator can record a hint, which is to say: hold on to this write until the node has recovered. When the node comes back online, the coordinator sends the write on.
Hinted writes do not count toward the consistency level (except at ANY).
Hinted writes cannot be used for reads.
If too many hinted writes build up, then when the failed node comes online it can be overwhelmed/flooded with write requests from various coordinators. The number of hinted writes that can be stored on a coordinator node is limited to prevent this.
Anti-Entropy & Read Repair page 119
As reads happen, disagreeing replicas are brought back in sync (read repair).
Anti-entropy, or manual repair, is initiated by an operator. It causes Cassandra to run a validation compaction
The nodes trade Merkle Trees with neighboring replicas to check their table data.
In a merkle tree every parent node is a hash of all its direct child nodes.
Lightweight Transactions and Paxos page 120
A lightweight transaction provides “linearizable consistency”, which means no other client can modify the same cell between the read and write of our read-then-write operation.
LWTs use an extension of Paxos to do this and avoid two-phase commit. An LWT requires four round trips between the coordinator node and its replicas, meaning it is very expensive.
LWTs are limited to a single partition.
Memtables, SSTables and Commit Logs Page 122
On write, data is first written to a commit log, which will be used in crash recovery. The commit log is only read during node recovery operations after a crash. (Whether the commit log is used is controlled by the “durable_writes” setting on the keyspace.)
After writing to the commit log, the write is inserted into the memtable, an in-memory table containing data for a specific table.
When the memtable gets large it is dumped to disk into a file called an SSTable and a new memtable is started. Multiple memtables can exist in memory for a single table, with some waiting to be flushed.
After the memtable finishes flushing to disk, a bit is inserted into the commit log to indicate that the data for that memtable in the commit log is no longer needed.
All writes in cassandra are append operations and sequential.
Compaction performs a merge step of a merge sort on SSTables and removes the old files on success.
Bloom Filters Page 124
A bloom filter is an algorithm to determine if an element is a member of a set. It can provide false positives but not false negatives. On positive the value is searched in the SSTables.
This is used to reduce disk access on key lookups.
Caches page 125
Key Cache - Faster read access into SSTables on disk, stored in JVM heap
row cache - caches frequently accessed rows, stored off the JVM heap
chunk cache - added in 3.6 to store uncompressed chunks of SSTables to speed up access (off the JVM heap)
counter cache - improves counter type performance
Row Caching is disabled by default.
Compaction page 125
Compaction occurs to merge SSTables. During compaction the data in the SSTables is merged: columns are combined, obsolete values are discarded, and a new index is created.
An SSTable consists of multiple files, including the Data, Index, and Filter files.
On compaction a new SSTable is created from the merged tables.
Compaction improves performance by reducing the need to search old SSTables for values.
The compaction strategy used is set per table; examples are:
- Size Tiered (STCS) - the default; best for write-centric tables
- Leveled (LCS) - best for read-centric tables
- Time Window (TWCS) - best for time series data
Major compaction, or full compaction, is a feature in Cassandra for consolidating multiple SSTables into a single SSTable. The feature still exists but is not typically useful in newer releases of Cassandra.
Deletion and Tombstones page 127
Data in Cassandra is not deleted in place (the SSTables are immutable); rather, a tombstone is placed in the memtable to indicate that the data is deleted. This is akin to a soft delete in an RDBMS.
During compaction tombstones are removed once they pass the age of gc_grace_seconds which by default is 10 days.
If a failed node is down longer than gc_grace_seconds it should not be added back into the cluster, as it will contain deleted data whose tombstones have already been collected elsewhere, resurrecting it.
System Keyspaces page 133
Cassandra stores metadata about itself in system keyspaces. These can be seen via DESCRIBE TABLES.
Chapter 7 - Designing Applications with Cassandra
In this chapter the authors lay out a microservices-based design for the hotel application from chapter 5. The authors recommend you read “Building Microservices” by Sam Newman and “Domain-Driven Design” by Eric Evans.
They focus on three ideas first:
- Encapsulation - microservices should not use the database as an integration point
- Autonomy - each microservice should be independently deployable; each should have its own datastore
- Scalability - different services should be able to be scaled independently
You identify “Bounded Contexts”, which are groups of related entities. In the case of the hotel application it is divided into two bounded contexts: the “Hotel Domain” and the “Reservation Domain”
In each context you break the Domains into services which will each own specific groups of tables.
With cassandra a natural approach is to assign denormalized tables representing the same basic data type to the same service.
Pay attention to coupling and cohesion:
- Tables owned by a service should have a high degree of relatedness (cohesion)
- There should be a low degree of coupling between contexts
Representing database models in CQL
Key-Value models
The key is the partition key; the remaining data can be stored in a value column as a text or blob type. Keep values at <5 MB.
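Sketch:
CREATE TABLE kv_store (
  key text PRIMARY KEY,
  value blob -- opaque to Cassandra; keep under ~5 MB
);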
Document Models
Option One, Flexible Schema: store non-primary key columns in a blob. This is not a good solution.
Option Two: use CQL support for reading and writing data in JSON via INSERT JSON and SELECT JSON -> This requires that all the referenced attributes are defined in the table schema.
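Sketch (user table as before):
INSERT INTO user JSON '{"first_name": "Mary", "last_name": "Nguyen"}'; -- attributes must be defined columns
SELECT JSON * FROM user WHERE first_name = 'Mary';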
Graph Models
Some Datastax thing is recommended…
As your application evolves you will end up with more denormalized views of the data.
These may get expensive to maintain and you may want to look into alternatives.
Secondary Indexes page 144
You can place a secondary index on a column to avoid using ALLOW FILTERING and reading all rows across all nodes.
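Sketch (assuming last_name is a non-key column on user):
CREATE INDEX ON user (last_name);
SELECT * FROM user WHERE last_name = 'Nguyen'; -- no longer needs ALLOW FILTERING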
Secondary Index Pitfalls page 147
Secondary indexes are slower and more expensive than normal queries. Do not use secondary indexes with:
- Columns with High Cardinality
- Columns with very low cardinality - fat rows/partitions
- Columns that are frequently updated or deleted - “generate errors if the amount of deleted data builds up quicker than compaction can handle it”
It is preferred to use denormalization or materialized views over secondary indexes
SASI: A new secondary index implementation page 147
- Cassandra 3.4 added an experimental SI implementation called “SSTable Attached Secondary Index”
- SASI indexes are calculated and stored as part of each SSTable file.
- (The original SI implementation in Cassandra stores SIs in “hidden” tables.)
- SASIs support inequality and LIKE-based searches (sketch below)
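Sketch (index name and table assumed):
CREATE CUSTOM INDEX user_last_name_sasi ON user (last_name)
  USING 'org.apache.cassandra.index.sasi.SASIIndex';
SELECT * FROM user WHERE last_name LIKE 'N%'; -- prefix LIKE works with the default SASI mode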
Materialized Views (page 148)
Materialized views are the idea of having Cassandra maintain the denormalized views of your data rather than doing it in your application.
They are created with the “CREATE MATERIALIZED VIEW” command, and they use a SELECT … FROM … WHERE to build the view.
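Sketch (view name made up; every primary key column must get an IS NOT NULL filter):
CREATE MATERIALIZED VIEW users_by_last_name AS
  SELECT * FROM user
  WHERE last_name IS NOT NULL AND first_name IS NOT NULL
  PRIMARY KEY (last_name, first_name);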
There are a number of unfixed bugs with materialized views, such as:
- Limitations on the selection of primary key columns and filters
- Aggregates in materialized views are not supported
- Materialized views were a big part of the 3.0 release and drove significant design changes, such as the new storage engine. Most of the major fixes in the 3.x releases are around MVs. GLHF
NOTE: As of 2018 TLP recommended disabling materialized views in production!!!
They have implemented the reservation service in Java you can read it at https://github.com/jeffreycarpenter/reservation-service
When you create microservices it is advised to use one keyspace per microservice
How do you maintain data integrity between microservices? (page 154)
Patterns
- Add an orchestration service between the two data services. The orchestration service owns no datastore and interacts with the CRUD services to complete a business process. The example in the book is of the booking service orchestrating between the reservation and inventory services.
- Add a message queue such as Kafka to create a stream of change events which will be consumed asynchronously by the downstream services. The inventory service might choose to subscribe to events related to reservations to make corresponding changes in inventory. This is called “choreography”.
Both these patterns are distributed systems themselves and bring their own consistency tradeoffs. To work around this you can try:
- using a distributed transaction framework like Scalar DB
- Using an analytics tool like Apache Spark to check & fix data as a background task
- Use a CDC outbox pattern
The “KillrVideo” application is provided as an example. See github.com/killrvideo
Chapter 8 Application Development with Drivers
Cassandra has a history of drivers from its old Thrift protocol; most of these are now retired.
DataStax used to provide proprietary drivers, but these were merged into the open source drivers in 2019.
CqlSessions are expensive, as each maintains TCP connections to multiple nodes. Use a single CqlSession throughout your application and reuse it rather than doing setup/teardown. Use a single session (not one per keyspace) with applications that use multiple keyspaces.
PreparedStatements are more efficient than SimpleStatements, and more secure
The driver is configured with a specific LoadBalancingPolicy to keep from sending all traffic to a specific coordinator node. The policy has these options
- Round-Robin
- Token Awareness - when using prepared statements, the policy uses the token value of the partition key to select an optimal node for the query (TokenAwarePolicy)
- Datacenter Awareness - the local DC must be configured in the driver; the driver will then prefer nodes in this DC.
There are settings for query retry (RetryPolicy and RetryDecision)
The driver supports Speculative Execution which can send the query to two nodes and see if one responds first.
By default the driver creates a single connection per node.
Node Discovery Page 178
The first node the driver connects to maintains a “control connection”, which is used to maintain information about the state and topology of the cluster. This connection is used to discover the other nodes in the cluster.
Schema Access Page 179
The schema itself is eventually consistent and different nodes can see different schemas at different times.
Chapter 9 - Writing and Reading Data
Writes are sent with a consistency level.
- ANY - Allows for even a hint to count as a successful write
- LOCAL_ONE - the node accepting the write must be in the local DC
- LOCAL_QUORUM - a majority of replicas in the local DC have received the write
- EACH_QUORUM - QUORUM of nodes in each DC
The Write Path (page 187)
Coordinator sends simultaneous write requests to all local replicas.
If the cluster spans multiple DCs the local coordinator node selects a remote coordinator in each of the other DCs to forward the write to the replicas in that datacenter.
Each of the remote replicas acks the write directly to the original coordinator.
Once the coordinator receives acks from the desired number of replicas, it acks the write to the client.
Once a replica receives a write
- The data is journalled to the commit log
- The data is written to a memtable
- The row cache is invalidated (if used)
- If the memtable or commit log is “filled” by the write, a flush is scheduled to run
- The write is then acked to the client
- If needed, the node executes the flush. The contents of the memtables are written to SSTables on disk and the commit log is cleared.
- Compaction tasks are scheduled and a compaction is done if needed.
This is the simplest write path, which does not involve the special counter type or materialized views.
Commit Logs Page 189
- found in data/commitlog
- named CommitLog-<version>.log, where version is an integer for the commit log format, e.g. for 4.0 it's 7
SSTable files
- found in data/data
- The SSTable files for each table are put in a subdirectory here, named with a UUID
- The filenames are <version>-<generation>-<implementation>-<component>.db
- version: two characters for the version of the SSTable format. 4.0 is “na”
- generation - an index number that is incremented every time a new SSTable is created for a table.
- implementation - a reference to the implementation interface in use. “big” is the bigtable format
The SSTable consists of multiple file components
- Data.db - stores the actual data; these are the only files which are backed up.
- CompressionInfo.db - compression metadata on Data.db
- Digest.crc32
- Filter.db - The bloom filter for the SSTable
- Index.db - row and column offsets within the corresponding *-Data.db file for reading the file
- Summary.db - A sample of the index for faster reads
- Statistics.db - Stores stats on the SSTable
- TOC.txt lists the file components for this SSTable
Lightweight Transactions Page 191
Common LWT Semantics
- INSERT … IF NOT EXISTS - this blocks the upsert to ensure a value is unique.
- UPDATE … IF - used to make sure that a row has an expected value before the write occurs. This is frequently used to manage inventory counts (sketches below)
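Sketches (inventory table made up; both return an [applied] flag):
INSERT INTO user (first_name, last_name) VALUES ('Mary', 'Nguyen') IF NOT EXISTS;
UPDATE available_rooms SET rooms_free = 4 WHERE hotel_id = 'AZ123' IF rooms_free = 5;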
LWT can be used on schema creation. This is useful to script multiple schema updates.
LWTs select a serial consistency level
- SERIAL default, a quorum of nodes must respond
- LOCAL_SERIAL - a quorum of nodes in the local DC must respond.
LWT are limited to a single partition.
Batches Page 194
Allows you to group multiple modifications into a single statement.
- Only allows modification statements.
- Can be logged or unlogged; logged batches have more safeguards
- Batches ARE NOT TRANSACTIONS, but you can include LWTs in a batch. The LWTs in a batch must apply to the same partition
- Counter modifications require a special batch known as a counter batch
Used for:
- making multiple updates to a single partition
- keeping tables in sync, for example making modifications to denormalized tables that store the same data for different access patterns
NOTE: Batches are not for bulk loading!!! They are actually slow
Logged batches are atomic (as such they are sometimes called atomic batches)
Batches are limited by a size parameter in KB in cassandra.yaml (50 KB is the default)
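Sketch - keeping two denormalized tables in sync with a (logged) batch; table/column names as in the chapter 15 example later:
BEGIN BATCH
  INSERT INTO hotels (id, name) VALUES ('AZ123', 'Super Hotel');
  INSERT INTO hotels_by_name (name, id) VALUES ('Super Hotel', 'AZ123');
APPLY BATCH;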
Reading - Page 197
When reading higher consistency levels give you greater confidence you are reading the most recent data.
When multiple read values are sent to the coordinator and they have different timestamps, the newest value wins and is sent to the client. The out-of-date replica is updated in an operation called a read repair.
Read Consistency Levels
- ONE, TWO, THREE - Return the record held by the first node(s) that respond to the query. The record is checked against the same record on the other replicas; if any are out of date, a read repair is performed
- LOCAL_ONE - similar to ONE, with the requirement that the responding node is in the local DC
- QUORUM - Query all nodes; once a majority of replicas respond, return to the client the value with the most recent timestamp. Then, if needed, read repair any remaining replicas
- LOCAL_QUORUM - similar to quorum where the responding nodes are in the local DC
- EACH_QUORUM - a QUORUM in each DC
- ALL - Like it sounds, fail the read if any don’t respond.
The Read Path Page 199
The coordinator sends a read request to the fastest replica (perhaps itself). The fastest replica is determined by the snitch. A digest request is sent to the two other replicas.
A digest request returns a digest of the data rather than the actual data.
The digest of the data received from the full-read replica is calculated and compared with the digests received from the other replicas. If the digests match, the data is returned to the client. If the digests differ, a read repair must be done.
The replica read order is
- Row Cache
- Key Cache
- memtable
- SSTables
There is only a single active memtable for a given table, but there may be MANY SSTables for a single table.
Optimizations for searching SSTables:
- A bloom filter is used to determine if the requested partition does not exist in a given SSTable. This fast-fails the search on some of the tables.
- Next the key cache is checked to see if the offset into the file is cached. The key cache is a map structure in which keys are a combination of the SSTable file descriptor and partition key, and values are offset locations into SSTable files. This helps reduce seeks within the SSTable files.
- Finally, if no offset is found in the key cache, a two-level index stored on disk is used. The first level is the partition summary, which is used to obtain an offset for searching for the partition key within the second-level index, aka the “partition index”. The partition index is where the offset into the SSTable for the partition key is stored.
The SSTables are read at the desired offset. Once data is obtained from all SSTables, Cassandra merges the SSTable data by selecting the value with the latest timestamp for each requested column.
The result is cached in the row cache.
Read Repair page 201
On read repair the coordinator makes a full read request to all replicas.
The coordinator merges the data by selecting a value for each requested column:
- Latest timestamp wins
- Lexicographical Lottery if the timestamps are the same and the values differ
Read repair can be done AFTER returning data to the client on low consistency levels.
Transient Replicas page 202
You can add “transient replicas” to the cluster to catch writes when some replicas are offline. Writes are then sent back to the offline replicas when they recover. This is an experimental feature in the 4.0 release.
Transient replicas break: read repair, batches, LWTs, counters, secondary indexes, and materialized views
Paging page 208
Paginating results is supported in newer versions using the driver. The LIMIT clause cannot be used for pagination; it just returns a limited number of records
Deleting Page 208
When data is deleted, a tombstone is placed in the SSTable file to indicate it has been deleted. This allows nodes that are down when data is deleted to learn that the data has been deleted.
After the default of 10 days the data will be dropped from the SSTables during compaction.
The more tombstones you have the longer it takes cassandra to read data.
Techniques to minimize delete impact:
- Avoid writing NULL into your tables as this is interpreted as a delete.
- Delete data at the largest granularity you can, ideally an entire partition at once!
- Exercise care when updating: avoid replacing the entire contents of a list, set, or map. Instead update only the elements you need to modify
- If you set TTL cassandra will expire data automatically
- On time series data use TimeWindowCompactionStrategy
Chapter 10 - Configuring and Deploying Cassandra
(skimmed)
Cassandra Cluster Manager (ccm) is a set of Python scripts for running a Cassandra cluster on a single machine.
A Cassandra cluster has a name, and that name is written into the data files. If you start up the cluster with a different name it will fail to read the data files and shut down.
Seed Nodes
A seed node is used as a contact point for other nodes so Cassandra can learn the topology of the cluster - that is, which hosts have which ranges. Seed nodes are used for “bootstrapping”
You should have at least two seed nodes per DC
Snitches Communicate the Cluster Rack/DC Layout
- Simple - a flat cluster
- PropertyFile - just hardcode the nodes into a file and copy it to all nodes
- Gossip - Put the rack and dc value for each node in a file and the gossiper passes it around.
- RackInferring - looks at IPs
- DynamicEndpoint
- Cloud Snitches (Ec2Snitch, Gcp…)
Chapter 11 - Monitoring
(skimmed)
JMX
MBeans
Nodetool (which uses Mbeans)
Virtual Tables
Logging
- system.log - INFO messages or greater; cluster membership changes
- debug.log - DEBUG and above; flushing and compaction
- gc.log - JVM GC, standard JVM log
There is a full query log you can enable in 4.0 to see all the queries.
Chapter 12 - Maintenance
Skimmed/skipped - lots of detail on repair.
Chapter 13 - Performance Tuning
Set a well defined performance goal that contains both a throughput and latency goal like:
30,000 read operations per second from table blah at 5 ms 99th-percentile latency.
Cassandra Stress Page 304
- cassandra-stress comes with cassandra
- write and read options that produce some fun output.
- You can define a yaml file to have cassandra-stress run on your own tables you’ve created.
tlp-stress
- From “The Last Pickle”; similar to cassandra-stress but with a simpler syntax.
- Has built-in workloads for common data models such as time series and key-value.
- Has options for testing secondary indexes, materialized views, LWTs, and ALLOW FILTERING.
NoSQLBench
- Was called DSBench
- Created by DataStax
- Extensible tool for benchmarking complex workloads
“nodetool proxyhistograms” - helps find slow coordinator nodes
“nodetool tablehistograms” - helps find large partitions
The biggest key to performance is table design!
You can configure individual nodes to trace some or all of their queries via nodetool settraceprobability
Chapter 14 - Security
Skimmed.
Chapter 15 - Migrating and Integrating
Moving a RDBMS application to cassandra requires translating the data models.
You can do this indirectly, by reverse engineering the application and determining its data access paths, or directly, by translating the database's entity-relationship model.
When you directly translate the model you will likely look at the nouns and generate similar tables, but when you look at the tables you will likely see secondary indexes that indicate the other access patterns for the data. You will need to denormalize your data so that you can support these query patterns.
In the book, the example of a hotels table with a secondary index on name was transposed into two tables: a “hotels” table with a primary key of hotel_id, and a “hotels_by_name” table with a partition key of name and a clustering key of hotel_id. The address and phone data is duplicated in the two tables.
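A sketch of those two tables (column set simplified):
CREATE TABLE hotels (
  id text PRIMARY KEY,
  name text,
  address text,
  phone text
);
CREATE TABLE hotels_by_name (
  name text,
  id text,
  address text,
  phone text,
  PRIMARY KEY ((name), id) -- name as partition key, id as clustering column
);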
Things like addresses can be supported via User Defined Types
After translating your “noun” tables to query-centric tables, you'll need to remodel your JOIN tables. Since a JOIN table is bidirectional, you'll likely end up with two denormalized tables. In the example in the book, a RoomToAmenity table is converted into amenities_by_room and rooms_by_hotel.
Paper: A Big Data Modeling Methodology for Apache Cassandra by Artem Chebotko and Andrey Kashlev
Answering your developers' questions:
Qs:
- How can I make sure that I can read data immediately after it is written?
- How can I avoid race conditions when inserting or updating a row?
- How can I read data efficiently without joins?
As:
- First, use consistency levels for strong consistency via QUORUM or LOCAL_QUORUM
- Second, you can use batches to coordinate writes to multiple tables, but keep batch sizes small. LWTs allow you to maintain uniqueness when performing INSERTs
- Finally, you have to use denormalization to avoid joins.