Cassandra: The Definitive Guide
February 6, 2021
Chapter 1 - Beyond Relational Databases
Details what's difficult about scaling relational databases.
Relational databases are reviewed.
ACID
- Atomic: writes are all or nothing
- Consistent: The database moves from one correct state to another.
- Isolated: Transactions executing concurrently execute in their own “space”; concurrent transactions cannot interfere with each other by modifying the same data
- Durable: Once data is written it cannot be lost.
2 Phase Commit
- A transaction is done in a prepare and then commit phase to make it atomic
- In a distributed system this is complex, as the coordinator node must lock the resources on all hosts which hold the data.
- 2 Phase Commit is also blocking
Recommended reading: “Starbucks Does Not Use Two-Phase Commit” and Designing Data-Intensive Applications.
NoSQL stores succeed by relaxing ACID and working around the absence of two-phase commit
Chapter 2 - Introducing Cassandra
Operational Costs of Master/Replica Page 19
By avoiding master/replica we get a simpler architecture. How much extra engineering cost is Shopify paying to paper over this situation with MySQL and Redis?
Tunable consistency Page 20-21
Think of Cassandra as tunably consistent rather than eventually consistent. You get to choose your consistency/availability tradeoffs.
Strict Consistency: also known as sequential consistency. Every read will return the most recently written value. Requires a globally synchronized clock to determine the order of the writes.
Causal Consistency: Removes the global clock and attempts to use a more semantic approach. “Attempting to determine the cause of events to create some consistency in their order”
Weak (eventual) consistency: Writes will eventually propagate through the system.
Page 22: Cassandra chooses to always be writable, opting to defer the complexity of reconciliation to read operations.
Replication factor determines how many nodes receive the write. The consistency level is how many nodes must ack the read/write.
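A minimal sketch of where each lives (keyspace name made up): replication is set at keyspace creation, consistency per request (here via the cqlsh CONSISTENCY command):
CREATE KEYSPACE hotel
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CONSISTENCY QUORUM; -- cqlsh session setting; applies to subsequent reads/writes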
Page 23: CAP Theorem review
Page 26: “CAP Twelve Years Later” - the “pick two” framing is somewhat misleading, as systems such as Cassandra have made advances in partition recovery such as hinted handoff and read repair.
Page 27: Cassandra is a “Partitioned Row Store”. Data is stored in sparse multidimensional hash tables. “Sparse” means that for any given row you can have one or more columns. “Partitioned” means that each row has a unique partition key used to distribute the rows across multiple data stores. This is sometimes called a “wide column store”
Page 28: Originally Cassandra was schema-free, but later CQL was adopted and the underlying storage changed to closely align with it. You can change schema online, but it's different from the earlier schemaless Thrift interface.
Page 30 - 31: When to use Cassandra
- When you have a problem that will scale beyond a single instance DB
- When you require a lot of writes
- When you want geographic distribution (multiple datacenters)
Chapter 3 - Installing Cassandra
Untar the archive; have Java 8 or 11, either OpenJDK or Oracle JDK
$CASSANDRA_HOME and $JAVA_HOME are needed.
-f will run Cassandra in the foreground
cqlsh can be told what to connect to via its args (cqlsh [host [port]]); it will also use the CQLSH_HOST and CQLSH_PORT environment variables.
cqlsh is case insensitive
Keyspaces should be named with snake_case rather than camelCase
Text and varchar types are synonymous
Page 50: A table is created with first_name as the primary key and a last_name column. Then later a SELECT with a WHERE on last_name is done. This fails with an error because it may have unpredictable performance… Very interesting.
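Roughly what that looks like (sketch; assuming first_name is the primary key and last_name a regular column - the book's exact table may differ):
CREATE TABLE user (first_name text PRIMARY KEY, last_name text);
SELECT * FROM user WHERE last_name = 'Nguyen'; -- error: requires ALLOW FILTERING
SELECT * FROM user WHERE first_name = 'Mary';  -- fine: filters on the primary key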
Page 52: select count(*) from user also returns a warning, because this operation will be very expensive in Cassandra.
Page 53: There are official Docker images: see “cassandra”
Chapter 4 - The Cassandra Query Language
Page 57: A Cassandra row is a nested dictionary, with the primary key providing a pointer to the row.
{ 1: { name: alex, age: 39, weight: 180 },
  2: { name: sarah, age: 39 },
  3: { name: maddie, age: 16, weight: 125, height: 5.5 } }
1, 2, 3 are primary keys pointing to each row, which can have any number of columns
Cassandra lacks NULL; you just don't store that data.
“composite key” (or compound key) to represent groups of related rows, also called “partitions”
The composite key consists of a “partition key” plus an optional set of clustering columns.
The partition key determines which node will store a row, and can itself consist of multiple columns.
Clustering columns determine how data is sorted inside a partition. (see page 58)
A “static column” is for storing data that is not part of the primary key but is shared by every row in the partition.
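A sketch pulling these together (table and column names illustrative):
CREATE TABLE available_rooms_by_hotel_date (
  hotel_id text,          -- partition key: picks the node
  date date,              -- clustering column: sort order within the partition
  room_number smallint,   -- clustering column
  is_available boolean,
  hotel_name text static, -- static column: one value shared by the whole partition
  PRIMARY KEY ((hotel_id), date, room_number)
);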
Page 59:
- column = a name/value pair
- row = a container for columns referenced by a primary key
- partition = a group of related rows stored together on the same nodes
- table = a container for rows organized by partitions
- keyspace = a container for tables
- cluster = a container for keyspaces that spans one or more nodes
Page 61 - Data access requires a primary key
- SELECT, INSERT, UPDATE and DELETE all operate in terms of rows.
- For INSERT and UPDATE, all primary key columns must be specified (for UPDATE, via the WHERE clause) so the statement addresses a specific row.
- SELECT and DELETE can operate in terms of one or more rows within a partition, an entire partition, or even multiple partitions by using the WHERE and IN clauses.
Page 62: Insert, Update and Upsert. Since Cassandra uses an append model, there is no difference between insert and update. If you insert a row with the same primary key as an existing row, the existing row is replaced. This is called an upsert.
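Sketch (same hypothetical user table as above):
INSERT INTO user (first_name, last_name) VALUES ('Mary', 'Smith');
INSERT INTO user (first_name, last_name) VALUES ('Mary', 'Jones'); -- same key: replaces the row (upsert)
UPDATE user SET last_name = 'Lee' WHERE first_name = 'Wanda';      -- no existing row needed: creates one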
Page 63: Columns - timestamps and time to live
Timestamps:
- When you write to Cassandra, a timestamp in microseconds is generated for each column value inserted or updated. Internally Cassandra uses these timestamps for resolving any conflicting changes made to the same value. This is called “last write wins”
- writetime(column) reports when this column was written.
- You cannot use the writetime() function on primary key columns
- You can set timestamps via the USING TIMESTAMP clause (sketch below)
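Sketch (timestamp value made up):
SELECT first_name, last_name, writetime(last_name) FROM user; -- microseconds since epoch
UPDATE user USING TIMESTAMP 1567886623298243
  SET last_name = 'Boateng' WHERE first_name = 'Mary'; -- supply the write timestamp explicitly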
Time to Live (ttl) page 65
- ttl defaults to null (never expire)
- USING TTL will set the TTL on a column
- the ttl(column) function will report it
page 66: TTL is stored per column for non-primary key columns
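Sketch:
UPDATE user USING TTL 3600 SET last_name = 'McDonald' WHERE first_name = 'Mary'; -- expires in an hour
SELECT ttl(last_name) FROM user WHERE first_name = 'Mary'; -- seconds remaining; null means no TTL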
CQL Types page 66
Numeric types are as expected; they map to the native Java equivalents. Includes a decimal type
Textual types:
- text, varchar = the same: a UTF-8 string
- ascii = should only be used for legacy data.
Time and Identity types
- timestamp - encoded as a 64-bit signed int. Supports ISO 8601 dates as input
- date, time - added in 2.1 to support storing just the date or time of day, independent of a timestamp.
- uuid: a 128-bit value; uses a Type 4 (random) UUID. You can create a uuid via the CQL uuid() function
- timeuuid: a Type 1 UUID, which is based on the MAC address of the computer, the system time, and a sequence number to prevent duplicates. This is a “conflict-free” timestamp. There are a number of convenience functions for this type: now(), dateOf(), unixTimestampOf(). These helper functions mean that timeuuid is used more often than uuid (sketch below)
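Sketch (table is hypothetical):
CREATE TABLE user_visits (user_id text, visit_time timeuuid, PRIMARY KEY (user_id, visit_time));
INSERT INTO user_visits (user_id, visit_time) VALUES ('mary', now()); -- now() yields a fresh timeuuid
SELECT dateOf(visit_time) FROM user_visits WHERE user_id = 'mary';    -- extract the embedded timestamp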
Page 69: Primary keys are forever. You cannot change them without rewriting the data into a new table.
Other types:
- boolean
- blob - no validation is done on read/write of the blob contents
- inet - ipv4 or ipv6 address
- counter - a 64-bit signed int which can only be incremented or decremented. This allows for race-free incrementing across datacenters.
Collection Types (page 72)
Rather than storing multiple emails or addresses as, say, “email1”, “email2”, you can place this data in a collection type.
set
- stores a collection of elements, returns in sorted order, you can insert without reading first
- you remove from a set using the “-” operator, and null the set by assigning the empty set.
list
- ordered list stored in order of insertion
- elements are appended or prepended to the list using the “+” operator
- You can address elements in the list by index with the “[ ]” operator (e.g., [1])
- lists tend to be expensive because updating or deleting an item requires traversing the whole list
map
- contains a collection of key/value pairs; the values can be any type besides counter (update sketches for all three collection types below)
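Update syntax sketches, assuming the user table has emails set<text>, phone_numbers list<text>, and logins map<text, text> columns (all made up):
UPDATE user SET emails = emails + {'mary@example.com'} WHERE first_name = 'Mary';        -- add to a set
UPDATE user SET emails = emails - {'old@example.com'} WHERE first_name = 'Mary';         -- remove from a set
UPDATE user SET phone_numbers = phone_numbers + ['555-1234'] WHERE first_name = 'Mary';  -- append to a list
UPDATE user SET phone_numbers[0] = '555-9999' WHERE first_name = 'Mary';                 -- index into a list
UPDATE user SET logins = logins + {'web': '2021-02-06'} WHERE first_name = 'Mary';       -- add a map entry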
Tuples: Page 75
A tuple is a fixed-length record. In practice it is not often used, as the field names are not stored.
example:
address tuple<text, text, text, int> for (street, city, state and zip)
User-Defined Types - page 76
Similar to tuples but you can specify the value names for each position.
example:
CREATE TYPE address (
  street text,
  city text,
  state text,
  zip_code int
);
User-defined types are scoped to their keyspace, and you can see them via the DESCRIBE KEYSPACE command
You cannot nest user-defined types into collections without freezing them, which serializes them into a blob.
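For example (sketch, reusing the address type above):
ALTER TABLE user ADD addresses map<text, frozen<address>>; -- the UDT must be frozen inside a collection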
Chapter 5 - Data Modeling
This is the hotel example from the Cassandra docs done again
Design Differences Between RDBMS and Cassandra (Page 83)
No Joins
- You have to execute joins in the client or create a second denormalized table that represents the join
- It's preferred to create the second table and duplicate the data; joins in the client should be the rare case
No referential Integrity
- Lightweight transactions and batches exist, but Cassandra has no concept of referential integrity across tables.
- You will likely have related ids between tables, but ideas like cascading deletes and foreign keys don't exist
Denormalization: Cassandra will have tables with duplicated data.
Query First design: Model the queries first and let the data be organized around them. Think of the most common query paths your application will use and then create the tables that you need to support them.
Designing for optimal storage
- how the data is stored on disk matters in Cassandra
- minimize the number of partitions that must be searched in order to satisfy a given query.
sorting is a design decision
- The sort order on queries is fixed and determined entirely by the selection for clustering columns you supply in the CREATE TABLE command.
- You can add ORDER BY, but it's really just symbolic: you can only order by the clustering columns, in their declared order or its reverse (sketch below)
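Sketch (names made up) - the clustering order is baked in at CREATE TABLE time:
CREATE TABLE reservations_by_hotel (
  hotel_id text,
  start_date date,
  room_number smallint,
  PRIMARY KEY ((hotel_id), start_date, room_number)
) WITH CLUSTERING ORDER BY (start_date DESC, room_number ASC);
-- a SELECT's ORDER BY may only use this declared order or its exact reverse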
Page 88: Chebotko Diagrams are used to help you understand the design.
Page 90: They need to query over a range of dates so the date needed to be included in the clustering key.
Patterns and Anti-Patterns
“wide row/partition” - group multiple related rows in a partition in order to support fast access to multiple rows within the partition in a single query
“time series” - a series of measurements at specific time intervals is stored in a wide partition, where the measurement time is used as part of the partition key. This is typically used with sensor data. You could use this to store all writes to a customer's bank account and then allow the application to calculate the customer's balance, rather than using update+transactions.
It's a bad idea to use rows as queues. Don't do it, as it will pollute the LSM segments with tombstones, which are terrible for performance.
Cassandra is not good at deletes - keep this in mind
First we design the logical model, then the physical model.
Using UUIDs as references Page 95
The authors indicate it may make sense to use UUIDs to identify things such as hotel_id; for the purposes of simplicity they don't, and rather use a text field with a human-generated hotel code in it. They do recommend you consider using UUIDs in a real-world application to reduce coupling.
Pages 95 and 96 lay out the “physical” model of the hotels and reservations keyspaces.
What's interesting is that they put these in two different keyspaces. Why? The authors note that User Defined Types (UDTs) cannot be shared between keyspaces, so you have to declare them twice.
Calculating Partition Size page 97
You should not size a partition larger than 100,000 cells (a max of 2 billion is supported)
To calculate the size of a partition:
Number of cells = Number of Static Columns + Number of Rows × (Number of Columns − Number of Primary Key Columns − Number of Static Columns)
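Made-up worked example: 9 columns, 3 of them primary key columns, 1 static, 10,000 rows per partition → 1 + 10,000 × (9 − 3 − 1) = 50,001 cells, comfortably under the 100,000 guideline.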
Tables can be altered (growing the number of columns), so constrain the number of rows in a partition to keep this sane
Calculating Size on Disk page 98
They changed the storage engine in 3.0 and the SSTable files take less space.
To break up large partitions, add a column to the partition key. You can “bucket” the data into larger buckets, for example adding another column with a coarser granularity, e.g. replacing a day column with a month column (sketch below)
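Sketch of the bucketing idea (sensor tables made up):
-- one partition per sensor grows without bound:
CREATE TABLE readings_by_sensor (
  sensor_id text,
  reading_time timestamp,
  value double,
  PRIMARY KEY ((sensor_id), reading_time)
);
-- bucketing by month caps partition size:
CREATE TABLE readings_by_sensor_month (
  sensor_id text,
  month text, -- e.g. '2021-02'
  reading_time timestamp,
  value double,
  PRIMARY KEY ((sensor_id, month), reading_time)
);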
Chapter 6: The Cassandra Architecture
Cassandra is datacenter and rack aware.
Gossip protocol allows each node to keep track of state information about the other nodes. The Gossiper runs every second.
How Gossiper works (page 109)
- Once every second the gossiper will contact a random node.
- It will send its chosen friend a DigestSyn message
- The friend will respond with DigestAck
- The initiator after receiving the ack will respond with DigestAck2
If the endpoint does not respond it will be marked as dead (convicted) and placed in a local list.
Cassandra uses accrual failure detection.
A node is scored as to its likelihood of liveness (“suspicion”). The algorithm is adaptive to problematic networks. The “phi convict threshold” determines if the node has failed. This tunable is not linear. In general it will take Cassandra 10 seconds to determine that a node has failed.
Snitches Page 110
Snitches are used to determine the cluster topology for read operations.
The “SimpleSnitch” is topology unaware.
Rings and Tokens
Cassandra represents the data managed by a cluster as a ring. Each node in the ring is assigned one or more ranges of data described by a token, which determines its position in the ring.
A node claims ownership of the range of values less than or equal to each token and greater than the last token of the previous node, known as a token range.
CQL provides a token() function that can be used to request the value of the token corresponding to a partition key.
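e.g. (using the hypothetical user table from chapter 3):
SELECT first_name, token(first_name) FROM user; -- the token the partitioner derived from each partition key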
Virtual nodes (vnodes) break the token range up into smaller ranges, and each node is assigned multiple tokens. You can allocate more tokens to larger machines via the num_tokens config option.
Partitioners page 113
The partitioner determines how data is distributed to nodes.
The partitioner is a hash function for computing the token of a partition key.
In practice the partitioner is not often changed and cannot be changed after the cluster has been initialized.
Replication Strategies page 114
A node serves as a replica for different ranges of data. Cassandra replicates data across nodes in a manner transparent to the user. The replication factor is the number of nodes in your cluster that will receive copies of the same data.
Consistency Levels Page 115
You specify the consistency level on each read and write query. It determines how many replicas must respond to the request for it to succeed.
To achieve strong consistency in Cassandra: Read Replica Count + Write Replica Count > Replication Factor (e.g., RF = 3 with QUORUM reads and writes: 2 + 2 > 3)
Note: Replication Factor and Consistency Level are not the same. The replication factor is set on the keyspace; the consistency level is set on the request.
Coordinator Node - the node which receives and services the read or write request.
Hinted Handoff page 118
When a node is offline and the coordinator node receives a write, the coordinator can record a hint, which is to say: hold on to this write until the node has recovered. When the node comes back online, the coordinator sends the write on.
Hinted writes do not count toward the consistency level (except at ANY).
Hinted writes cannot be used for reads.
If too many hinted writes build up, then when the failed node comes online it can be overwhelmed/flooded with write requests from various coordinators. The number of hinted writes that can be stored on a coordinator node is limited to prevent this.
Anti-Entropy & Read Repair page 119
As reads happen, disagreeing replicas are brought back in sync (read repair).
Anti-entropy, or manual repair, is initiated by an operator. It causes Cassandra to run a validation compaction
The nodes trade Merkle Trees with neighboring replicas to check their table data.
In a merkle tree every parent node is a hash of all its direct child nodes.
Lightweight Transactions and Paxos page 120
A lightweight transaction provides “linearizable consistency”, which means no other client can modify the same cell between the read and write of our read-then-write operation.
LWTs use an extension of Paxos to do this and avoid two-phase commit. An LWT requires four round trips between the coordinator node and its replicas, meaning it is very expensive.
LWTs are limited to a single partition.
Memtables, SSTables and Commit Logs Page 122
On write, data is first written to a commit log, which will be used in crash recovery. The commit log is only read during node recovery operations after a crash. (Whether the commit log is used is controlled by the “durable_writes” setting on the keyspace.)
After writing to the commit log, the write is inserted into the memtable, an in-memory table containing data for a specific table.
When the memtable gets large it is dumped to disk into a file called an SSTable and a new memtable is started. Multiple memtables can exist in memory for a single table, with some waiting to be flushed.
After the memtable finishes flushing to disk, a bit is inserted into the commit log to indicate that the data for that memtable in the commit log is no longer needed.
All writes in cassandra are append operations and sequential.
Compaction performs a merge step of a merge sort on SSTables and removes the old files on success.
Bloom Filters Page 124
A bloom filter is an algorithm to determine if an element is a member of a set. It can provide false positives but not false negatives. On positive the value is searched in the SSTables.
This is used to reduce disk access on key lookups.
Caches page 125
Key Cache - Faster read access into SSTables on disk, stored in JVM heap
row cache - caches frequently accessed rows, stored off the JVM heap
chunk cache - added in 3.6 to store uncompressed chunks of SSTables to speed up access (off the JVM heap)
counter cache - improves counter type performance
Row Caching is disabled by default.
Compaction page 125
Compaction occurs to merge SSTables. During compaction the data in the SSTables is merged: columns are combined, obsolete values are discarded, and a new index is created.
An SSTable consists of multiple files, including the Data, Index, and Filter files.
On compaction a new SSTable is created from the merged tables.
Compaction improves performance by reducing the need to search old SSTables for values.
The compaction strategy used is set per table; examples are:
- Size Tiered (STCS) - the default; best for write-centric tables
- Leveled (LCS) - best for read-centric tables
- Time Window (TWCS) - best for time series data
Major compaction, or full compaction, is a feature in Cassandra for consolidating multiple SSTables into a single SSTable. The feature still exists but is not typically useful in newer releases of Cassandra.
Deletion and Tombstones page 127
Data in Cassandra is not deleted in place (the SSTables are immutable); rather, a tombstone is placed in the memtable to indicate that the data is deleted. This is akin to a soft delete in an RDBMS.
During compaction tombstones are removed once they pass the age of gc_grace_seconds which by default is 10 days.
If a failed node is down longer than gc_grace_seconds it should not be added back into the cluster, as it will contain deleted data whose tombstones have already been collected elsewhere, resurrecting it.
System Keyspaces page 133
Cassandra stores metadata about itself in system keyspaces. These can be seen via DESCRIBE TABLES.
Chapter 7 - Designing Applications with Cassandra
In this chapter the authors lay out a microservices-based design for the hotel application from chapter 5. The authors recommend you read “Building Microservices” by Sam Newman and “Domain-Driven Design” by Eric Evans.
They focus on three ideas first:
- Encapsulation - microservices should not use the database as an integration point
- Autonomy - each microservice should be independently deployable; each should have its own datastore
- Scalability - different services should be able to be scaled independently
You identify “Bounded Contexts”, which are groups of related entities. In the case of the hotel application it is divided into two bounded contexts: the “Hotel Domain” and the “Reservation Domain”
In each context you break the Domains into services which will each own specific groups of tables.
With cassandra a natural approach is to assign denormalized tables representing the same basic data type to the same service.
Pay attention to coupling and cohesion:
- Tables owned by a service should have a high degree of relatedness (cohesion)
- There should be a low degree of coupling between contexts
Representing database models in CQL
Key-Value models
The key is the partition key; the remaining data can be stored in a value column as a text or blob type. Keep values at <5 MB.
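Sketch:
CREATE TABLE kv_store (
  key text PRIMARY KEY,
  value blob -- opaque to Cassandra; keep under ~5 MB
);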
Document Models
Option One, Flexible Schema: store non-primary key columns in a blob. This is not a good solution.
Option Two: use CQL support for reading and writing data in JSON via INSERT JSON and SELECT JSON -> This requires that all the referenced attributes are defined in the table schema.
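Sketch (user table as before):
INSERT INTO user JSON '{"first_name": "Mary", "last_name": "Nguyen"}'; -- attributes must be defined columns
SELECT JSON * FROM user WHERE first_name = 'Mary';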
Graph Models
Some Datastax thing is recommended…
As your application evolves you will end up with more denormalized views of the data.
These may get expensive to maintain and you may want to look into alternatives.
Secondary Indexes page 144
You can place a secondary index on a column to avoid using ALLOW FILTERING and reading all rows across all nodes.
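Sketch (assuming last_name is a non-key column on user):
CREATE INDEX ON user (last_name);
SELECT * FROM user WHERE last_name = 'Nguyen'; -- no longer needs ALLOW FILTERING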
Secondary Index Pitfalls page 147
Secondary indexes are slower and more expensive than normal queries. Do not use secondary indexes with:
- Columns with High Cardinality
- Columns with very low cardinality - fat rows/partitions
- Columns that are frequently updated or deleted - “generate errors if the amount of deleted data builds up quicker than compaction can handle it”
It is preferred to use denormalization or materialized views over secondary indexes
SASI: A new secondary index implementation page 147
- Cassandra 3.4 added an experimental SI implementation called “SSTable Attached Secondary Index”
- SASI indexes are calculated and stored as part of each SSTable file.
- (The original SI implementation in Cassandra stores SIs in “hidden” tables.)
- SASIs support inequality and LIKE-based searches (sketch below)
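Sketch (index name and table assumed):
CREATE CUSTOM INDEX user_last_name_sasi ON user (last_name)
  USING 'org.apache.cassandra.index.sasi.SASIIndex';
SELECT * FROM user WHERE last_name LIKE 'N%'; -- prefix LIKE works with the default SASI mode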
Materialized Views (page 148)
Materialized views are the idea of having Cassandra maintain the denormalized views of your data rather than doing it in your application.
They are created with the “CREATE MATERIALIZED VIEW” command, and they use a SELECT … FROM … WHERE to build the view.
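Sketch (view name made up; every primary key column must get an IS NOT NULL filter):
CREATE MATERIALIZED VIEW users_by_last_name AS
  SELECT * FROM user
  WHERE last_name IS NOT NULL AND first_name IS NOT NULL
  PRIMARY KEY (last_name, first_name);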
There are a number of unfixed bugs with materialized views, such as:
- Limitations on the selection of primary key columns and filters
- Aggregates in materialized views are not supported
- Materialized views were a big part of the 3.0 release and drove significant design changes, such as the new storage engine. Most of the major fixes in the 3.x releases are around MVs. GLHF
NOTE: As of 2018 TLP recommended disabling materialized views in production!!!
They have implemented the reservation service in Java you can read it at https://github.com/jeffreycarpenter/reservation-service
When you create microservices it is advised to use one keyspace per microservice
How do you maintain data integrity between microservices? (page 154)
Patterns
- Add an orchestration service between the two data services. The orchestration service owns no datastore and interacts with the CRUD services to complete a business process. The example in the book is of the booking service orchestrating between the reservation and inventory services.
- Add a message queue such as Kafka to create a stream of change events which will be consumed asynchronously by the downstream services. The inventory service might choose to subscribe to events related to reservations to make corresponding changes in inventory. This is called “choreography”.
Both these patterns are distributed systems themselves and bring their own consistency tradeoffs. To work around this you can try:
- using a distributed transaction framework like Scalar DB
- Using an analytics tool like Apache Spark to check & fix data as a background task
- Use a CDC outbox pattern
The “KillrVideo” application is provided as an example. See github.com/killrvideo
Chapter 8 Application Development with Drivers
Cassandra has a history of drivers from its old Thrift protocol; most of these are now retired.
DataStax used to provide proprietary drivers, but these were merged into the open source drivers in 2019.
CqlSessions are expensive, as each maintains TCP connections to multiple nodes. Use a single CqlSession throughout your application and reuse it rather than doing setup/teardown. Use a single session (not one per keyspace) with applications that use multiple keyspaces.
PreparedStatements are more efficient than SimpleStatements, and more secure
The driver is configured with a specific LoadBalancingPolicy to keep from sending all traffic to a specific coordinator node. The policy has these options
- Round-Robin
- Token Awareness - when using prepared statements, the policy uses the token value of the partition key to select an optimal node for the query (TokenAwarePolicy)
- Datacenter Awareness - the local DC must be configured in the driver; the driver will then prefer nodes in this DC.
There are settings for query retry (RetryPolicy and RetryDecision)
The driver supports Speculative Execution which can send the query to two nodes and see if one responds first.
By default the driver creates a single connection per node.
Node Discovery Page 178
The first node the driver connects to maintains a “control connection”, which is used to maintain information about the state and topology of the cluster. This connection is used to discover the other nodes in the cluster.
Schema Access Page 179
The schema itself is eventually consistent and different nodes can see different schemas at different times.
Chapter 9 - Writing and Reading Data
Writes are sent with a consistency level.
- ANY - Allows for even a hint to count as a successful write
- LOCAL_ONE - the node accepting the write must be in the local DC
- LOCAL_QUORUM - a majority of replicas in the local DC have received the write
- EACH_QUORUM - QUORUM of nodes in each DC
The Write Path (page 187)
Coordinator sends simultaneous write requests to all local replicas.
If the cluster spans multiple DCs the local coordinator node selects a remote coordinator in each of the other DCs to forward the write to the replicas in that datacenter.
Each of the remote replicas acks the write directly to the original coordinator.
Once the coordinator receives acks from the desired number of replicas, it acks the write to the client.
Once a replica receives a write
- The data is journalled to the commit log
- The data is written to a memtable
- The row cache is invalidated (if used)
- If the memtable or commit log is “filled” by the write, a flush is scheduled to run
- The write is then acked to the client
- If needed, the node executes the flush. The contents of the memtables are written to SSTables on disk and the commit log is cleared.
- Compaction tasks are scheduled and a compaction is done if needed.
This is the simplest write path, which does not involve the special counter type or materialized views.
Commit Logs Page 189
- found in data/commitlog
- named CommitLog-<version>.log, where version is an integer for the commit log format, e.g. for 4.0 it's 7
SSTable files
- found in data/data
- The SSTable files for each table are put in a subdirectory here, named with a UUID
- The filenames are <version>-<generation>-<implementation>-<component>.db
- version: two characters for the version of the SSTable format. 4.0 is “na”
- generation - an index number that is incremented every time a new SSTable is created for a table.
- implementation - a reference to the implementation interface in use. “big” is the bigtable format
The SSTable consists of multiple file components
- Data.db - stores the actual data; these are the only files which are backed up.
- CompressionInfo.db - compression metadata on Data.db
- Digest.crc32
- Filter.db - The bloom filter for the SSTable
- Index.db - row and column offsets within the corresponding *-Data.db file for reading the file
- Summary.db - A sample of the index for faster reads
- Statistics.db - Stores stats on the SSTable
- TOC.txt lists the file components for this SSTable
Lightweight Transactions Page 191
Common LWT Semantics
- INSERT … IF NOT EXISTS - this blocks the upsert to ensure a value is unique.
- UPDATE … IF - used to make sure that a row has an expected value before the write occurs. This is frequently used to manage inventory counts (sketches below)
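Sketches (inventory table made up; both return an [applied] flag):
INSERT INTO user (first_name, last_name) VALUES ('Mary', 'Nguyen') IF NOT EXISTS;
UPDATE available_rooms SET rooms_free = 4 WHERE hotel_id = 'AZ123' IF rooms_free = 5;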
LWT can be used on schema creation. This is useful to script multiple schema updates.
LWTs select a serial consistency level
- SERIAL default, a quorum of nodes must respond
- LOCAL_SERIAL - a quorum of nodes in the local DC must respond.
LWT are limited to a single partition.
Batches Page 194
Allows you to group multiple modifications into a single statement.
- Only allows modification statements.
- Can be logged or unlogged; logged batches have more safeguards
- Batches ARE NOT TRANSACTIONS, but you can include LWTs in a batch. The LWTs in a batch must apply to the same partition
- Counter modifications require a special batch known as a counter batch
Used for:
- making multiple updates to a single partition
- keeping tables in sync, for example making modifications to denormalized tables that store the same data for different access patterns
NOTE: Batches are not for bulk loading!!! They are actually slow
Logged batches are atomic (as such they are sometimes called atomic batches)
Batches are limited by a size parameter in KB in cassandra.yaml (50 KB is the default)
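Sketch - keeping two denormalized tables in sync with a (logged) batch; table/column names as in the chapter 15 example later:
BEGIN BATCH
  INSERT INTO hotels (id, name) VALUES ('AZ123', 'Super Hotel');
  INSERT INTO hotels_by_name (name, id) VALUES ('Super Hotel', 'AZ123');
APPLY BATCH;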
Reading - Page 197
When reading higher consistency levels give you greater confidence you are reading the most recent data.
When multiple read values are sent to the coordinator and they have different timestamps, the newest value wins and is sent to the client. The out-of-date replica is updated in an operation called a read repair.
Read Consistency Levels
- ONE, TWO, THREE - Return the record held by the first node(s) that respond to the query. The record is checked against the same record on the other replicas; if any are out of date, a read repair is performed
- LOCAL_ONE - similar to ONE, with the requirement that the responding node is in the local DC
- QUORUM - Query all nodes; once a majority of replicas respond, return to the client the value with the most recent timestamp. Then, if needed, read repair any remaining replicas
- LOCAL_QUORUM - similar to quorum where the responding nodes are in the local DC
- EACH_QUORUM - a QUORUM in each DC
- ALL - Like it sounds, fail the read if any don’t respond.
The Read Path Page 199
The coordinator sends a read request to the fastest replica (perhaps itself). The fastest replica is determined by the snitch. A digest request is sent to the two other replicas.
A digest request returns a digest of the data rather than the actual data.
The digest of the data received from the full-read replica is calculated and compared with the digests received from the other replicas. If the digests match, the data is returned to the client. If the digests differ, a read repair must be done.
The replica read order is
- Row Cache
- Key Cache
- memtable
- SSTables
There is only a single active memtable for a given table, but there may be MANY SSTables for a single table.
Optimizations for searching SSTables:
- A bloom filter is used to determine if the requested partition does not exist in a given SSTable. This fast-fails the search on some of the tables.
- Next the key cache is checked to see if the offset into the file is cached. The key cache is a map structure in which keys are a combination of the SSTable file descriptor and partition key, and values are offset locations into SSTable files. This helps reduce seeks within the SSTable files.
- Finally, if no offset is found in the key cache, a two-level index stored on disk is used. The first level is the partition summary, which is used to obtain an offset for searching for the partition key within the second-level index, aka the “partition index”. The partition index is where the offset into the SSTable for the partition key is stored.
The SSTables are read at the desired offset. Once data is obtained from all SSTables, Cassandra merges the SSTable data by selecting the value with the latest timestamp for each requested column.
The result is cached in the row cache.
Read Repair page 201
On read repair the coordinator makes a full read request to all replicas.
The coordinator merges the data by selecting a value for each requested column:
- Latest timestamp wins
- Lexicographical Lottery if the timestamps are the same and the values differ
Read repair can be done AFTER returning data to the client on low consistency levels.
Transient Replicas page 202
You can add “transient replicas” to the cluster to catch writes when some replicas are offline. Writes are then sent back to the offline replicas when they recover. This is an experimental feature in the 4.0 release.
Transient replicas break: read repair, batches, LWTs, counters, secondary indexes, and materialized views
Paging page 208
Paginating results is supported in newer versions using the driver. The LIMIT clause cannot be used for pagination; it just returns a limited number of records
Deleting Page 208
When data is deleted, a tombstone is placed in the SSTable file to indicate it has been deleted. This allows nodes that are down when data is deleted to learn that the data has been deleted.
After the default of 10 days the data will be dropped from the SSTables during compaction.
The more tombstones you have the longer it takes cassandra to read data.
Techniques to minimize delete impact:
- Avoid writing NULL into your tables as this is interpreted as a delete.
- Delete data at the largest granularity you can, ideally an entire partition at once!
- Exercise care when updating: avoid replacing the entire contents of a list, set, or map. Instead update only the elements you need to modify
- If you set TTL cassandra will expire data automatically
- On time series data use TimeWindowCompactionStrategy
Chapter 10 - Configuring and Deploying Cassandra
(skimmed)
Cassandra Cluster Manager (ccm) is a set of Python scripts for running a Cassandra cluster on a single machine.
A Cassandra cluster has a name, and that name is written into the data files. If you start up the cluster with a different name it will fail to read the data files and shut down.
Seed Nodes
A seed node is used as a contact point for other nodes so Cassandra can learn the topology of the cluster - that is, which hosts have which ranges. Seed nodes are used for “bootstrapping”
You should have at least two seed nodes per DC
Snitches Communicate the Cluster Rack/DC Layout
- Simple - a flat cluster
- PropertyFile - just hardcode the nodes into a file and copy it to all nodes
- Gossip - Put the rack and dc value for each node in a file and the gossiper passes it around.
- RackInferring - looks at IPs
- DynamicEndpoint
- Cloud Snitches (Ec2Snitch, Gcp…)
Chapter 11 - Monitoring
(skimmed)
JMX
MBeans
Nodetool (which uses Mbeans)
Virtual Tables
Logging
- system.log - INFO messages or greater; cluster membership changes
- debug.log - DEBUG and above; flushing and compaction
- gc.log - JVM GC, standard JVM log
There is a full query log you can enable in 4.0 to see all the queries.
Chapter 12 - Maintenance
Skimmed/skipped - lots of detail on repair.
Chapter 13 - Performance Tuning
Set a well defined performance goal that contains both a throughput and latency goal like:
30,000 read operations per second from table blah at 5 ms 99th-percentile latency.
Cassandra Stress Page 304
- cassandra-stress comes with cassandra
- write and read options that produce some fun output.
- You can define a yaml file to have cassandra-stress run on your own tables you’ve created.
tlp-stress
- From “The Last Pickle”; similar to cassandra-stress but with a simpler syntax.
- Has built-in workloads for common data models such as time series and key-value.
- Has options for testing secondary indexes, materialized views, LWTs, and ALLOW FILTERING.
NoSQLBench
- Was called DSBench
- Created by DataStax
- Extensible tool for benchmarking complex workloads
“nodetool proxyhistograms” - helps find slow coordinator nodes
“nodetool tablehistograms” - helps find large partitions
The biggest key to performance is table design!
You can configure individual nodes to trace some or all of their queries via nodetool settraceprobability
Chapter 14 - Security
Skimmed.
Chapter 15 - Migrating and Integrating
Moving a RDBMS application to cassandra requires translating the data models.
You can do this indirectly, by reverse engineering the application and determining its data access paths, or directly, by translating the database's entity-relationship model.
When you directly translate the model you will likely look at the nouns and generate similar tables, but when you look at the tables you will likely see secondary indexes that indicate the other access patterns for the data. You will need to denormalize your data so that you can support these query patterns.
In the book, the example of a hotels table with a secondary index on name was transposed into two tables: a “hotels” table with a primary key of hotel_id, and a “hotels_by_name” table with a partition key of name and a clustering key of hotel_id. The address and phone data is duplicated in the two tables.
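A sketch of those two tables (column set simplified):
CREATE TABLE hotels (
  id text PRIMARY KEY,
  name text,
  address text,
  phone text
);
CREATE TABLE hotels_by_name (
  name text,
  id text,
  address text,
  phone text,
  PRIMARY KEY ((name), id) -- name as partition key, id as clustering column
);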
Things like addresses can be supported via User Defined Types
After translating your “noun” tables to query-centric tables, you'll need to remodel your JOIN tables. Since a JOIN table is bidirectional, you'll likely end up with two denormalized tables. In the example in the book, a RoomToAmenity table is converted into amenities_by_room and rooms_by_hotel.
Paper: A Big Data Modeling Methodology for Apache Cassandra by Artem Chebotko and Andrey Kashlev
Answering your developers' questions:
Qs:
- How can I make sure that I can read data immediately after it is written?
- How can I avoid race conditions when inserting or updating a row?
- How can I read data efficiently without joins?
As:
- First, use consistency levels for strong consistency via QUORUM or LOCAL_QUORUM
- Second, you can use batches to coordinate writes to multiple tables, but keep batch sizes small. LWTs allow you to maintain uniqueness when performing INSERTs
- Finally, you have to use denormalization to avoid joins.