Friday, March 6, 2015

Cassandra Compaction and Tombstone Behavior: Leveled vs. SizeTiered Compaction

Compactions in Cassandra can be contentious due to their impact on I/O load as well as the additional disk space they require. This post provides a primer on compaction and then discusses the differences in Cassandra's data organization and tombstone handling between the Leveled and SizeTiered compaction strategies.

What is compaction?

Compaction is a maintenance process which re-organizes SSTables to optimize data structures on disk as well as reclaim unused space. It is helpful to understand how Cassandra handles commits to the datastore to understand why compaction is so important to Cassandra's performance and health.

When writing to Cassandra, the following steps take place:

  1. The commit is logged to disk in a commit log entry, and inserted into an in-memory table
  2. Once the memtable reaches a limit on entries, it is flushed to disk
  3. Entries from the memtable being flushed are written out as a new SSTable for the column family
  4. If compaction thresholds are reached, a compaction is run

The key takeaway is that each write is recorded as a new SSTable entry. Since SSTable entries are immutable, a row in an SSTable cannot be changed once written. For example, a simple schema for a column family might look like:

CREATE TABLE simple_cf (
 id int,
 text1 text,
 text2 text,
 PRIMARY KEY (id)
)

Some initial data is populated into the column family:

cqlsh:test> INSERT INTO simple_cf (id, text1, text2) VALUES (1, 'This is a test 1', NULL);
cqlsh:test> UPDATE simple_cf SET text2='This is a test 2' WHERE id=1;

The Cassandra server is flushed (nodetool flush). A (partial) update is performed after the flush:

cqlsh> UPDATE simple_cf SET text2='This is a test 3' WHERE id=1;

The Cassandra server is flushed again (nodetool flush). The SSTables representing column family simple_cf are introspected. Note how the first entry reflects both the INSERT and subsequent UPDATE in one SSTable row because they were both executed before flushing from the memtable; Cassandra will attempt to consolidate row entries in the memtable where possible before writing:

stu$ sstable2json test-simple_cf-jb-1-Data.db
[
{"key": "00000001","columns": [["","",1425595340998000], ["text1","This is a test 1",1425595340998000], ["text2","This is a test 2",1425595367745000]]}
]
stu$ sstable2json test-simple_cf-jb-2-Data.db
[
{"key": "00000001","columns": [["text2","This is a test 3",1425595457034000]]}
]

Reviewing the data inserted into the simple_cf column family above:

cqlsh:test> SELECT * FROM simple_cf WHERE id=1;
 id | text1            | text2
----+------------------+------------------
  1 | This is a test 1 | This is a test 3
(1 rows)

When the SELECT is issued against the record inserted, Cassandra will need to perform a read request against both SSTables in order to reconstruct the single record. During compaction, these two SSTable entries will be merged into one. Running compaction against the simple_cf column family (nodetool compact) will result in a single new SSTable replacing the original two (above):

stu$ sstable2json test-simple_cf-jb-3-Data.db
[
{"key": "00000001","columns": [["","",1425595340998000], ["text1","This is a test 1",1425595340998000], ["text2","This is a test 3",1425595457034000]]}
]

In addition to user data, SSTables can also contain markers needed to support Cassandra's eventual consistency model. Deletions are a good example:

cqlsh:test> DELETE FROM simple_cf WHERE id=1;

After flushing to disk (nodetool flush), introspecting SSTables for simple_cf reveals:

stu$ sstable2json test-simple_cf-jb-3-Data.db
[
{"key": "00000001","columns": [["","",1425595340998000], ["text1","This is a test 1",1425595340998000], ["text2","This is a test 3",1425595457034000]]}
]
stu$ sstable2json test-simple_cf-jb-4-Data.db
[
{"key": "00000001","metadata": {"deletionInfo": {"markedForDeleteAt":1425596205184000,"localDeletionTime":1425596205}},"columns": []}
]

Note that Cassandra records the row deletion as a new SSTable entry. Next, inserting a new row, flushing, removing a column from that row, and then flushing again:

cqlsh:test> INSERT INTO simple_cf (id, text1, text2) VALUES (2, 'Testing1', 'Testing2');
cqlsh:test> SELECT * FROM simple_cf WHERE id=2;
 id | text1    | text2
----+----------+----------
  2 | Testing1 | Testing2
(1 rows)
-- FLUSH TO DISK (nodetool flush)
cqlsh:test> DELETE text2 FROM simple_cf WHERE id=2;
cqlsh:test> SELECT * FROM simple_cf WHERE id=2;
 id | text1    | text2
----+----------+-------
  2 | Testing1 | null
(1 rows)
-- FLUSH TO DISK (nodetool flush)

Introspecting new SSTables after flushing reveals:

stu$ sstable2json test-simple_cf-jb-6-Data.db
[
{"key": "00000002","columns": [["","",1425596539939000], ["text1","Testing1",1425596539939000], ["text2","Testing2",1425596539939000]]}
]
stu$ sstable2json test-simple_cf-jb-7-Data.db
[
{"key": "00000002","columns": [["text2","54f8e0be",1425596606981000,"d"]]}
]

Cassandra noted the column deletion with a "tombstone" record written as a new SSTable entry. The compaction process will eventually merge all SSTable updates and recover the space used by deleted data (and the associated tombstones for that data).

Triggering a manual compaction and introspecting the resulting SSTable reveals:

stu$ sstable2json test-simple_cf-jb-8-Data.db
[
{"key": "00000001","metadata": {"deletionInfo": {"markedForDeleteAt":1425596205184000,"localDeletionTime":1425596205}},"columns": []},
{"key": "00000002","columns": [["","",1425596539939000], ["text1","Testing1",1425596539939000], ["text2","54f8e0be",1425596606981000,"d"]]}
]

The data from multiple SSTable entries has been merged, but the space has not been recovered. Due to Cassandra's distributed nature, deletion markers are retained for no less than "gc_grace_seconds" to prevent deleted data from re-appearing via nodes that missed the deletion (for example, after an outage or network partition in the cluster). Once gc_grace_seconds expires, the data will be removed during compaction:

cqlsh:test> DESCRIBE TABLE simple_cf;
CREATE TABLE simple_cf (
 id int,
 text1 text,
 text2 text,
 PRIMARY KEY ((id))
) WITH
...
 gc_grace_seconds=864000;
...
cqlsh:test> ALTER TABLE simple_cf WITH gc_grace_seconds=10;

Running a second compaction (nodetool compact) after setting gc_grace_seconds to an artificially low value results in the following SSTable content:

stu$ sstable2json test-simple_cf-jb-9-Data.db
[
{"key": "00000002","columns": [["","",1425596539939000], ["text1","Testing1",1425596539939000]]}
]

Because gc_grace_seconds has expired, the tombstones are removed.
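
Depending on the Cassandra version in use, the sstablemetadata tool that ships with Cassandra (under the tools directory in the binary distribution) can also summarize an SSTable's metadata, including an estimate of droppable tombstones; where available, it is a lighter-weight check than dumping the entire table to JSON:

stu$ sstablemetadata test-simple_cf-jb-9-Data.db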

Why do we care about compactions and compaction strategies?

Compaction is a fundamental maintenance task necessary to support Cassandra's on-disk data structures. Since all SSTable entries are immutable, any update or delete to a row requires a new SSTable entry to be recorded. Deleted values are replaced by a tombstone record, retained for at least the length of time configured in gc_grace_seconds. At read time, Cassandra reconciles these entries by determining the most recent data and returns only that data to the client.

This means that:

  • Deletions and storage space recovery are lazy
  • A row can fragment across more than one SSTable
  • A tombstone is a fragment of a row
  • Reading a row requires reconciling all SSTable entries, and determining the most recent set of data before returning data to the caller.
  • Row fragments spread across multiple SSTables increase response latency (see the example below).
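
Fragmentation can be observed on a live node with nodetool cfhistograms, whose output includes a histogram of how many SSTables are consulted per read for a given column family (the exact output format varies between Cassandra versions):

stu$ nodetool cfhistograms test simple_cf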

At Polyvore, we use Cassandra extensively in production and as our data grows, so does our understanding of how Cassandra organizes its data on disk. When transitioning our user feed processing from MySQL to Cassandra, we wanted to explore how different compaction strategies affected tombstone handling (in our workload, there are many reads from the user feed as well as many updates). We undertook this investigative exercise when we ran into three notable findings on one of our clusters:

  1. The size of the tombstones in our user feed processing table approached the size of our live dataset,
  2. We approached a level of disk utilization which would prevent further size tiered compactions from taking place, and
  3. Fragmentation was noted in the SSTables, especially on wide rows.

The Test:

To test our theories, we created the following schema on a single node test cluster:

-- Table with current schema; covers test case SizeTiered
-- Please note that the sstable_size_in_mb option was set intentionally low
-- to force many compactions. Cassandra recommends this value be set
-- higher in production use.
CREATE KEYSPACE test WITH REPLICATION={'class':'NetworkTopologyStrategy', 'datacenter1':1};
USE test;
CREATE TABLE unprocessed_size (
  user_id int,
  reason_type text,
  insert_tuid timeuuid,
  object_class text,
  object_id int,
  reason_data text,
  PRIMARY KEY ((user_id, reason_type), insert_tuid)
) WITH CLUSTERING ORDER BY (insert_tuid DESC) AND
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=900 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy', 'min_threshold' : 2, 'min_sstable_size': 1048576} AND
  compression={'sstable_compression': 'LZ4Compressor'};
-- Table with current schema; covers test case Leveled
CREATE TABLE unprocessed_leveled (
  user_id int,
  reason_type text,
  insert_tuid timeuuid,
  object_class text,
  object_id int,
  reason_data text,
  PRIMARY KEY ((user_id, reason_type), insert_tuid)
) WITH CLUSTERING ORDER BY (insert_tuid DESC) AND
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=900 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 1} AND
  compression={'sstable_compression': 'LZ4Compressor'};

The two column families share the same structure; the only difference is the compaction strategy configured. 500,000 rows of random data were inserted into these tables with varying TTLs (a sketch of the insert statement appears after the list below). Then:

  1. The TTLs were allowed to expire
  2. The gc_grace period was allowed to expire
  3. SSTables were dumped to JSON and reviewed
  4. Repeated 3 times:
    1. Major compactions forced against the datasets
    2. SSTables were dumped to JSON and reviewed
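
For reference, per-row expiry in the test was set with CQL's USING TTL clause. A minimal sketch of the insert (the values and names here are illustrative only; the actual test used randomized data and varying TTLs):

-- Illustrative values only
cqlsh:test> INSERT INTO unprocessed_size (user_id, reason_type, insert_tuid, object_class, object_id, reason_data) VALUES (42, 'like', now(), 'post', 1001, 'example') USING TTL 600;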

SizeTiered Compaction:

When Cassandra was first released, the only compaction strategy available was SizeTiered. SizeTiered compaction allows the operator to define a threshold for the number of similarly sized SSTables backing a given column family. When the threshold is reached, Cassandra merges the existing SSTables into a single new table.

Tombstones whose gc_grace period has been met will be removed.
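
The thresholds that govern when this merge is triggered are exposed as sub-options of the table's compaction setting and can be adjusted with an ALTER TABLE; a sketch with illustrative values (not a tuning recommendation):

-- Illustrative values only
ALTER TABLE simple_cf WITH compaction={'class': 'SizeTieredCompactionStrategy', 'min_threshold': 4, 'max_threshold': 32};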

SizeTiered compaction is a general purpose algorithm suitable for a wide range of schema designs and data access patterns. The downside is that over time SSTables may suffer from fragmentation (where a row's current data is stored across more than one SSTable on disk), and that the compaction process itself is I/O and storage intensive. In fact, the worst case for a SizeTiered compaction is twice the space of the sum of all SSTables involved in the compaction: a 100GB column family can require an additional 100GB of disk space to compact successfully. In that scenario, assuming a maximum of four SSTables, the next full compaction would require 400GB of available space (800GB total needed during compaction). Needless to say, requiring storage utilization to be capped at 50% of capacity is non-ideal for some workloads.

The second issue with SizeTiered compaction is fragmented rows. A fragmented row is a row whose data is stored across multiple backend SSTables. This can occur very easily when an existing row is updated: since SSTables are immutable, an update to an existing row is written as a new entry rather than modifying the original. If this update lands in a different SSTable from where the record was originally stored, Cassandra must reconcile all of the row's entries in order to determine which data is the most recent. This results in additional non-sequential I/O and work for any read against the fragmented row.

Once the space requirement for compaction exceeds the available space on the host, compaction is no longer possible without intervention. It's easy to see how an inability to compact, combined with SizeTiered compaction, could result in serious read latency increases over the long term. Additionally, if compactions cannot be performed, tombstones will never be reaped.
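
Operationally, a growing compaction backlog is visible in nodetool compactionstats, which reports pending and in-progress compaction tasks on the node:

stu$ nodetool compactionstats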

Leveled Compaction:

Given an understanding of how Cassandra stores data, is it possible to reduce storage space contention as well as reduce fragmentation? Depending on workload and schema design, possibly. Leveled compaction requires the operator to define the maximum size of each SSTable. SSTables are grouped into levels, with each subsequent level holding roughly 10 times the data of the prior one. So, assuming a maximum table size of 10MB, level 1 targets roughly 100MB, level 2 roughly 1GB, level 3 roughly 10GB, and so on.

When a record is inserted, it's immediately placed into a L0 (Level 0) SSTable. The L0 SSTables are immediately compacted into Level 1 SSTables; duplicate/updated rows are merged. When 10 L1 tables are in place, they are compacted and merged into a set of L2 tables. This process continues indefinitely.
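
Switching an existing column family to leveled compaction is a simple schema change, after which Cassandra will begin re-compacting the existing SSTables into levels in the background. A sketch (the SSTable size shown is illustrative; choose a value appropriate for the workload):

-- Illustrative size only
ALTER TABLE simple_cf WITH compaction={'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};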

By merging SSTables during leveled compaction there is a high likelihood that fragmented rows will be combined during each leveled compaction, subsequently resulting in better guarantees of a single row being localized to one SSTable. For read-heavy workloads, this can reduce latency significantly as substantially fewer reads across SSTables are required. Even with many updates - and even with partial row updates - leveled compaction helps reduce fragmentation by frequently compacting leveled tables as the data size grows.

Of course, there are tradeoffs... the leveled compaction process itself is potentially more resource intensive than a SizeTiered compaction strategy, as an SSTable being promoted must be merged with the roughly 10 overlapping SSTables in the next level. The key takeaway, however, is that tombstones will only be reaped when the affected SSTable is compacted ("moves") into the next level and gc_grace has expired. Triggering a manual compaction is not sufficient to remove the tombstones.

When to consider SizeTiered Compaction:

Consider SizeTiered Compaction when:

  • There is uncertainty of which compaction strategy is best; SizeTiered Compaction provides a good set of tradeoffs for acceptable performance under most workloads.
  • Mixed workload of reads, updates, and deletes; avoiding updates to existing rows helps avoid fragmentation.
  • Read latency is not important; the upper bound to fragmentation is the number of SSTables representing the column family.
  • Tombstones make up a substantial portion of at-rest data and are reaped via manual compaction.
  • There is sufficient free space to store 100% growth of the largest column family.

When to consider Leveled Compaction:

Consider Leveled Compaction when:

  • Read-heavy workload; many reads, some updates
  • Workloads which frequently update existing rows
  • Read latency is important; the upper bound to fragmentation is reduced as the number of SSTables in the column family grows (due to the leveled compaction process)
  • Tombstone space recovery is not important; tombstones will not be reaped until an SSTable moves to the next compaction tier (i.e. moves from L1 to L2).
  • There is not sufficient free space to store 100% growth of the largest column family. Leveled compaction requires free space of only about 10 times the size of the largest SSTable in the column family.

Other findings:

  • Very wide rows are bad for compaction performance; wide rows fragmented across multiple SSTables (especially with partial updates) seem to result in lazy compaction.

Final Thoughts:

Cassandra brings robust, distributed data storage to any application able to utilize its API. It also offers CQL, a language strikingly similar to SQL's DML and DDL, which presents less of a learning curve than some competing solutions. Ease of use and a distributed key/value store do not, however, necessarily translate to success: the DBA or developer must understand how Cassandra organizes and maintains its underlying data to design a schema which will scale under the load of millions of users.

What's Next:

  • Investigation of Date Tiered Compaction behavior