The basic idea behind Exadata Hybrid Columnar Compression (hereafter referred to as EHCC) is to gain the benefits of column based storage while adhering to the fundamental row based storage principle of Oracle database. Column based databases often claim that they need less IO than a row based store to answer a query that touches only a few columns. In a row based store, such a query still has to read whole rows, and often the entire table, which costs multiple IOs. Hybrid columnar compression applies the column based storage philosophy when compressing column values, while keeping the rows together in a logical unit known as a Compression Unit. It saves storage space through compression and yields performance benefits by reducing IO. EHCC is best suited for databases with few updates and low concurrency. It applies only to table and partition segments, not to index and LOB segments.
How does EHCC work? What is a Compression Unit?
EHCC is one of the smart features of Exadata that targets storage savings and performance at the same time. Although it is best known as an Exadata feature, EHCC can also be enabled on other storage systems such as Pillar Axiom and ZFS storage servers. Traditionally, the rows within a block are placed sequentially, one after another, in row format. Because columns of unlike data types sit next to each other, the data within a block compresses poorly. EHCC analyzes a set of rows and encapsulates them into a compression unit in which like columns are compressed together. As Oracle assigns a column vector to each column, and the values within one column tend to be alike, compressing column by column gives considerable savings in space. Column compression gives a much better compression ratio than row compression.
Do not conclude that Exadata offers columnar storage through EHCC. It is still row based database storage, with the stress on the word “hybrid” in hybrid columnar. The rows are placed in a Compression Unit where like columns are compressed together efficiently. Kevin Closson explains the structure of a CU in one of his blog posts (http://kevinclosson.wordpress.com/2009/09/01/oracle-switches-to-columnar-store-technology-with-oracle-database-11g-release-2/ ) as “A compression unit is a collection of data blocks. Multiple rows are stored in a compression unit. Columns are stored separately within the compression unit. Likeness amongst the column values within a compression unit yields the space savings. There are still rowids (that change when a row is updated by the way) and row locks. It is a hybrid.”
Note that EHCC is effective only for direct path operations, i.e. operations that bypass the buffer cache.
A table or partition segment on an Exadata system can hold compression units, OLTP compressed blocks, and uncompressed blocks. A CU is independent of the block and the block size, but it is always larger than a single block because it spans multiple blocks. Read performance benefits from the fact that a row can be retrieved in a single IO by picking up the specific CU instead of scanning the complete table. Hence, EHCC reduces both storage space and disk IO by a considerable factor. A compression unit cannot be compressed further.
Compression Algorithms – The three compression algorithms used by EHCC are LZO, ZLIB, and BZ2. LZO is the fastest and gives the lowest level of compression, ZLIB offers a fair balance between compression and CPU cost, and BZ2 gives the highest level of compression at the greatest CPU cost.
CU Size – On average, a typical CU is 32k-64k in size for warehouse compression, while for archival compression the CU size is between 32k and 256k. In warehouse compression, around 1 MB of row data (16-20 rows depending on the row size) is analyzed for a single CU. In archival compression, around 3 MB to 10 MB of row data is analyzed to build a CU.
EHCC types – EHCC works in two formats: warehouse compression and archival compression. Warehouse compression is aimed at data warehouse applications whose data is still queried often, and its compression ratio hovers between 6x and 10x. Archival compression suits historical data, which has a low probability of updates and transactions.
EHCC DDLs – Here are a few scripts to demonstrate basic compression operations on tables
-- Create new tables/partitions with different compression techniques --
create table t_comp_dwh_h ( a number ) compress for query high;
create table t_comp_dwh_l ( a number ) compress for query low;
create table t_comp_arch_h ( a number ) compress for archive high;
create table t_comp_arch_l ( a number ) compress for archive low;
-- Query compression type for a table (the dictionary stores table names in upper case) --
select compression,compress_for from user_tables where table_name = '[TABLE NAME]';
-- Enable EHCC for future direct path loads only (existing rows stay as they are) --
alter table t_comp_dwh compress for query low;
-- Enable EHCC for existing rows by rebuilding the segment --
alter table t_comp_dwh move compress for query low;
-- Disable EHCC for future loads (use MOVE NOCOMPRESS to decompress existing rows) --
alter table t_comp_dwh nocompress;
-- Specify multiple compression types in a single table (column list added for illustration) --
create table t_comp_dwh_arch ( yr number, a number )
PARTITION BY RANGE (yr)
(PARTITION P1 VALUES LESS THAN (2001) compress for archive high,
PARTITION P2 VALUES LESS THAN (2002) compress for query high);
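For a partitioned table the compression setting lives at partition level, so it is worth checking it there as well. The query below is a minimal sketch that assumes the T_COMP_DWH_ARCH table from the example above exists in the current schema.

```sql
-- Sketch: list the compression setting of each partition.
-- Assumption: T_COMP_DWH_ARCH exists in the current schema.
select partition_name,
       compression,      -- ENABLED or DISABLED
       compress_for      -- e.g. QUERY HIGH, ARCHIVE HIGH
from   user_tab_partitions
where  table_name = 'T_COMP_DWH_ARCH'
order  by partition_position;
```

Each partition should report the level it was created with, which makes this a quick check after partition maintenance.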
Language support for CUs – A CU is fully compatible with indexes (b-tree and bitmap), materialized views, partitioning, and Data Guard. It is fully supported with DML, DDL, parallel queries, and parallel DML and DDL. Let us examine certain operations on a CU.
Select – EHCC with smart scan enables query offloading to the Exadata storage servers. Smart scan reads are performed as direct path reads, i.e. they bypass the buffer cache. If the database reads many columns of the table and performs frequent transactions, the benefits of EHCC are compromised. This is how the read operation proceeds -
A CU is buffered => Predicates are processed => Predicate columns are decompressed => Predicates are evaluated => CUs are rejected if no row satisfies the predicate => For satisfying rows, the projected columns are decompressed => A small CU is created with only the projected and predicate columns => It is returned to the DB server.
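One way to see whether a query actually went through this offloaded path is to look at the session statistics right after running it. The query below is only a sketch: it assumes the session can read V$MYSTAT and V$STATNAME, and it matches statistic names by pattern because the exact names differ between releases.

```sql
-- Sketch: run a full scan on an EHCC table in this session first,
-- then list the non-zero offload and EHCC related statistics.
-- Assumption: the session has SELECT privilege on the V$ views.
select n.name, s.value
from   v$statname n
       join v$mystat s on s.statistic# = n.statistic#
where  (n.name like 'cell physical IO%'   -- smart scan / offload counters
        or n.name like 'EHCC%')           -- EHCC decompression counters
and    s.value > 0
order  by n.name;
```

If the smart scan counters stay at zero, the scan was most likely served through the buffer cache rather than offloaded.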
Locking – When a row in a compression unit is locked, the whole compression unit is locked until the lock is released.
Inserts – Hybrid columnar compression takes effect only at load time, and only with direct path operations. The load can use any data warehouse load technique or a bulk load. For conventional or single row inserts, the data still resides in blocks that are either uncompressed or OLTP compressed. New CUs are created only during bulk inserts or when the table is moved to a columnar compression state.
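The difference between the two insert paths can be sketched as follows. The example assumes the T_COMP_DWH_H table created earlier exists in the current schema. The row count is arbitrary, and the numbers returned by GET_COMPRESSION_TYPE are the codes listed under the Compression Adviser.

```sql
-- Conventional single row insert: the row lands in a regular block, not in a CU.
insert into t_comp_dwh_h (a) values (1);
commit;

-- Direct path insert (APPEND hint): rows are loaded in bulk and packed into CUs.
insert /*+ append */ into t_comp_dwh_h (a)
select level from dual connect by level <= 10000;
commit;

-- Count rows per compression type to see where each load ended up.
select dbms_compression.get_compression_type(user, 'T_COMP_DWH_H', rowid) as comp_type,
       count(*) as row_count
from   t_comp_dwh_h
group  by dbms_compression.get_compression_type(user, 'T_COMP_DWH_H', rowid);
```

The commit after the direct path insert is required before the same session can query the table again.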
Updates – Updating a row in a CU causes the CU to be locked, and the row moves out of the CU to a lower compression state. This hinders concurrency on the CU and negatively affects the compression. The effect can be observed in warehouse compression, but it is certainly stronger in archival compression. The ROWID of the updated row changes after the transaction.
Delete – Every row in a CU has an uncompressed delete bit, which is checked to see whether the row is marked for deletion.
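The row migration described above can be observed by comparing a row before and after an update. This is a sketch under two assumptions: T_COMP_DWH_H was loaded through a direct path operation, and it contains a row with a = 42 (a hypothetical value picked for illustration).

```sql
-- Before the update: note the ROWID and the compression type of the row.
select rowid,
       dbms_compression.get_compression_type(user, 'T_COMP_DWH_H', rowid) as comp_type
from   t_comp_dwh_h
where  a = 42;

-- Update the row; per the text it leaves its CU.
update t_comp_dwh_h set a = -42 where a = 42;
commit;

-- After the update: compare the ROWID and the compression type with the first result.
select rowid,
       dbms_compression.get_compression_type(user, 'T_COMP_DWH_H', rowid) as comp_type
from   t_comp_dwh_h
where  a = -42;
```

A changed ROWID and a lower compression code in the second result would confirm that the row has left its original CU.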
Compression Adviser – The DBMS_COMPRESSION package serves as the compression adviser. You can find out how a row is compressed by using the DBMS_COMPRESSION.GET_COMPRESSION_TYPE subprogram. It returns a number indicating the compression technique for the input ROWID. Possible return values are 1 (No Compression), 2 (OLTP Compression), 4 (EHCC – Query high), 8 (EHCC – Query low), 16 (EHCC – Archive high), and 32 (EHCC – Archive low). In addition, the GET_COMPRESSION_RATIO subprogram can be used to estimate the compression ratio of a segment for a given compression technique, which helps in choosing one.
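A call to GET_COMPRESSION_RATIO might look like the sketch below. The scratch tablespace name USERS and the table name T_COMP_DWH are assumptions. The arguments are passed by position because some parameter names differ between releases, and some releases only sample tables that already hold a large number of rows.

```sql
set serveroutput on
declare
  l_blkcnt_cmp    pls_integer;    -- blocks used by the compressed sample
  l_blkcnt_uncmp  pls_integer;    -- blocks used by the uncompressed sample
  l_row_cmp       pls_integer;    -- rows per block, compressed
  l_row_uncmp     pls_integer;    -- rows per block, uncompressed
  l_cmp_ratio     number;         -- estimated compression ratio
  l_comptype_str  varchar2(100);  -- description of the evaluated compression type
begin
  dbms_compression.get_compression_ratio(
    'USERS',        -- scratch tablespace (assumed name)
    user,           -- owner of the table
    'T_COMP_DWH',   -- table to analyze (assumed to exist)
    null,           -- partition name, null for the whole table
    4,              -- type to evaluate: 4 = EHCC query high, per the list above
    l_blkcnt_cmp, l_blkcnt_uncmp, l_row_cmp, l_row_uncmp,
    l_cmp_ratio, l_comptype_str);
  dbms_output.put_line('Estimated ratio: ' || l_cmp_ratio || ' (' || l_comptype_str || ')');
end;
/
```

Repeating the call with 8, 16, and 32 gives an estimate for each EHCC level, so the levels can be compared before any data is moved.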
EHCC is one of the most discussed smart features of Exadata database systems. It promises to provide at least 10x storage benefits, though certain benchmarks have shown better results too. A famous citation, which I see in almost every other session on EHCC, goes like this: a 100TB database can be compressed to 10TB, thus saving 90TB of space on the storage, and hence 9 other databases of size 100TB can be placed on the same storage. Thus, the IT management can be relieved of storage purchases for at least 3-4 years, assuming the data grows by a factor of two. I’ll say the claim looks pretty convincing from the marketing perspective but quite impractical on technical grounds. I would rather read it as: 1000TB of historical data can be accommodated on 100TB of storage.
A lot has been written and discussed on whether Oracle is on its way to embracing columnar storage techniques. I’ll say NO, because it looks like just an application of the concept, which does no harm. The biggest hurdle for the EHCC feature is its own comfort zone, i.e. databases with few transactions and low concurrency. On a database that frequently runs transactions and reads the data, the feature stands defeated.
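Rather than relying on a quoted ratio, the saving can be measured on a copy of real data. The sketch below assumes a source table named T_SALES, a hypothetical name, and enough free space to hold two copies of it.

```sql
-- Build an uncompressed and an archive compressed copy of the same data.
-- CREATE TABLE AS SELECT is a direct path load, so the second copy is stored in CUs.
create table t_sales_nocomp nocompress as select * from t_sales;
create table t_sales_arch compress for archive high as select * from t_sales;

-- Compare the segment sizes to get the ratio actually achieved for this data.
select segment_name,
       round(bytes / 1024 / 1024) as size_mb
from   user_segments
where  segment_name in ('T_SALES_NOCOMP', 'T_SALES_ARCH')
order  by segment_name;
```

Dividing the two sizes gives the ratio for that particular data set, which depends heavily on how repetitive the column values are.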
Any conflicts, comments, observations, or feedback on the write-up are welcome.