Exadata Smart Flash Cache – a note of understanding

I thought of publishing this post to serve as a note of understanding on Smart Flash Cache in an Exadata database system. It outlines the general behavior and working mechanism of flash in co-ordination with storage intelligence. Here we go –

Exadata Smart Flash Cache
Oracle Exadata is built to deliver high performance through its smart features, and one of the most important of them is the Smart Flash Cache. The flash cache is a hardware component in each Exadata storage cell server that speeds up read and write operations. Each storage cell has four flash cards and each flash card contains four flash disks, so a single storage cell server holds 16 flash disks in total.

In the Exadata X3 version, the flash cache capacity has been increased four times, with up to 40 times better response time and 100GB/sec scan rates. In the X3 version, the Sun Flash F40 PCIe cards total up to 22.4TB of flash in a full rack, compared to 5.4TB in the X2 version. In addition, they can scan data 1.4 times faster than the X2 version. The flash belongs to the eMLC category (enterprise grade multi-level cell), which employs different techniques to improve flash health and endurance and, most importantly, to accommodate more write cycles (20k to 30k). (A write cycle is the transparent process of erasing old data to make room for new incoming data.)

Flash Cache Hardware details
Exadata X2 version – [Sun F20 cards]

Capacity = 4 (flash cards per cell) * 4 (flash disks per card) * 24GB (capacity of a single flash disk) = 384GB per storage cell

Exadata X3 version – [Sun F40 cards]

Capacity = 4 (flash cards per cell) * 4 (flash disks per card) * 100GB (capacity of a single flash disk) = 1600GB per storage cell
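
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the figure of 14 storage cells per full rack is my assumption (it is what makes the 22.4TB and 5.4TB totals quoted earlier add up), and the function names are made up for illustration.

```python
# Per-cell flash capacity = cards per cell * flash disks per card * GB per flash disk.
CARDS_PER_CELL = 4
DISKS_PER_CARD = 4
CELLS_PER_FULL_RACK = 14  # assumption: a full rack with 14 storage cells


def cell_flash_gb(disk_gb: int) -> int:
    """Flash capacity of one storage cell in GB."""
    return CARDS_PER_CELL * DISKS_PER_CARD * disk_gb


def rack_flash_tb(disk_gb: int, cells: int = CELLS_PER_FULL_RACK) -> float:
    """Flash capacity of a whole rack in decimal TB (1 TB = 1000 GB)."""
    return cell_flash_gb(disk_gb) * cells / 1000


for name, disk_gb in (("X2 (F20)", 24), ("X3 (F40)", 100)):
    print(f"{name}: {cell_flash_gb(disk_gb)} GB per cell, "
          f"{rack_flash_tb(disk_gb):.1f} TB per full rack")
```

With these inputs the X2 works out to 384 GB per cell and roughly 5.4 TB per rack, and the X3 to 1600 GB per cell and 22.4 TB per rack.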

The motivation and how the flash works
IO operations on a hard disk are mechanical: the head has to travel to a specific location before a block can be read or written, which makes disk operations significantly costlier. Flash is solid-state memory with no moving parts, so serving an IO from the flash cache instead of the disk is much faster. In an OLTP system, the database is read and write intensive and data grows at an uneven rate. The flash cache is designed to reduce the IO bottlenecks and yield at least 20x benefits by speeding up IO operations. The cache operations are fully redundant and transparent to the end user, apart from the statistics. The flash cache can also work in coordination with IORM to control the use of flash when multiple databases share the storage. This lets customers reserve flash for the critical databases and get the performance benefits transparently.

The smartness of the flash cache comes from its ability to understand the type of each IO. It's because of this intelligence that it knows which IOs to cache and which ones to skip. The flash cache caches IO pertaining to control files, file headers, data blocks and index blocks, while it skips the IO incurred by backups, mirroring, Data Pump operations and large table scans.

Important: Do not confuse the Exadata Smart Flash Cache with the flash cache option in Oracle Database 11g Release 2 on Solaris or Linux. The “database flash cache” is an extension of the SGA that expands the buffer cache area on the database server. The Exadata Smart Flash Cache, in contrast, resides on the storage server, where it caches frequently accessed data and speeds up read operations (and, with X3, write operations) by reducing disk IO.

The Exadata X3 series makes considerable software-level changes to the flash cache, speeding up not only read operations but write operations as well. It is capable of supporting 1 million 8k flash writes per second and 1.5 million 8k flash reads per second. This makes the new flash 20x more efficient than the X2 or V2 series flash. So now the cache can hold not only the hot data but, in fact, any data that comes in from the database. The credit goes to the new working policy adopted by the flash, known as Write Back Cache.

Make the flash persistent by partitioning flash disk into grid disk
The flash cache only holds a cached copy of the hot data: entries can be aged out, and the cache itself can be dropped and recreated at any time. Optionally, a portion of the flash can be partitioned and used as persistent logical flash disks. These flash disks can then be used to store (not cache) the hot data permanently. Grid disks are carved out of the flash-based cell disks and can then be assigned to an ASM disk group. The partitioning and assignment process is very similar to that for physical disks.

Flash-based grid disks defeat the purpose of a cache, but they can be used as reserve disks for highly write-intensive workloads where the existing disk configuration is not sufficient. Another point that discourages partitioning is that the usable flash storage is cut to half to accommodate mirroring.

Flash cache working policies – WTC and WBC
There are two working mechanisms for the flash cache: write through and write back. Exadata systems before the X3 release worked with the write through policy. With the announcement of the X3 Exadata systems at Oracle OpenWorld 2012 (OOW 12), it was revealed that the flash cache would support the write back caching mechanism too.

Write Through Cache (Read/Write)
The older mechanism is pretty straightforward. The data is written directly to the disk without any intervention by the flash. CELLSRV then sends the acknowledgment back to the database via iDB. After that, if the data block meets the caching criteria, it is written to flash as well.

While reading data blocks, CELLSRV maintains a hash lookup that maps each data block to its location, i.e. flash or disk. On a flash hit, the requested data is sent to the database straight from flash. On a cache miss, the data is read from the disk and validated against the caching criteria once again. If the block qualifies as “hot”, it is retained in the flash cache.
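
To make the write through flow concrete, here is a minimal Python model of it. It is only a sketch of the behaviour described above, not how CELLSRV is implemented: the class name, the dictionaries standing in for disk and flash, and the `cacheable` flag (a stand-in for the caching criteria) are all my own illustrative assumptions.

```python
class WriteThroughCache:
    """Toy model: every write goes to disk first, flash only holds copies."""

    def __init__(self, disk):
        self.disk = disk        # block id -> data, stands in for the hard disks
        self.flash = {}         # block id -> data, stands in for the flash cache
        self.disk_reads = 0     # number of reads that had to go to disk

    def write(self, block_id, data, cacheable=True):
        self.disk[block_id] = data            # 1. write directly to disk
        # 2. the acknowledgment would go back to the database at this point
        if cacheable:                         # 3. populate flash if the block qualifies
            self.flash[block_id] = data
        else:
            # assumption: never leave a stale copy of the block in flash
            self.flash.pop(block_id, None)

    def read(self, block_id, cacheable=True):
        if block_id in self.flash:            # hash lookup: flash hit
            return self.flash[block_id]
        self.disk_reads += 1                  # cache miss: read from disk
        data = self.disk[block_id]
        if cacheable:                         # retain the block if it qualifies
            self.flash[block_id] = data
        return data


cache = WriteThroughCache(disk={})
cache.write(1, "row data")
cache.read(1)                 # served from flash
print(cache.disk_reads)       # 0
```

The point to notice is that flash never holds the only copy of a block: dropping the whole `flash` dictionary loses performance but no data.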

What are the caching criteria? An IO comes with two metadata values: 1) the CELL_FLASH_CACHE setting defined at the segment or partition level, and 2) the caching hint attached by the database.

The metadata contains the CELL_FLASH_CACHE setting for the object, and its value determines whether the data block can be cached. DEFAULT means the smart flash cache decides whether to cache the block or not. KEEP means the smart flash cache must cache the data with priority. NONE means the data block should not be cached. A huge object with the DEFAULT setting will not be cached. KEEP and DEFAULT have different retention policies. Also, KEEP cached blocks can occupy at most 80% of the total flash cache size. In addition, unused KEEP cached blocks are flushed from the cache if they fail the aging criteria.

The database adds another caching hint based on the purpose of the IO. It can be CACHE, NOCACHE or EVICT. The first two are self-explanatory. The third one, EVICT, hints that the specific block has to be flushed out of the cache.
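
The two pieces of metadata can be combined into a single decision. The Python function below is a hedged sketch of that decision, not the actual CELLSRV logic: the function name, the `is_large_object` flag and, in particular, the order of precedence between the hint and the CELL_FLASH_CACHE setting are my assumptions.

```python
VALID_SETTINGS = {"DEFAULT", "KEEP", "NONE"}
VALID_HINTS = {"CACHE", "NOCACHE", "EVICT"}


def should_cache(cell_flash_cache: str, hint: str, is_large_object: bool = False) -> bool:
    """Decide whether a block should be placed (or kept) in the flash cache."""
    setting = cell_flash_cache.upper()
    hint = hint.upper()
    if setting not in VALID_SETTINGS or hint not in VALID_HINTS:
        raise ValueError(f"unknown setting/hint: {cell_flash_cache!r}, {hint!r}")
    if hint != "CACHE":           # NOCACHE and EVICT: the database says no (assumed to win)
        return False
    if setting == "NONE":         # object is excluded from the flash cache
        return False
    if setting == "KEEP":         # object must be cached with priority
        return True
    return not is_large_object    # DEFAULT: the cache decides, huge objects are skipped


print(should_cache("KEEP", "CACHE"))                           # True
print(should_cache("DEFAULT", "CACHE", is_large_object=True))  # False
print(should_cache("DEFAULT", "EVICT"))                        # False
```

The 80% ceiling for KEEP blocks and the aging criteria are left out to keep the sketch short.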

Write Back Cache (Read/Write)
With the Exadata X3 announcement, the smart flash cache adopted the WBC mechanism to speed up write operations as well as reads, with backward compatibility. This means that the WBC feature can also be enabled on earlier Exadata systems (X2 or V2) by upgrading the storage cell software version and the database version. The WBC feature needs a minimum storage cell software version and, on the database side, BP 9 onwards. By default, it is disabled.

In write back mode, the flash cache can service the database's write operations directly. During first-time inserts, a block written to the flash cache is marked as “dirty” to signify that it is the latest copy of the block. If the database requests an update to a block that doesn’t reside in flash, the block is pulled from the disk into the cache, updated and marked as “dirty”. If the database requests a block, it is read directly from the cache, thus avoiding the heavy disk IO. A block that is written to flash and frequently accessed can stay in the flash for years. However, if the block is rarely accessed, only its primary copy may be retained in the flash. The rest of the data can be copied back to the disk.
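
Here is a small Python model of the write back behaviour described above. As before, this is a sketch under my own assumptions (class and method names are made up, a write replaces the whole block, and aging and mirroring are ignored), not the real implementation.

```python
class WriteBackCache:
    """Toy model: writes are absorbed by flash and reach disk only on flush."""

    def __init__(self, disk):
        self.disk = disk      # block id -> data, stands in for the hard disks
        self.flash = {}       # block id -> data, stands in for the flash cache
        self.dirty = set()    # blocks whose latest copy exists only in flash

    def write(self, block_id, data):
        self.flash[block_id] = data   # the write is serviced by flash
        self.dirty.add(block_id)      # flash now holds the latest copy

    def read(self, block_id):
        if block_id not in self.flash:              # miss: pull the block from disk
            self.flash[block_id] = self.disk[block_id]
        return self.flash[block_id]

    def flush(self):
        """Copy all dirty blocks back to disk and return how many were written."""
        for block_id in sorted(self.dirty):
            self.disk[block_id] = self.flash[block_id]
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed


disk = {1: "old"}
cache = WriteBackCache(disk)
cache.write(1, "new")
print(disk[1], cache.read(1))   # old new
print(cache.flush(), disk[1])   # 1 new
```

Until `flush()` runs, the disk copy is stale. That is why the dirty blocks have to be flushed before a cell is switched back from write back to write through.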

Steps to enable the flash in Write Back mode –

CellCLI> drop flashcache
CellCLI> alter cell shutdown services cellsrv
CellCLI> alter cell flashCacheMode = WriteBack
CellCLI> alter cell startup services cellsrv
CellCLI> create flashcache all

The Write Back Cache mode can be reverted to the Write Through Cache mode after manually flushing all the dirty blocks back to the disk. Once the mode has been changed, the flash cache is created again:

CellCLI> alter flashcache all flush
CellCLI> drop flashcache
CellCLI> alter cell shutdown services cellsrv
CellCLI> alter cell flashCacheMode=WriteThrough
CellCLI> alter cell startup services cellsrv
CellCLI> create flashcache all

As I finish this post, I realize there is scope for one more post dedicated to Write Back Cache. I shall be back with more details on how Write Back Cache works, along with some hands-on examples.

Exadata Hybrid Columnar Compression

The basic idea behind Exadata Hybrid Columnar Compression (hereafter referred to as EHCC) is to reap the benefits of column-based storage while sticking to the fundamental row-based storage principle of the Oracle database. Databases that follow column-based storage often claim that they need less IO than a row-based store to answer a query that touches only a few columns. In a row-based store, such a search can require the entire table to be scanned, which needs multiple IOs. Hybrid columnar compression uses the column-based storage philosophy to compress the column values, while retaining the rows in a logical unit known as a Compression Unit. It saves storage space through compression and yields performance benefits by reducing the IOs. EHCC is best suited for databases with few updates and low concurrency. Also, it applies only to table and partition segments, not to index and LOB segments.

How does EHCC work? What is a Compression Unit?
EHCC is one of the exclusive smart features of Exadata, and it targets storage savings and performance at the same time. EHCC can also be enabled on other Oracle storage systems such as Pillar Axiom and ZFS storage servers. Traditionally, the rows within a block are placed sequentially in row format, one next to another. Because columns of unlike data types sit next to each other, the data within a block compresses poorly. EHCC analyzes a set of rows and encapsulates them into a compression unit, where like columns are compressed together. As Oracle designates a column vector for each column, compressing like columns with like values ensures considerable savings in space. Column compression gives a much better compression ratio than row compression.
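
A tiny Python experiment shows why grouping like columns together helps. Note the assumptions: run-length encoding is used here only because it is easy to follow by hand (it is not the algorithm EHCC uses), and the sample rows are made up.

```python
from itertools import groupby


def rle(values):
    """Run-length encode a sequence into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Four rows of (region, status) that would sit in one compression unit.
rows = [
    ("EMEA", "SHIPPED"),
    ("EMEA", "SHIPPED"),
    ("EMEA", "SHIPPED"),
    ("EMEA", "PENDING"),
]

# Row format: values of unlike columns alternate, so no runs form.
row_major = [value for row in rows for value in row]
# Columnar format inside the unit: like values end up next to each other.
column_major = [value for column in zip(*rows) for value in column]

print(len(rle(row_major)))     # 8 runs, nothing saved
print(rle(column_major))       # [('EMEA', 4), ('SHIPPED', 3), ('PENDING', 1)]
```

The same eight values shrink to three runs once they are laid out column by column, which is the effect a compression unit is after.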

Don’t jump to the conclusion that Exadata offers columnar storage through EHCC. It is still row-based database storage, and the stress is on the word “hybrid” in hybrid columnar. The rows are placed in a Compression Unit where like columns are compressed together efficiently. Kevin Closson explains the structure of a CU in one of his blog posts (http://kevinclosson.wordpress.com/2009/09/01/oracle-switches-to-columnar-store-technology-with-oracle-database-11g-release-2/ ) as “A compression unit is a collection of data blocks. Multiple rows are stored in a compression unit. Columns are stored separately within the compression unit. Likeness amongst the column values within a compression unit yields the space savings. There are still rowids (that change when a row is updated by the way) and row locks. It is a hybrid.”.

Note that EHCC is effective only for direct path operations, i.e. operations that bypass the buffer cache.

A table or partition segment on an Exadata system can contain compression units, OLTP compressed blocks and uncompressed blocks. A CU is independent of the block and the block size, but it is always larger than a single block because it spans multiple blocks. Read performance benefits from the fact that a row can be retrieved in a single IO by picking up the specific CU instead of scanning the complete table. Hence, EHCC reduces both the storage space and the disk IOs by a considerable factor. A compression unit cannot be compressed further.

Compression Algorithms – The three compression algorithms used by EHCC are LZO, ZLIB, and BZ2. LZO is the fastest of the three and offers the lowest level of compression, while ZLIB promises a fair balance between compression ratio and CPU cost. BZ2 ensures the highest level of compression at the highest CPU cost.

CU Size – On average, a typical CU is 32k-64k in size for warehouse compression, while for archival compression the CU size is between 32k and 256k. In warehouse compression, around 1M of row data (16-20 rows depending on a row size) is analyzed for a single CU. In archival compression, around 3M to 10M of row data is analyzed to build up a CU.

EHCC types – EHCC works in two formats: warehouse compression and archival compression. Warehouse compression is aimed at data warehouse applications, and its compression ratio hovers between 6x and 10x. Archival compression suits historical data which has a low probability of updates and transactions.

EHCC DDLs – Here are a few scripts to demonstrate basic compression operations on tables

-- Create new tables/partitions with different compression techniques
create table t_comp_dwh_h ( a number ) compress for query high;
create table t_comp_dwh_l ( a number ) compress for query low;
create table t_comp_arch_h ( a number ) compress for archive high;
create table t_comp_arch_l ( a number ) compress for archive low;

-- Query the compression type of a table (table names are stored in upper case)
select compression, compress_for from user_tables where table_name = '[table name]';

-- Enable EHCC for new data only: existing rows stay as they are,
-- rows loaded later through direct path are compressed
alter table t_comp_dwh compress for query low;

-- Enable EHCC for existing data by moving (rebuilding) the segment
alter table t_comp_dwh move compress for query low;

-- Disable EHCC for future loads (rows already compressed are not rewritten)
alter table t_comp_dwh nocompress;

-- Specify multiple compression types in a single table
create table t_comp_dwh_arch
(id number,
name varchar2(100),
yr number(4))
partition by range (yr)
(partition p1 values less than (2001) compress for archive high,
partition p2 values less than (2002) compress for query high);

Language support to CU – A CU is fully compatible with indexes (b-tree and bitmap), materialized views, partitioning and Data Guard. It is fully supported with DML, DDL, parallel queries, and parallel DML and DDL. Let us examine certain operations with a CU.

Select – EHCC with Smart Scan enables query offloading to the Exadata storage servers. All read operations are performed as direct path reads, i.e. they bypass the buffer cache. If the database reads many columns of the table and runs frequent transactions, the benefits of EHCC are compromised. This is how the read operation proceeds –

A CU is buffered => Predicates processing => Predicate columns decompressed => Predicate evaluation => Reject CU’s if no row satisfies the predicate => For satisfying rows, the projected columns are decompressed => A small CU is created with only projected and predicate columns => Returned to the DB server.
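
The flow above can be sketched in Python. This is only an illustrative model under my own assumptions: a CU is represented as a dictionary of column name to list of values, “decompressing” a column is just reading its list, and the function name is made up.

```python
def scan_compression_units(units, predicate_column, predicate, projected_columns):
    """Return one small result unit per CU that has at least one matching row."""
    results = []
    for cu in units:
        # Only the predicate column is "decompressed" and evaluated first.
        predicate_values = cu[predicate_column]
        matching = [i for i, value in enumerate(predicate_values) if predicate(value)]
        if not matching:
            continue  # reject the CU: no row satisfies the predicate
        # For satisfying rows, build a small CU holding only the
        # predicate and projected columns.
        wanted = dict.fromkeys([predicate_column, *projected_columns])
        results.append({col: [cu[col][i] for i in matching] for col in wanted})
    return results


units = [
    {"id": [1, 2, 3], "region": ["EMEA", "APAC", "EMEA"], "amount": [10, 20, 30]},
    {"id": [4, 5], "region": ["APAC", "APAC"], "amount": [40, 50]},
]
print(scan_compression_units(units, "region", lambda r: r == "EMEA", ["id"]))
# [{'region': ['EMEA', 'EMEA'], 'id': [1, 3]}]
```

The second CU is rejected without its `id` or `amount` columns ever being touched, and the `amount` column of the first CU is never read either.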

Locking – When a row is locked in a compression unit, the whole compression unit is locked until the lock is released.

Inserts – Hybrid columnar compression works only at load time, and only with direct path operations. The data can be loaded with any of the data warehouse or bulk load techniques. For conventional or single row inserts, the data still resides in blocks which are either uncompressed or OLTP compressed. New CUs are created only during bulk inserts or when a table is moved to the columnar compressed state.

Updates – Updating a row in a CU causes the CU to be locked, and the row moves out of the CU to a less compressed state. This hinders concurrency on the CU and negatively affects the compression. The effect can be observed in warehouse compression, but it is certainly stronger in archival compression. The ROWID of the updated row changes after the transaction.

Delete – Every row in a CU has an uncompressed delete bit, which is checked to see whether the row has been marked as deleted.

Compression Adviser – The DBMS_COMPRESSION package serves as the compression adviser. You can find out how a row is compressed by using the DBMS_COMPRESSION.GET_COMPRESSION_TYPE subprogram. It returns a number indicating the compression technique for the input ROWID. Possible return values are 1 (No Compression), 2 (OLTP Compression), 4 (EHCC – Query high), 8 (EHCC – Query low), 16 (EHCC – Archive high) and 32 (EHCC – Archive low). In addition, the GET_COMPRESSION_RATIO subprogram can be used to estimate the compression ratio of a segment for a given compression technique and so help choose one.
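
If you pull these numbers into a script or a report, a small lookup keeps them readable. The Python helper below is just a convenience sketch built from the return values listed above; the function names are my own and are not part of DBMS_COMPRESSION.

```python
# Return values of GET_COMPRESSION_TYPE as listed in the text above.
COMPRESSION_TYPES = {
    1: "No Compression",
    2: "OLTP Compression",
    4: "EHCC - Query high",
    8: "EHCC - Query low",
    16: "EHCC - Archive high",
    32: "EHCC - Archive low",
}


def describe_compression_type(code: int) -> str:
    """Translate a numeric compression type into a readable label."""
    return COMPRESSION_TYPES.get(code, f"Unknown ({code})")


def is_ehcc(code: int) -> bool:
    """True if the code denotes one of the four EHCC levels."""
    return code in (4, 8, 16, 32)


print(describe_compression_type(8))   # EHCC - Query low
print(is_ehcc(2))                     # False
```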

Critical look
EHCC is one of the most discussed SMART features of Exadata database systems. It promises to provide at least 10x storage benefits, though certain benchmarks have shown better results too. A famous claim which I see in almost every other session on EHCC goes like this: a 100TB database can be compressed to 10TB, saving 90TB of space on the storage, so 9 other databases of size 100TB can be placed on the same storage, and the IT management can be relieved of storage purchases for at least 3-4 years assuming the data grows by a factor of two. I’ll say the claim looks pretty convincing from the marketing perspective but quite impractical on technical grounds. I would rather read it as: 1000TB of historical data can be accommodated on 100TB of storage.
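
For what it is worth, the arithmetic behind the claim is easy to reproduce. In the Python sketch below, the 10x ratio and the 100TB figures come from the claim itself, while the reading of “grows by a factor of two” as doubling once a year is my assumption; the function names are made up.

```python
def compressed_size_tb(raw_tb: float, ratio: float) -> float:
    """Size on disk after compressing raw_tb at the given ratio."""
    return raw_tb / ratio


def years_until_full(start_tb: float, capacity_tb: float, yearly_growth: float = 2) -> int:
    """Whole years the data still fits if it grows by yearly_growth each year."""
    years = 0
    size = start_tb
    while size * yearly_growth <= capacity_tb:
        size *= yearly_growth
        years += 1
    return years


storage_tb = 100
one_db_tb = compressed_size_tb(100, 10)          # 10.0 TB after 10x compression
print(int(storage_tb // one_db_tb))              # 10 databases fit in total
print(years_until_full(one_db_tb, storage_tb))   # 3 years of doubling for one database
```

So the numbers hold together on paper. Whether a real, actively updated 100TB database ever reaches a 10x ratio is a different question.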

A lot has been written and discussed on whether Oracle is on the way to embracing columnar storage techniques. I’ll say NO, because EHCC looks like just an application of the concept, and a harmless one. The biggest hurdle for the EHCC feature is its own comfort zone, i.e. databases with few transactions and low concurrency. On a database which frequently runs transactions and reads the data, the feature stands defeated.

Comments, corrections, observations and feedback on this write-up are welcome.