Numbers don’t lie. They empower you to be smart and stay decisive. Recently, I stumbled upon the Bookmetrix portal, which publishes chapter-level metrics for a book. The metrics are straightforward enough to reveal a book’s most loved chapters. Although the numbers were not skewed by a wide margin, I noticed that “Data Lake Ingestion Strategies” has been a talking point among readers. So I have decided to make the full chapter available as a free download.
You can download the free chapter here – Data Lake Ingestion strategies. Here is a quick synopsis –
What is data ingestion?
A data ingestion framework captures data from multiple data sources and ingests it into a big data lake. The framework securely connects to different sources, captures the changes, and replicates them in the data lake. It keeps the data lake consistent with the data changes at the source systems, thus making the lake a single station of enterprise data.
A standard ingestion framework consists of two components: a Data Collector and a Data Integrator. The data collector is responsible for collecting or pulling data from a data source, while the data integrator takes care of ingesting that data into the data lake. The design and implementation of both components can vary with the big data technology stack.
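To make the two-component split concrete, here is a minimal Python sketch. It is only an illustration, not the chapter's implementation: the class names `DataCollector` and `DataIntegrator` mirror the terms above, but the `collect` and `integrate` methods, the in-memory list standing in for a source, and the dict standing in for the data lake are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class DataCollector:
    """Pulls records from a source (hypothetical: an in-memory list stands in for a database)."""
    source: List[dict]

    def collect(self) -> Iterable[dict]:
        # Yield a copy of each record so the source is never mutated downstream.
        for record in self.source:
            yield dict(record)


@dataclass
class DataIntegrator:
    """Writes collected records into the lake (hypothetical: a dict keyed by record id)."""
    lake: Dict[int, dict] = field(default_factory=dict)

    def integrate(self, records: Iterable[dict]) -> int:
        count = 0
        for record in records:
            self.lake[record["id"]] = record  # upsert by key
            count += 1
        return count


source_rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
collector = DataCollector(source_rows)
integrator = DataIntegrator()
ingested = integrator.integrate(collector.collect())
print(ingested, sorted(integrator.lake))
```

Keeping the two roles behind separate classes means either side can be swapped out (a different source connector, a different lake writer) without touching the other, which is the flexibility the paragraph above refers to.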
Before we turn to ingestion challenges and principles, let us look at the operating modes of data ingestion. It can operate in either real-time or batch mode. As the names suggest, real-time mode means that changes are applied to the data lake as soon as they happen, while batch mode applies the changes in batches. Note that real-time ingestion, despite its name, still has some lag between the change event and its application. For this reason, real-time can fairly be understood as near real-time. The factors that determine the ingestion operating mode are the data change rate at the source and the volume of that change. Data change rate is the count of changes occurring every hour.
For real-time ingestion mode, a change data capture (CDC) system can meet the ingestion requirements. The CDC framework reads changes from the transaction logs and replicates them in the data lake. Data latency between the capture and integration phases is minimal. Software vendors such as Oracle, HVR, Talend, Informatica, Pentaho, and IBM provide data integration tools that operate in real-time mode.
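The log-replay idea behind CDC can be sketched in a few lines of Python. This is a simplified illustration, not how any of the vendor tools work internally: the log entry layout (`lsn`, `op`, `key`, `row`), the function name `apply_changes`, and the dict used as the lake are all hypothetical, and the log is assumed to be ordered by `lsn`.

```python
from typing import Dict, List


def apply_changes(log: List[dict], lake: Dict[int, dict], last_lsn: int) -> int:
    """Replay log entries newer than last_lsn into the lake and return the new position."""
    for entry in log:
        if entry["lsn"] <= last_lsn:
            continue  # already replicated in an earlier pass
        if entry["op"] == "delete":
            lake.pop(entry["key"], None)
        else:
            # Inserts and updates are both treated as upserts in this sketch.
            lake[entry["key"]] = entry["row"]
        last_lsn = entry["lsn"]
    return last_lsn


# Hypothetical transaction log, ordered by log sequence number (lsn).
transaction_log = [
    {"lsn": 1, "op": "insert", "key": 10, "row": {"name": "alice"}},
    {"lsn": 2, "op": "update", "key": 10, "row": {"name": "alicia"}},
    {"lsn": 3, "op": "insert", "key": 11, "row": {"name": "bob"}},
    {"lsn": 4, "op": "delete", "key": 11, "row": None},
]

cdc_lake: Dict[int, dict] = {}
position = apply_changes(transaction_log, cdc_lake, last_lsn=0)
print(position, cdc_lake)
```

Tracking the last applied position is what lets the replay resume after a restart without applying the same change twice.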
In batched ingestion mode, changes are captured and persisted at a defined time interval and then applied to the data lake in chunks. Data latency is the time gap between the capture and integration jobs.
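A small Python sketch can show how the batch interval drives latency. It rests on assumptions made purely for illustration: changes are `(capture_time_in_seconds, key, value)` tuples, each batch is applied exactly at the end of its interval window, and the function name `batch_ingest` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Change = Tuple[int, str, str]  # (capture time in seconds, key, value)


def batch_ingest(changes: List[Change], interval: int) -> Tuple[Dict[str, str], List[int]]:
    """Group changes into fixed windows, apply each window as one chunk, and record latencies."""
    batches: Dict[int, List[Change]] = defaultdict(list)
    for captured_at, key, value in changes:
        # A change waits until the end of the window in which it was captured.
        window_end = (captured_at // interval + 1) * interval
        batches[window_end].append((captured_at, key, value))

    lake: Dict[str, str] = {}
    latencies: List[int] = []
    for window_end in sorted(batches):
        for captured_at, key, value in batches[window_end]:
            lake[key] = value  # later changes overwrite earlier ones for the same key
            latencies.append(window_end - captured_at)
    return lake, latencies


captured = [(5, "a", "v1"), (50, "b", "v1"), (65, "a", "v2")]
batch_lake, latencies = batch_ingest(captured, interval=60)
print(batch_lake, latencies)
```

With a 60-second interval, a change captured early in a window waits almost the whole interval before it reaches the lake, so shortening the interval is the main lever for reducing batch latency.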
The chapter talks about batched/real-time Data Ingestion Framework considerations, challenges, different file formats and their use-cases, LinkedIn’s databus approach, Apache Sqoop, Apache Flume, and some interesting concepts.