Practical Enterprise Data Lake Insights made it to the Best New Big Data Books


I’m happy to announce that my book, “Practical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake”, made it to BookAuthority’s Best New Big Data Books.
BookAuthority collects and ranks the best books in the world, and it is a great honor to get this kind of recognition. Thank you for all your support!
The book is available for purchase on Amazon.

My first book, “Oracle Advanced PL/SQL Developer Professional Guide”, was also recognized as the 10th best seller in the “Oracle Certification” category. Find it on Amazon.

Thanks to the readers and BookAuthority!






Data Lake Ingestion strategies

Numbers don’t lie; they empower you to be smart and stay decisive. Recently, I stumbled upon the Bookmetrix portal, which publishes chapter-by-chapter metrics for a book. The metrics make it straightforward to discover a book’s most-read chapters. Although the numbers were not skewed by a high margin, I realized that “Data Lake Ingestion Strategies” has been the talk among readers. With that in mind, I have decided to make the full chapter available as a free download.

You can download the free chapter here – Data Lake Ingestion strategies. Here is a quick synopsis –

What is data ingestion?

A data ingestion framework captures data from multiple data sources and ingests it into the big data lake. The framework securely connects to the different sources, captures the changes, and replicates them in the data lake. The ingestion framework keeps the data lake consistent with the data changes at the source systems, thus making it a single station for enterprise data.

A standard ingestion framework consists of two components, namely, the Data Collector and the Data Integrator. While the data collector is responsible for collecting or pulling the data from a data source, the data integrator component takes care of ingesting the data into the data lake. The implementation and design of the collector and integrator components can be flexible, depending on the big data technology stack.
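To make the split concrete, here is a minimal Python sketch of the two components; the class and method names are illustrative only, not from any specific framework:

```python
# Minimal sketch of the two-component ingestion framework.
# DataCollector pulls change records from a source system;
# DataIntegrator ingests them into the data lake (a dict here).

class DataCollector:
    """Pulls (or receives) change records from a source system."""
    def __init__(self, source_records):
        self.source_records = source_records

    def collect(self):
        # A real framework would connect securely to the source
        # and capture changes; here we simply yield records.
        yield from self.source_records


class DataIntegrator:
    """Applies collected records to the data lake."""
    def __init__(self):
        self.lake = {}

    def ingest(self, record):
        # Apply the change so the lake stays consistent with the source.
        self.lake[record["key"]] = record["value"]


def run_pipeline(source_records):
    collector = DataCollector(source_records)
    integrator = DataIntegrator()
    for record in collector.collect():
        integrator.ingest(record)
    return integrator.lake
```

For example, running the pipeline over two successive changes to the same key leaves the lake holding only the latest value, mirroring how the lake stays consistent with the source.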

Before we turn our discussion to ingestion challenges and principles, let us look at the operating modes of data ingestion. Ingestion can operate in either real-time or batch mode. As the names suggest, real-time mode means that changes are applied to the data lake as soon as they happen, while batch mode applies the changes in batches. Note that real-time ingestion, despite its name, has its own share of lag between the change event and its application; for this reason, it is more fairly understood as near real-time. The factors that determine the ingestion operating mode are the rate of data change at the source and the volume of that change. The data change rate is a measure of the number of changes occurring every hour.

For real-time ingestion mode, a change data capture (CDC) system can satisfy the ingestion requirements. A change data capture framework reads changes from transaction logs and replicates them in the data lake. Data latency between the capture and integration phases is minimal. Leading software vendors such as Oracle, HVR, Talend, Informatica, Pentaho, and IBM provide data integration tools that operate in real-time mode.

In batched ingestion mode, changes are captured and persisted at a defined interval and applied to the data lake in chunks. Data latency is the time gap between the capture and integration jobs.
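The two modes can be contrasted with a small, purely illustrative sketch: the same stream of change events applied one at a time (near real-time) versus accumulated and applied in chunks (batched). The function names are hypothetical.

```python
def apply_realtime(changes, lake):
    """Apply each change as soon as it arrives (near real-time mode)."""
    for key, value in changes:
        lake[key] = value
    return lake


def apply_batched(changes, lake, batch_size):
    """Persist changes into fixed-size batches, then apply each
    batch to the lake in one chunk (batched mode)."""
    batch = []
    for change in changes:
        batch.append(change)
        if len(batch) == batch_size:
            lake.update(dict(batch))  # one bulk apply per interval
            batch = []
    if batch:                         # flush the trailing partial batch
        lake.update(dict(batch))
    return lake
```

Both paths converge on the same final state; what differs is the latency between a change being captured and it appearing in the lake.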

The chapter covers batched/real-time data ingestion framework considerations, challenges, different file formats and their use cases, LinkedIn’s Databus approach, Apache Sqoop, Apache Flume, and some other interesting concepts.


For the rest of the chapters, you can purchase the paperback or Kindle copy from Amazon or Apress.





Hey All,

Last week brought some really proud moments. I was one of only 100 winners of the NEXT100 CIO 2018 award. NEXT100 is an annual awards program instituted by IT Next magazine and the 9.9 Group that aims to identify 100 experienced IT managers who have the skills, talent, and spirit to become CIOs. The program invites aspirants to nominate themselves for consideration and qualification for the award.

The process requires registrants to submit their applications, undergo psychometric tests, secure recommendations, and clear jury interviews. The jury panel is drawn from elite senior leadership across the industry, so you get a fair chance to present yourself in a genuinely open conversation.

What ended well must have started well. My manager and I were notified by email that our organization had a NEXT100 CIO winner. It was a great moment, and wishes started pouring in. Within a week, I attended the felicitation ceremony in Delhi along with the other fellow winners (two of them from my organization).

My NEXT100 CIO profile –

I really feel honored to be part of this select pool of future CIOs. More than the role tagged with the award, it is the rigorous selection process and gracious handling that make it eminent and reputable.

The felicitation ceremony was well planned: a great venue, great people, and winners from disparate industry verticals, with ample opportunities to network and make some C-suite friends.

I would like to thank IT Next for this honor. Thanks for helping at each and every step of the process. Thanks to all mentors I had in my career – who inculcated the attributes to lead and drive.

List of all NEXT100 CIO 2018 winners –

Grab your NEXT100 CIO 2018 Coffee book here –


Design considerations to replicate Data Lake site for Availability

Why replicate a data lake? Building an enterprise data lake demands heavy lifting in terms of cost, planning, technical design, architecture, and operational excellence. Building another infrastructure as a standby shoots up both capex and opex. Therefore, before proceeding to the planning phase, one must have a strong justification and a clear objective for setting up a replica. At a high level, two parameters justify a replication exercise: availability and recoverability.

You achieve availability by provisioning additional redundant resources so that a fault or outage can be tolerated with no, or minimally accepted, loss to the business. Recoverability is achieved through an alternate standby site that holds the “as-of-outage” state of the data and can be restored within permissible time limits.

High availability of a site can be achieved by planning high availability for its member components. Recoverability addresses the bigger concern of failing an entire site over to its standby; therefore, availability happens to be an eventual subset of recoverability. An active-passive site is an optimal approach that achieves both recoverability and availability. An active-active setup is an idealistic state of ‘nirvana’: it presents nearly identical active replicas to business users while each site treats the other as its standby.

Disaster recovery factors

At a high level, a disaster recovery strategy involves a backup site and a switchover strategy. The nature of backup for disaster recovery is slightly different, as the expectation from disaster recovery is to cope with critical incidents. Let us list the factors that play a part in formulating an efficient disaster recovery strategy.

  1. Understand data sources and data awareness – While setting up a disaster recovery site, it is always a good idea to understand the ingredients of the data lake. How critical are the systems of record? Where do the source systems reside? What is the impact if the data mirror layer is lost?
  2. Copying versus mirroring – The backup mode is an essential parameter of the restoration exercise from the disaster recovery site. Mirror images restore faster than backup copies.
  3. Backup frequency – The rate of data change and the service level agreements determine the frequency at which data flows into the disaster recovery site.

Dual Path ingestion vs Primary replication

Dual-path ingestion is a two-way (T-like) ingestion from the source to the primary as well as the standby at the same time. In other words, a single capture process is integrated into two different targets. Note that only the mirror layer is maintained on the standby, not the rest of the data zones.

On the other hand, primary replication is a one-way replication from the primary site to its standby. The benefit of primary replication is that all layers of the data lake are copied over to the standby.
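A toy sketch contrasting the two topologies, with plain dictionaries standing in for data-lake zones (the function and zone names are hypothetical):

```python
def dual_path_ingest(change, primary, standby_mirror):
    """Dual path: one capture process fans out to two targets.
    Only the mirror layer is maintained on the standby."""
    key, value = change
    primary["mirror"][key] = value
    standby_mirror[key] = value    # standby receives only the mirror zone


def primary_replication(primary, standby):
    """Primary replication: one-way copy of every data-lake layer
    from the primary site to the standby site."""
    for zone, data in primary.items():
        standby[zone] = dict(data)  # all zones are replicated
```

In the dual-path case only the standby's mirror tracks the source, whereas primary replication carries every zone (mirror, processed, consumption, and so on) over to the standby.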

Data replication strategies

  1. DistCp – DistCp, or distributed copy, is one of the common data copy solutions for Hadoop file systems, either within a data center or across remote data centers. It uses MapReduce under the hood for data distribution and recovery. It translates the list of directories and files under a namespace into map tasks, and the task trackers copy them over to the target namespace.

The DistCp utility is a MapReduce operation. It consumes mapper slots, which may impact business operations in the data lake. In addition, since each data node on the source site must have write access to the target site, the communication pattern between the two clusters is SN*TN [SN is the count of source data nodes, TN is the count of target data nodes]. If the communication channel is not configured one-to-one among the data nodes, data replication from source to target may be impacted.

Another key consideration for DistCp usage is the Hadoop version on the source and target. With the hdfs:// connection protocol, the source and target versions must be the same. To enable version-independent transfer between the source and target sites, transfer the data over HTTP by using webhdfs://. Another method of enabling HTTP-based transfer is the httpfs:// protocol, which uses the HttpFS proxy daemon for cluster communication. However, keep in mind that both webhdfs:// and httpfs:// need to be configured manually and are relatively slower than the native hdfs:// connection.
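The SN*TN communication pattern, and DistCp's translation of a file listing into map tasks, can be illustrated with a small sketch. The helper functions below are hypothetical simplifications, not the actual DistCp code:

```python
from itertools import product


def communication_pairs(source_nodes, target_nodes):
    """Every source data node may need write access to every target
    data node, giving SN * TN possible communication channels."""
    return list(product(source_nodes, target_nodes))


def plan_map_tasks(files, num_mappers):
    """DistCp-style planning: distribute a file listing across
    mapper slots (round-robin here for simplicity)."""
    tasks = [[] for _ in range(num_mappers)]
    for i, f in enumerate(files):
        tasks[i % num_mappers].append(f)
    return tasks
```

For two source nodes and three target nodes, the sketch yields six channel pairs, which is exactly why a partially configured network between the clusters can stall replication.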

  2. HDFS snapshots – Snapshots represent the state of the data lake at a point in time. HDFS snapshots can build an as-is image of the data lake, and they can be used to recover data after accidental losses.
  3. Hive metastore replication – Hive supports metastore replication to other clusters with simple configuration in the hive-site.xml file. Although custom replication frameworks are possible, by default the system uses the org.apache.hive.hcatalog.api.repl.exim.EximReplicationTaskFactory implementation for data capture, movement, and ingestion commands.
  4. Kafka MirrorMaker – An Apache Kafka service that acts as a consumer on the active Kafka cluster and as a producer to the standby Kafka cluster.
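MirrorMaker's consume-then-produce loop can be mimicked with in-memory queues; this is a sketch of the idea only, not the actual Kafka API:

```python
from collections import deque


def mirror(active_topic, standby_topic):
    """Drain the active cluster's topic (acting as a consumer) and
    publish every message to the standby cluster's topic (acting as
    a producer), preserving order -- MirrorMaker-style."""
    while active_topic:
        message = active_topic.popleft()  # consume from active cluster
        standby_topic.append(message)     # produce to standby cluster


active = deque(["evt-1", "evt-2", "evt-3"])
standby = deque()
mirror(active, standby)
# standby now holds the three events in their original order
```

In a real deployment, MirrorMaker runs continuously with consumer and producer configurations pointing at the two clusters; the queue version only captures the ordering guarantee of the relay.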

Companies with a global outreach strive for data availability for globally spread-out teams. They require a robust data replication strategy that can handle geographical challenges and voluminous data sets in near real-time (true real-time would be incredibly welcome, though!).

We will continue our discussion in Part-2 of this series.


Achieve Data Democracy with effective Data Integration

For the fourth straight year, I was pleased to be part of AIOUG ODevCYatra as a speaker. This year, AIOUG re-branded the years-old OTNYathra as ODevCYatra to align its focus with the large developer community. The thought and execution remained the same, though: 7 cities, 35 speakers, and a very tight schedule. I was part of the Bengaluru leg of AIOUG ODevCYatra.

My session

“Achieve Data Democratization with effective Data Integration”

Data lake is a relatively new term compared to all the fancy ones coined since the industry realized the potential of data. Enterprises are bending over backwards to build a stable data strategy and take a leap towards data democratization. Traditional approaches to data pipelines, data processing, and data security still hold true, but architects need to go an extra mile while designing a big data lake. This session focuses on data integration design considerations. We will discuss the relevance of data democratization in the organizational data strategy.

Learning objectives

✓ Data Lake architectural styles
✓ Implement data lake for democratization
✓ Data Ingestion framework principles

Response – Trust me, I had expected a very low turnout because the topic sounded out of the blue amidst a pro-Oracle event. In an event focused on Oracle cloud offerings, in-memory, tuning, and the autonomous data warehouse, it was a risk to take on big data and data lakes (not at Sangam, though!). The hunt for classical data architects was on until noon. However, to my surprise, ten attendees left the likes of the DBIM real-time analytics and cloud migration sessions to attend mine. I’m thankful to all those who attended my session, and I think we had some very healthy discussions on data lakes, data governance, and data democracy.

Session download – My session deck is available here –

If you have any questions regarding the session or content, feel free to comment or mail. I will be happy to discuss.

See you all again in events like this!

What to expect from your Data Lake

Big data IT is driven by competition. Organizations want to exploit the power of data analytics at a manageable cost to stand out from their competitors. A data lake provides a scalable framework to store massive volumes of data and churn out analytical insights that can help in effective decision making and growing new markets. It brings a shift of focus from a protracted and laborious infrastructure planning exercise to data-driven architecture. The data ingestion and processing framework, rather than just the storage, becomes the cornerstone.

Let us understand the key characteristics of an enterprise data lake –

  1. The data lake must be built on a scalable and fault-tolerant framework – the data lake concept focuses on simplification and cost reduction without compromising quality and availability. Apache Hadoop provides cost benefits by running on commodity machines while bringing resiliency and scalability as well.
  2. Availability – data in the data lake must be accurate and available to all consumers as soon as it is ingested.
  3. Accessibility – a shared access model ensures data can be accessed by all applications. Unless required at the consumption layer, data shards are not a recommended design within the data lake. Another key aspect is data privacy and exchange regulations; the data governance council is expected to formulate norms on data access, privacy, and movement.
  4. Data governance policies must not enforce constraints on data – data governance intends to control the level of democracy within the data lake. Its sole purpose is to maintain quality through audits, compliance, and timely checks. Data flow, whether by size or quality, must not be constrained by governance norms.
  5. Data in the data lake should never be disposed of. The data-driven strategy must define steps to version the data and to handle deletes and updates from the source systems.
  6. Support for in-place data analytics – the data lake is a singular view across all source systems that empowers in-house data analytics. Downstream applications can extract data from the consumption layer to feed disparate applications.
  7. Data security – security is a critical piece of the data lake. Enterprises can start with a reactive approach and strive for proactive ways to detect vulnerabilities. Data-centric security models should be capable of building real-time risk profiles that help detect anomalies in user access or privileges.
  8. Archival strategy – as part of the information lifecycle management (ILM) strategy, data retention policies must be created. The retention of data that resides in relatively “cold” regions of the lake must be given thought. Although storage is not a big deal in the Hadoop world, storage consumption multiplied by new data exploration makes a data archival strategy a wise investment.

Another perspective that comes with building a data lake is simplified infrastructure. Organizations spend a lot on building reliable stacks for different natures of data and usually follow a best-fit approach to manage it. For example, relational databases have been the industry de facto standard for structured data for ages, while traditional file systems have been widely used for the semi-structured and unstructured data coming through sensors, web logs, and social media. At the end of the day, organizations end up with data marts bracketed by data structure, use cases, and customer needs, but incur high capital and operational expenses.

P.S. The post is an excerpt from Apress’s “Practical Enterprise Data Lake Insights”


#Apress “Practical Enterprise Data Lake Insights” – Published!

Hello All,

It gives me immense pleasure to announce the release of our book “Practical Enterprise Data Lake Insights” with Apress. The book takes an end-to-end solution approach in a data lake environment that includes data capture, processing, security, and availability. Credits to the co-author of the book, Venkata Giri, and the technical reviewer, Sai Sundar.

The book is now available through various channels as a subscription, in print (on request!), and as an e-book (e.g., Amazon Kindle, Barnes & Noble Nook). Below are the Apress and Amazon links –

Apress –

Amazon –

Thank you for all your confidence, support, and encouragement. Thanks to Monica Caldas, CIO (GE Transportation), for helping us with the Foreword.

Brief about the book –

When designing an enterprise data lake, you often hit a roadblock when you must leave the comfort of the relational world and learn the nuances of handling non-relational data. Starting from sourcing data into the Hadoop ecosystem, you will go through stages that raise tough questions about data processing, data querying, and security. Concepts such as change data capture and data streaming are covered. The book takes an end-to-end solution approach in a data lake environment that includes data security, high availability, data processing, data streaming, and more.

Each chapter includes application of a concept, code snippets, and use case demonstrations to provide you with a practical approach. You will learn the concept, scope, application, and starting point. Use this practical guide to successfully handle the challenges encountered when designing an enterprise data lake and learn industry best practices to resolve issues.


What You’ll Learn:

  • Get to know data lake architecture and design principles
  • Implement data capture and streaming strategies
  • Implement data processing strategies in Hadoop
  • Understand the data lake security framework and availability model

Grab your copies fast. Enjoy reading!