Design considerations for replicating a Data Lake site for Availability

Why replicate a data lake? Building an enterprise data lake demands heavy lifting in terms of cost, planning, technical design, architecture, and operational excellence. Building another infrastructure as a standby drives up both capex and opex. Therefore, before proceeding to the planning phase, one must have a strong justification and a clear objective for setting up a replica (or replicas). At a high level, two parameters justify a replication exercise – availability and recoverability.

Availability is achieved by keeping a redundant supply of resources so that a fault or an outage can be tolerated with no loss, or a minimally accepted loss, to the business. Recoverability is achieved through an alternate standby site that holds the “as-of-outage” state of data and can be restored within permissible time limits.

High availability of a site is achieved by planning high availability for each of its member components. Recoverability addresses the bigger concern of failing an entire site over to its standby; availability, therefore, ends up being a subset of recoverability. An active-passive site is an optimal approach that achieves both recoverability and availability. An active-active setup is the idealistic state of ‘nirvana’: it exposes fully active replicas to business users while each site acts as the other’s standby.

Disaster recovery factors

At a high level, a disaster recovery strategy involves a backup site and a switchover strategy. The nature of backup for disaster recovery is slightly different because the expectation is to cope with critical incidents. Let us list the factors that play a part in formulating an efficient disaster recovery strategy.

  1. Understand data sources and data awareness – While setting up a disaster recovery site, it is always a good idea to understand the ingredients of the data lake. How critical are the systems of record? Where do the source systems exist? What is the impact if the data mirror layer is lost?
  2. Copying versus mirroring – The backup mode is an essential parameter of the restoration exercise from a disaster recovery site. Mirror images restore faster than backup copies.
  3. Backup frequency – The rate of data change and the service level agreements determine the frequency at which data flows into the disaster recovery site.

Dual-path ingestion vs. primary replication

Dual-path ingestion is a two-way (T-like) ingestion from the source into the primary and the standby at the same time: a single capture process delivers to two different targets. Note that only the mirror layer is maintained on the standby, and not the rest of the data zones.

Primary replication, on the other hand, is a one-way replication from the primary site to its standby. Its benefit is that all the layers of the data lake are copied over to the standby.

Data replication strategies

  1. DistCp – DistCp, or Distributed Copy, is one of the most common data copy solutions for Hadoop file systems, either within a data center or across remote data centers. It uses MapReduce under the hood for data distribution and recovery. It translates the list of directories and files under a namespace into map tasks, which copy them over to the target namespace.

The DistCp utility runs as a MapReduce job. It consumes mapper slots, which may impact business operations in the data lake. In addition, since each data node on the source site must have write access to the target site, the communication pattern between the two clusters is SN*TN [SN is the count of source data nodes, TN is the count of target data nodes]. If the communication channels among the data nodes are not fully configured, data replication from source to target may be impacted.

Another key consideration for DistCp usage is the Hadoop version on the source and the target. With the hdfs:// connection protocol, the source and target versions must be the same. To enable version-independent transfer between the source and target sites, transfer data over HTTP by using webhdfs://. Another method of enabling HTTP-based transfer is the httpfs:// protocol, which uses an HttpFS proxy daemon for cluster communication. Keep in mind, however, that both webhdfs:// and httpfs:// need to be configured manually and are relatively slower than the native hdfs:// connection.
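
As an illustration, a bare-bones DistCp run could look like the sketch below; the host names, ports, and paths are placeholders and will differ per cluster layout, Hadoop version, and distribution.

# Same Hadoop version on both sides: native hdfs:// at source and target
hadoop distcp -update hdfs://source-nn:8020/data/mirror hdfs://target-nn:8020/data/mirror

# Mixed Hadoop versions: read from the source over HTTP via webhdfs://
hadoop distcp -update webhdfs://source-nn:50070/data/mirror hdfs://target-nn:8020/data/mirror

# Cap the mappers (-m) so the copy does not starve business-critical jobs on the lake
hadoop distcp -m 20 -update hdfs://source-nn:8020/data/mirror hdfs://target-nn:8020/data/mirror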

  2. HDFS snapshots – Snapshots represent the state of the data lake at a point in time. HDFS snapshots can build an as-is image of the data lake, and they can be used to stitch data back together after accidental losses (a command sketch follows this list).
  3. Hive metastore replication – Hive supports metastore replication to other clusters with simple configuration in the hive-site.xml file. Custom replication frameworks are possible, but by default the system uses the org.apache.hive.hcatalog.api.repl.exim.EximReplicationTaskFactory implementation for the data capture, movement, and ingestion commands.
  4. Kafka MirrorMaker – an Apache Kafka service that acts as a consumer in the active Kafka cluster and as a producer to the standby Kafka cluster (see the sketch below).
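
Both HDFS snapshots and Kafka MirrorMaker are driven from the command line. The sketch below is only illustrative; the directory, snapshot names, topic pattern, and property files are placeholders.

# Allow snapshots on a directory (admin command), then capture a point-in-time image
hdfs dfsadmin -allowSnapshot /data/mirror
hdfs dfs -createSnapshot /data/mirror snap-20180801
# Snapshots are exposed under the read-only .snapshot directory
hdfs dfs -ls /data/mirror/.snapshot
# Compare two existing snapshots to see what changed in between
hdfs snapshotDiff /data/mirror snap-20180801 snap-20180802

# MirrorMaker: consume from the active Kafka cluster, produce to the standby cluster
kafka-mirror-maker.sh --consumer.config active-consumer.properties \
  --producer.config standby-producer.properties \
  --whitelist "lake.*"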

Companies with a global outreach strive for data availability for globally spread-out teams. They require a robust data replication strategy that can overcome geographical challenges and handle voluminous data sets in near real time (real time would be incredibly welcome, though!).

We will continue our discussion in Part-2 of this series.

 


Achieve Data Democracy with effective Data Integration

For the fourth straight year, I was pleased to be part of the AIOUG ODevCYatra as a speaker. This year, AIOUG re-branded the long-running OTNYathra as ODevCYatra to align its focus with the ever-growing developer community. The thought and execution remained the same, though – 7 cities, 35 speakers, and a very tight schedule. I was part of the Bengaluru leg of AIOUG ODevCYatra.

My session

“Achieve Data Democratization with effective Data Integration”

Data lake is a relatively new term compared to all the fancy ones coined since the industry realized the potential of data. Enterprises are bending over backwards to build a stable data strategy and take a leap towards data democratization. Traditional approaches to data pipelines, data processing, and data security still hold true, but architects need to go the extra mile while designing a big data lake. This session will focus on data integration design considerations. We will discuss the relevance of data democratization in the organizational data strategy.

Learning objectives

✓ Data Lake architectural styles
✓ Implement data lake for democratization
✓ Data Ingestion framework principles

Response – Trust me, I had expected a very low turnout because the topic did sound out of the blue amidst a pro-Oracle event. In an event focused on Oracle cloud offerings, in-memory, tuning, and the autonomous data warehouse, it was a risk to take on big data and data lakes (not in Sangam though!). The hunt was on for classical data architects until noon. However, to my surprise, there were 10 heads who left the likes of the DBIM real-time analytics and cloud migration sessions to attend mine. I’m thankful to all those who attended my session, and I think we got into some very healthy discussions on data lakes, data governance, and data democracy.

Session download – My session deck is available here – http://www.aioug.org/ODevCYatra/2018/SaurabhGupta_ODevCYatra-DataInDataLake.pdf

If you have or had any questions regarding the session or content, feel free to comment or mail. Will be happy to discuss.

See you all again in events like this!

What to expect from your Data Lake

Big data IT is driven by competition. Organizations want to exploit the power of data analytics at a manageable cost to stand out from their competitors. A data lake provides a scalable framework to store massive volumes of data and churn out analytical insights that can help in effective decision making and in growing new markets. It shifts the focus from a protracted and laborious infrastructure planning exercise to a data-driven architecture, where the data ingestion and processing framework becomes the cornerstone rather than just the storage.

Let us understand the key characteristics of an enterprise data lake –

  1. Data lake must be built using a scalable and fault-tolerant framework – the data lake concept focuses on simplification and cost reduction without compromising on quality and availability. Apache Hadoop provides cost benefits by running on commodity machines and brings resiliency and scalability as well.
  2. Availability – data in the data lake must be accurate and available to all consumers as soon as it is ingested.
  3. Accessibility – a shared access model ensures data can be accessed by all applications. Unless required at the consumption layer, data shards are not a recommended design within the data lake. Another key aspect is data privacy and exchange regulations; the data governance council is expected to formulate norms on data access, privacy, and movement.
  4. Data governance policies must not enforce constraints on data – data governance intends to control the level of democracy within the data lake. Its sole purpose is to maintain quality through audits, compliance, and timely checks. Data flow, whether by size or quality, must not be constrained by governance norms.
  5. Data in the data lake should never be disposed of. The data-driven strategy must define steps to version the data and handle deletes and updates from the source systems.
  6. Support for in-place data analytics – the data lake provides a single consolidated view across all source systems to empower in-house data analytics. Downstream applications can still extract data from the consumption layer to feed a disparate application.
  7. Data security – security is quite a critical piece of the data lake. Enterprises can start with a reactive approach and strive for proactive ways to detect vulnerabilities. Data-centric security models should be capable of building a real-time risk profile that can help detect anomalies in user access or privileges.
  8. Archival strategy – as part of the Information Lifecycle Management (ILM) strategy, data retention policies must be created. The retention of data that resides in the relatively “cold” region of the lake must be given a thought. This may not sound like a big deal in the Hadoop world, but storage consumption, multiplied by new data exploration, makes a data archival strategy worth formulating early (a sample command sketch follows this list).
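
As a rough illustration of acting on such a retention policy in a Hadoop-backed lake (the path below is a placeholder, and an ARCHIVE storage tier must already be configured on the data nodes), HDFS storage policies offer one way to push cold data onto cheaper storage:

# Tag a cold region of the lake with the COLD storage policy and verify it
hdfs storagepolicies -setStoragePolicy -path /data/lake/cold -policy COLD
hdfs storagepolicies -getStoragePolicy -path /data/lake/cold
# Move blocks that already exist under the path to the archive tier
hdfs mover -p /data/lake/cold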

Another perspective that comes with building a data lake is simplified infrastructure. Organizations spend a lot on building a reliable stack for each nature of data and usually follow a best-fit approach to manage it. For example, relational databases have been the industry de facto standard for structured data for ages, while traditional file systems have been widely used for the semi-structured and unstructured data coming through sensors, web logs, and social media. At the end of the day, organizations end up with data marts bracketed by data structure, use cases, and customer needs, and they incur high capital and operational expenses.

P.S. The post is an excerpt from Apress’s “Practical Enterprise Data Lake Insights”

 

#Apress “Practical Enterprise Data Lake Insights” – Published!

Hello All,

It gives me immense pleasure to announce the release of our book “Practical Enterprise Data Lake Insights” with Apress. The book takes an end-to-end solution approach in a data lake environment that includes data capture, processing, security, and availability. Credits to the co-author of the book, Venkata Giri, and the technical reviewer, Sai Sundar.

The book is now available through various channels: by subscription, in print (on request!), and as an e-book (e.g., Amazon/Kindle, Barnes & Noble/Nook, Apress.com). Below are the Apress and Amazon links –

Apress – https://www.apress.com/gb/book/9781484235218

Amazon – https://www.amazon.com/Practical-Enterprise-Data-Lake-Insights/dp/1484235215/

Thank you for all your confidence, support, and encouragement. Thanks to Monica Caldas, CIO (GE Transportation), for helping us with the Foreword.

Brief about the book –

When designing an enterprise data lake you often hit a roadblock when you must leave the comfort of the relational world and learn the nuances of handling non-relational data. Starting from sourcing data into the Hadoop ecosystem, you will go through stages that can bring up tough questions such as data processing, data querying, and security. Concepts such as change data capture and data streaming are covered. The book takes an end-to-end solution approach in a data lake environment that includes data security, high availability, data processing, data streaming, and more.

Each chapter includes application of a concept, code snippets, and use case demonstrations to provide you with a practical approach. You will learn the concept, scope, application, and starting point. Use this practical guide to successfully handle the challenges encountered when designing an enterprise data lake and learn industry best practices to resolve issues.

 

What You’ll Learn:

  • Get to know data lake architecture and design principles
  • Implement data capture and streaming strategies
  • Implement data processing strategies in Hadoop
  • Understand the data lake security framework and availability model

Grab your copies fast. Enjoy reading!

Saurabh

Harness the Power of Data in a Big Data Lake

In November last year, I got the opportunity to present at AIOUG Sangam 2017. My session was titled “Harness the Power of Data in a Big Data Lake”. The abstract is below –

Data lake is a relatively new term compared to all the fancy ones coined since the industry realized the potential of data. The industry is planning its way towards adopting the big data lake as the key data store, but what challenges it is the traditional approach. Traditional approaches to data pipelines, data processing, and data security still hold good, but architects do need to go the extra mile while designing a big data lake.

This session will focus on this shift in approach. We will explore the roadblocks in setting up a data lake and how to size the key milestones. The health and efficiency of a data lake largely depend on two factors – data ingestion and data processing. Attend this session to learn key practices of data ingestion under different circumstances. Data processing for a variety of scenarios will be covered as well.

Here is the link to my presentation –

Sangam17_DataInDataLake

The session was an excerpt from my upcoming book on the enterprise data lake. The book should be out within a month from now and will be available at all online bookstores.

Amazon – https://www.amazon.com/Practical-Enterprise-Data-Lake-Insights/dp/1484235215/

When designing an enterprise data lake you often hit a roadblock when you must leave the comfort of the relational world and learn the nuances of handling non-relational data. Starting from sourcing data into the Hadoop ecosystem, you will go through stages that can bring up tough questions such as data processing, data querying, and security. Concepts such as change data capture and data streaming are covered. The book takes an end-to-end solution approach in a data lake environment that includes data security, high availability, data processing, data streaming, and more.
Each chapter includes application of a concept, code snippets, and use case demonstrations to provide you with a practical approach. You will learn the concept, scope, application, and starting point.

What You’ll find in the book

  • Get to know data lake architecture and design principles
  • Implement data capture and streaming strategies
  • Implement data processing strategies in Hadoop
  • Understand the data lake security framework and availability model

Enjoy reading!

Saurabh

My session at #AIOUG #Sangam16 – Transition from Oracle DBA to Big Data Architect

Big thanks to all those who turned up for my early morning session on Saturday, Nov 12th, 2016. I know it was a tough call after a week’s work, but thanks for making the right decision. A full house is an extreme delight for a speaker.

You can download the session deck either from the Sangam website or from the link below.

Sangam16_TransformIntoBigDataArchitect

I hope the session was useful to all. If you have any doubts or comments, feel free to comment below. If you have feedback on the session, I would surely love to hear it.

I love to feature in AIOUG conferences and events, Sangam being one of them. In addition to attending sessions, we get a chance to meet and greet geeks and techies from around the world. I must confess that I get to meet many of them only at events like this. I was fortunate to meet Arup Nanda, Syed Jaffer Hussain, Aman, Nassyam, Sai, Satyendra, and Kuassi Mensah, and I had the pleasure of spending time with Oracle colleagues and many others during these two days.

Sangam 2016 was huge; it continues to grow every year, with 100+ sessions in two days and distinguished speakers from all over the world. Thanks to the AIOUG team and volunteers who coordinated and managed the event fairly well.

Thanks again!

Query materialized view refresh timestamp

Unlike my lengthy posts, this is really a quick one.

Alright, so how do you get to know when your materialized view was refreshed? Well, no biggies. There are a bunch of dictionary views that capture the refresh date, but none of them gives you the timestamp. For fast-refresh materialized views, you can work with SCN- or timestamp-based MView logs, but for complete-refresh MViews this can be tricky. Here is a quick and easy way to retrieve the timestamp information.

You can query the ALL_MVIEW_ANALYSIS dictionary view, which captures the System Change Number (SCN) of the last refresh operation (i.e., the refresh start time). Use the SCN_TO_TIMESTAMP function to translate the SCN into a timestamp. Here is the query –

SELECT owner,
       mview_name,
       last_refresh_scn,
       SCN_TO_TIMESTAMP (last_refresh_scn) refresh_timestamp
FROM   all_mview_analysis
WHERE  mview_name = '<MVIEW_NAME>';

Try it yourself. I recommend this dictionary view because it also tells you the time taken by a full or fast refresh (FULLREFRESHTIM/INCREFRESHTIM). Don’t miss this nice article, “How long did Oracle materialized view refresh run?”, by Ittichai C.