Why replicate a data lake? Building an enterprise data lake demands heavy weightlifting in terms of cost, planning, technical design, architecture, and operational excellence. Building another infrastructure as a standby will shoot up both capex and opex. Therefore, before proceeding to the planning phase, one must have a strong justification and clear objective behind setting up a replica (s). At a high level, two parameters justify a replication exercise – availability and recoverability.
You tend to achieve availability by having additional redundant supply of resources for tolerating a fault or an outage without incurring any or minimally accepted loss to the business. Recoverability can be achieved through an alternate standby site that holds “as-of-outage” state of data and can be restored within permissible time limits.
High availability can be achieved by planning high availability of member components of a site. Recoverability addresses bigger concern when entire site has to be failed over to its standby. Therefore, availability happens to be an eventual subset of recoverability. An active-passive site is an optimal approach that achieves recoverability and availability. An active-active setup is an idealistic state of ‘nirvana’ by enabling nearly active replicas to the business users, while treating each other as standby.
Disaster recovery factors
At a high level, disaster recovery strategy involves a backup site and switchover strategy. The nature of backup for disaster recovery is slightly different as the expectation from disaster recovery is to cope up from critical incidents. Let us list down the factors that play their part in formulating an efficient disaster recovery strategy.
- Understand data sources and data awareness – While setting up a disaster recovery site, it is always a better idea to understand ingredients of data lake. How critical are the system of records? Where do the source system exist? What is the impact if a data mirror layer is lost?
- Copying versus mirroring – Backup mode is an essential parameter of restoration exercise from disaster recovery site. Mirror images restore faster than backup copies.
- Backup frequency – The data change factor and service level agreements determine the frequency at which data flows into the disaster recovery site.
Dual Path ingestion vs Primary replication
Dual path is a two-way (T-like) ingestion from source to primary as well as standby at the same time. What it means is that a single capture process will be integrated into two different targets. Note that only mirror is maintained on standby and not the rest of the data zones.
On the other hand, primary replication is a one way replication from primary site to its standby. Benefit of primary replication is that all the layers of data lake are copied over to standby.
Data replication strategies
- DistCp – DistCp or Distribution Copy is one of the common data copy solutions for hadoop file systems; either within a data center or across remote data centers. It uses mapreduce under the hoods for data distribution and recovery. It translates list of directories and files under a namespace into map tasks and taskTrackers copy them over to target namespace.
The DISTCP utility is a mapreduce operation. It consumes mapper slots which may impact business operations in a data lake. In addition, since each datanode on the source site must have write access on target sites, the communication pattern between the two clusters is SN*TN [SN is the count of source data nodes, TN is the count of target data nodes]. In case the communication channel is not one-on-one configured among the data nodes, data replication from source to target may get impacted.
Another key consideration of distcp usage is hadoop version on source and target. With hdfs:// connection protocol, the source and target versions must be same. To switch on version independent transfer between source and target sites, enable data transfer over HTTP by using wedhdfs://. Another method of enabling HTTP-based transfer is using httpfs:// protocol, which uses HTTPfs proxy daemon for cluster communication. However, keep in mind that both webhdfs:// and httpfs:// need to be configured manually and are relatively slower than native hdfs:// connection.
- HDFS Snapshots – snapshots represent state of data lake at a point in time. HDFS snapshots can be build an as-is image of data lake. Also, they can be used to stitch data during accidental losses.
- Hive metastore replication – Hive supports metastore replication to other clusters with simple configuration in hdfs-site.xml file. Although custom replication frameworks are possible, but by default, system uses apache.hive.hcatalog.api.repl.exim.EximReplicationTaskFactory implementation for data capture, movement, and ingestion commands.
- Kafka mirror maker – Apache Kafka service that acts as a consumer in active kafka cluster and producer to standby kafka cluster.
Companies with universal outreach strive for data availability for globally spread-out teams. They require a robust data replication strategy that can encompass geographical challenges and possess the ability to handle voluminous data sets in near real-time (real-time would be incredibly welcomed though!).
We will continue our discussion in Part-2 of this series.