Big Data IT is driven by competition. Organizations want to exploit the power of data analytics at a manageable cost to stand out from their competitors. A data lake provides a scalable framework to store massive volumes of data and churn out analytical insights that support effective decision-making and help grow new markets. It shifts the focus from a protracted and laborious infrastructure-planning exercise to a data-driven architecture. The data ingestion and processing framework, rather than storage alone, becomes the cornerstone.
Let us understand the key characteristics of an enterprise data lake:
- Scalability and fault tolerance – the data lake must be built on a scalable, fault-tolerant framework. The data lake concept focuses on simplification and cost reduction without compromising quality or availability. Apache Hadoop delivers cost benefits by running on commodity machines while also providing resiliency and scalability.
- Availability – data in the data lake must be accurate and available to all consumers as soon as it is ingested.
- Accessibility – a shared access model ensures data can be accessed by all applications. Unless required at the consumption layer, data shards are not a recommended design within the data lake. Another key aspect is data privacy and exchange regulation; the data governance council is expected to formulate norms on data access, privacy, and movement.
- Data governance policies must not enforce constraints on data – data governance is intended to control the level of democracy within the data lake. Its sole purpose is to maintain quality through audits, compliance, and timely checks. Data flow, whether by its size or quality, must not be constrained by governance norms.
- Data in the data lake should never be disposed of. The data-driven strategy must define steps to version the data and handle deletes and updates from the source systems.
- Support for in-place data analytics – the data lake presents a single view across all source systems to empower in-house data analytics. Downstream applications can extract data from the consumption layer to feed disparate applications.
- Data security – security is a critical piece of the data lake. Enterprises can start with a reactive approach and strive toward proactive ways of detecting vulnerabilities. A data-centric security model should be capable of building real-time risk profiles that help detect anomalies in user access or privileges.
- Archival strategy – as part of an Information Lifecycle Management (ILM) strategy, data retention policies must be created. The retention of data that resides in relatively “cold” regions of the lake deserves careful thought. Although storage is comparatively cheap in the Hadoop world, storage consumption multiplied by ongoing data exploration makes it wise to formulate a data archival strategy.
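The “never dispose of data” principle above can be sketched as an append-only version log, where updates and deletes arriving from source systems become new versioned records rather than in-place changes. This is a minimal illustration, not a production design; the record fields and class names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    key: str               # business key from the source system
    value: Optional[dict]  # None marks a soft delete (tombstone)
    version: int

class VersionedStore:
    """Append-only store: updates and deletes create new versions."""

    def __init__(self):
        self._log = []  # the full history is retained, never disposed of

    def ingest(self, key, value):
        # Each write for a key appends the next version of that key.
        version = sum(1 for r in self._log if r.key == key) + 1
        self._log.append(Record(key, value, version))

    def delete(self, key):
        # A delete from the source becomes a tombstone version.
        self.ingest(key, None)

    def latest(self, key):
        # Scan backwards so the most recent version wins.
        for r in reversed(self._log):
            if r.key == key:
                return r
        return None

    def history(self, key):
        # Every version of the key, oldest first.
        return [r for r in self._log if r.key == key]
```

A consumer asking for the latest state of a deleted key receives the tombstone, while the full history remains queryable for audits or reprocessing.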
Another perspective that comes with building a data lake is simplified infrastructure. Organizations spend a lot building reliable stacks for different kinds of data and usually follow a best-fit approach to manage each one. For example, relational databases have been the industry de facto standard for structured data for ages, while traditional file systems were widely used for the semi-structured and unstructured data coming from sensors, web logs, and social media. At the end of the day, they have data marts bracketed by data structure, use cases, and customer needs, but these incur high capital and operational expenses.
P.S. The post is an excerpt from Apress’s “Practical Enterprise Data Lake Insights”