Achieve Data Democracy with effective Data Integration

For the fourth straight year, I was pleased to be part of AIOUG ODevCYatra as a speaker. This year, AIOUG re-branded the long-running OTNYathra as ODevCYatra to align its focus with the growing developer community. The thought and execution remained the same though – 7 cities, 35 speakers, and a very tight schedule. I was part of the Bengaluru leg of AIOUG ODevCYatra.

My session

“Achieve Data Democratization with effective Data Integration”

The data lake is a relatively new term compared to the other buzzwords that have appeared since the industry realized the potential of data. Enterprises are bending over backwards to build a stable data strategy and take a leap towards data democratization. Traditional approaches to data pipelines, data processing, and data security still hold true, but architects need to go an extra mile while designing a big data lake. This session will focus on data integration design considerations. We will discuss the relevance of data democratization in the organizational data strategy.

Learning objectives

✓ Data Lake architectural styles
✓ Implementing a data lake for democratization
✓ Data Ingestion framework principles

Response – Trust me, I had expected a very low turnout because the topic sounded out of the blue amidst a pro-Oracle event. At an event focused on Oracle cloud offerings, in-memory, tuning, and the autonomous data warehouse, it was a risk to take on big data and data lakes (not at Sangam though!). The hunt was on for classical data architects until noon. However, to my surprise, ten attendees left the likes of the DBIM real-time analytics and cloud migration sessions to attend mine. I'm thankful to all those who attended my session, and I think we had some very healthy discussions around data lakes, data governance, and data democracy.

Session download – My session deck is available here – http://www.aioug.org/ODevCYatra/2018/SaurabhGupta_ODevCYatra-DataInDataLake.pdf

If you have any questions regarding the session or its content, feel free to comment or mail me. I will be happy to discuss.

See you all again in events like this!


What to expect from your Data Lake

Big data IT is driven by competition. Organizations want to exploit the power of data analytics at a manageable cost to stand out from their competitors. A data lake provides a scalable framework to store massive volumes of data and churn out analytical insights that can help in effective decision making and growing new markets. It shifts the focus from a protracted and laborious infrastructure planning exercise to a data-driven architecture. The data ingestion and processing framework becomes the cornerstone, rather than just the storage.

Let us understand the key characteristics of an enterprise data lake –

  1. The data lake must be built using a scalable and fault-tolerant framework – the data lake concept focuses on simplification and cost reduction without compromising quality and availability. Apache Hadoop provides cost benefits by running on commodity hardware and brings resiliency and scalability as well.
  2. Availability – data in the data lake must be accurate and available to all consumers as soon as it is ingested.
  3. Accessibility – a shared access model ensures data can be accessed by all applications. Unless required at the consumption layer, data shards are not a recommended design within the data lake. Another key aspect is data privacy and exchange regulations; the data governance council is expected to formulate norms on data access, privacy, and movement.
  4. Data governance policies must not enforce constraints on data – data governance intends to control the level of democracy within the data lake. Its sole purpose is to maintain data quality through audits, compliance, and timely checks. Data flow, whether by its size or quality, must not be constrained through governance norms.
  5. Data in the data lake should never be disposed of. The data-driven strategy must define steps to version the data and handle deletes and updates from the source systems.
  6. Support for in-place data analytics – the data lake provides a single view across all the source systems to empower in-house data analytics. Downstream applications can extract data from the consumption layer to feed disparate applications.
  7. Data security – security is a critical piece of the data lake. Enterprises can start with a reactive approach and strive for proactive ways to detect vulnerabilities. Data-centric security models should be capable of building a real-time risk profile that can help detect anomalies in user access or privileges.
  8. Archival strategy – as part of the Information Lifecycle Management (ILM) strategy, data retention policies must be created. The retention of data that resides in a relatively “cold” region of the lake must be given thought. Storage is not a big concern in the Hadoop world, but storage consumption multiplied by new data exploration makes it wise to formulate a data archival strategy (see the sketch after this list).
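
As an illustration of such a retention policy, here is a minimal Hive SQL sketch that prunes an aged partition from a hypothetical raw-zone table. The schema, table, and partition column names are assumptions chosen only for this example; in practice the aged data would typically be archived to cheaper storage before being dropped.

-- Hypothetical raw-zone table, assumed to be partitioned by ingestion date
SHOW PARTITIONS raw_zone.web_logs;

-- Drop a partition that has crossed the retention window
-- (archive its files to cheaper storage first, per the ILM policy)
ALTER TABLE raw_zone.web_logs
  DROP IF EXISTS PARTITION (ingest_date = '2016-01-01');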

Another perspective that comes with building a data lake is simplified infrastructure. Organizations spend a lot building a reliable stack for each nature of data and usually follow a best-fit approach to manage it. For example, relational databases have been the industry de facto standard for structured data for ages, while traditional file systems have been widely used for the semi-structured and unstructured data coming from sensors, web logs, and social media. At the end of the day, they end up with data marts bracketed by data structure, use cases, and customer needs, but incur high capital and operational expenses.

P.S. The post is an excerpt from Apress’s “Practical Enterprise Data Lake Insights”

 

#Apress “Practical Enterprise Data Lake Insights” – Published!

Hello All,

It gives me immense pleasure to announce the release of our book “Practical Enterprise Data Lake Insights” with Apress. The book takes an end-to-end solution approach to a data lake environment, covering data capture, processing, security, and availability. Credits to the co-author of the book, Venkata Giri, and the technical reviewer, Sai Sundar.

The book is now available through various channels as a subscription, in print (on request!), and as an e-book (e.g., Amazon/Kindle, Barnes & Noble/Nook, Apress.com). Below are the Apress and Amazon links –

Apress – https://www.apress.com/gb/book/9781484235218

Amazon – https://www.amazon.com/Practical-Enterprise-Data-Lake-Insights/dp/1484235215/

Thank you for all your confidence, support, and encouragement. Thanks to Monica Caldas, CIO (GE Transportation), for helping us with the foreword.

Brief about the book –

When designing an enterprise data lake, you often hit a roadblock when you must leave the comfort of the relational world and learn the nuances of handling non-relational data. Starting from sourcing data into the Hadoop ecosystem, you will go through stages that can bring up tough questions such as data processing, data querying, and security. Concepts such as change data capture and data streaming are covered. The book takes an end-to-end solution approach in a data lake environment that includes data security, high availability, data processing, data streaming, and more.

Each chapter includes application of a concept, code snippets, and use case demonstrations to provide you with a practical approach. You will learn the concept, scope, application, and starting point. Use this practical guide to successfully handle the challenges encountered when designing an enterprise data lake and learn industry best practices to resolve issues.

 

What You’ll Learn:

  • Get to know data lake architecture and design principles
  • Implement data capture and streaming strategies
  • Implement data processing strategies in Hadoop
  • Understand the data lake security framework and availability model

Grab your copies fast. Enjoy reading!

Saurabh

Harness the Power of Data in a Big Data Lake

In November last year, I got the opportunity to present at AIOUG Sangam 2017. My session was titled “Harness the Power of Data in a Big Data Lake”. The abstract is below –

The data lake is a relatively new term compared to the other buzzwords that have appeared since the industry realized the potential of data. The industry is planning to adopt the big data lake as its key data store, but what challenges it is the traditional approach. Traditional approaches to data pipelines, data processing, and data security still hold good, but architects do need to go an extra mile while designing a big data lake.

This session will focus on this shift in approaches. We will explore the roadblocks while setting up a data lake and how to size the key milestones. The health and efficiency of a data lake largely depend on two factors – data ingestion and data processing. Attend this session to learn key practices of data ingestion under different circumstances. Data processing for a variety of scenarios will be covered as well.

Here is the link to my presentation –

Sangam17_DataInDataLake

The session was an excerpt from my upcoming book on the enterprise data lake. The book should be out within a month from now and will be available at all online bookstores.

Amazon – https://www.amazon.com/Practical-Enterprise-Data-Lake-Insights/dp/1484235215/

When designing an enterprise data lake, you often hit a roadblock when you must leave the comfort of the relational world and learn the nuances of handling non-relational data. Starting from sourcing data into the Hadoop ecosystem, you will go through stages that can bring up tough questions such as data processing, data querying, and security. Concepts such as change data capture and data streaming are covered. The book takes an end-to-end solution approach in a data lake environment that includes data security, high availability, data processing, data streaming, and more.
Each chapter includes application of a concept, code snippets, and use case demonstrations to provide you with a practical approach. You will learn the concept, scope, application, and starting point.

What You’ll find in the book

  • Get to know data lake architecture and design principles
  • Implement data capture and streaming strategies
  • Implement data processing strategies in Hadoop
  • Understand the data lake security framework and availability model

Enjoy reading!

Saurabh

My session at #AIOUG #Sangam16 – Transition from Oracle DBA to Big Data Architect

Big thanks to all those who turned up for my early morning session on Saturday, Nov 12, 2016. I know it was a tough call after a week’s work, but thanks for making the right decision. A full house is a real delight for a speaker.

You can download the session deck either from the Sangam website or from the link below.

Sangam16_TransformIntoBigDataArchitect

I hope the session was useful to all. If you have any questions or comments, feel free to comment below. If you have feedback on the session, I would surely love to hear it.

I love to feature in AIOUG conferences and events, Sangam being one of them. In addition to attending sessions, we get a chance to meet and greet geeks and techies from around the world. I must confess that I get to meet many of them only at events like this. I was fortunate to meet Arup Nanda, Syed Jaffer Hussain, Aman, Nassyam, Sai, Satyendra, and Kuassi Mensah, and I had the pleasure of spending time with Oracle colleagues and many others during these two days.

Sangam 2016 was huge; it continues to grow manifold. 100+ sessions in two days and distinguished speakers from all over the world. Thanks to the AIOUG team and volunteers who coordinated and managed the event very well.

Thanks again!

Query materialized view refresh timestamp

Unlike my lengthy posts, this is really a quick one.

Alright, so how do you find out when your materialized view was last refreshed? Well, no biggie. There are a bunch of dictionary views that capture the refresh date, but none of them give you the timestamp. For fast refresh, you can work with SCN- or timestamp-based materialized view logs, but for complete-refresh materialized views this can be tricky. Here is a quick and easy way to retrieve the timestamp information.

You can query the ALL_MVIEW_ANALYSIS dictionary view, which captures the system change number (SCN) of the last refresh operation (i.e. the refresh start time). Use the SCN_TO_TIMESTAMP function to translate the SCN into a timestamp. Here is the query –

SELECT owner,
       mview_name,
       last_refresh_scn,
       SCN_TO_TIMESTAMP(last_refresh_scn) AS refresh_timestamp
  FROM all_mview_analysis
 WHERE mview_name = '<MVIEW_NAME>';   -- placeholder: your materialized view name, in upper case

Try it yourself. I recommend this dictionary view as it also tells you the time taken by a fast or complete refresh (FULLREFRESHTIM/INCREFRESHTIM), as shown in the sketch below. Don’t miss the nice article “How long did Oracle materialized view refresh run?” by Ittichai C.
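
For reference, here is a minimal follow-up sketch querying those timing columns from ALL_MVIEW_ANALYSIS; the materialized view name is again a placeholder.

SELECT mview_name,
       fullrefreshtim AS full_refresh_time,   -- time taken by the last complete refresh
       increfreshtim  AS fast_refresh_time    -- time taken by the last fast (incremental) refresh
  FROM all_mview_analysis
 WHERE mview_name = '<MVIEW_NAME>';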

I/O Resource Management on Exadata

Consolidation is a key enabler for Oracle database deployments on both public and private clouds. Consolidation reduces the overall IT costs by optimizing the operational and capital expenses. In addition, it enhances the effective utilization of cloud resources. The Exadata database machine has been optimized to run schemas and databases with mixed workloads together, making it the best platform for consolidation.

Organizations follow different approaches to consolidate database instances. Some of the prominent approaches are virtualization, schema-based consolidation, and database aggregation on a single high-end physical server. Oracle Database 12c introduced the multitenant architecture to allow secure consolidation of databases in the cloud and achieve benefits such as tenant isolation, manage-many-as-one capability, and enhanced customer satisfaction.

For effective database consolidation, Exadata makes use of Oracle resource management (database resource management, network resource management, and I/O resource management). Exadata IORM enhances the stability of mission-critical applications and ensures the availability of all databases that share the server resources. The I/O resource plan provides the framework for queuing low-priority requests while issuing high-priority requests. This post will focus on configuring, enabling, and monitoring IORM plans on Exadata database machines.

Oracle Database Resource Manager

On a database server, resources are allocated by the operating system, which may be inappropriate and inefficient for maintaining database health. Server stability and the database instance are impacted by high CPU load, resulting in sub-optimal performance of the database. Oracle Database Resource Manager, first introduced in Oracle Database 8i, can help by governing the allocation of resources to the database instance and assuring efficient utilization of CPU resources on the server. It is a database module that allocates resources to resource consumer groups as per a set of plan directives in a fair way. A resource consumer group comprises database sessions with similar resource requirements.

A resource plan can manage the allocation of CPU, disk I/O, and parallel servers among schemas in a single database or across multiple databases in a consolidated environment. An intra-database plan can be created to manage allocation across multiple schemas or services within a single database. On an Exadata database machine, disk I/O can be managed across multiple databases using the I/O Resource Manager (IORM), or inter-database plan. Oracle Database Resource Manager is a feature of Oracle Database Enterprise Edition; however, starting with Oracle Database 11g, it can also be used in Standard Edition to manage maintenance tasks through the default maintenance plan. A minimal intra-database plan sketch follows.
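
As an illustration, here is a minimal sketch of an intra-database resource plan built with the DBMS_RESOURCE_MANAGER package. The plan name, consumer group names, and percentage shares are assumptions chosen only for this example.

BEGIN
  -- All changes are staged in a pending area and validated before submission
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

  -- Hypothetical plan and consumer groups used for illustration
  DBMS_RESOURCE_MANAGER.CREATE_PLAN(
    plan    => 'DAYTIME_PLAN',
    comment => 'Example plan favouring OLTP over reporting');

  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
    consumer_group => 'OLTP_GROUP',
    comment        => 'Interactive OLTP sessions');

  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
    consumer_group => 'REPORTING_GROUP',
    comment        => 'Batch and reporting sessions');

  -- Plan directives: 70/20 CPU shares at level 1, remainder to OTHER_GROUPS
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'DAYTIME_PLAN',
    group_or_subplan => 'OLTP_GROUP',
    comment          => 'Priority to OLTP',
    mgmt_p1          => 70);

  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'DAYTIME_PLAN',
    group_or_subplan => 'REPORTING_GROUP',
    comment          => 'Smaller share for reporting',
    mgmt_p1          => 20);

  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'DAYTIME_PLAN',
    group_or_subplan => 'OTHER_GROUPS',
    comment          => 'Mandatory catch-all group',
    mgmt_p1          => 10);

  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/

-- Activate the plan for the instance
ALTER SYSTEM SET resource_manager_plan = 'DAYTIME_PLAN';

The inter-database IORM plan itself is defined on the Exadata storage cells through CellCLI, which the complete post linked below covers.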

This post focuses on the configuration of IORM plans on the Exadata database machine. In the post, we shall discuss how to manage disk I/O, flash usage, and the standby database using IORM.

Read the complete post at Community.Oracle here – https://community.oracle.com/docs/DOC-998939