In this video:
Prepare for your dive into the deep waters of the data lake with IBM Distinguished Engineer and Master Inventor Mandy Chessell, IBM Financial Services Leader for Analytics Jon Asprey, IBM Europe Head of Data and Strategic Partnerships Lauren Walker, and Ron Collins, IBM UK Head of Cognitive Business Systems.
What is a data lake?
A data lake is a storage repository that holds an enormous amount of raw data in native format until it is accessed. The term data lake is usually associated with Hadoop-oriented object storage in which an organization’s data is loaded into the Hadoop platform and then business analytics and data-mining tools are applied to the data where it resides on the Hadoop cluster.
The term data lake is increasingly being used to describe any large data pool in which the schema and data requirements are not defined until the data is queried.
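This "schema-on-read" idea is easy to sketch in code. The snippet below is an illustrative toy, not a real lake: raw JSON event lines are loaded with no schema imposed, and the required fields are only decided at the moment a question is asked.

```python
import json
import io

# Hypothetical raw events, landed in the lake exactly as they arrived.
# Note the second record lacks an "ms" field -- nothing rejects it at load time.
raw_events = io.StringIO(
    '{"user": "a", "action": "click", "ms": 120}\n'
    '{"user": "b", "action": "view"}\n'
    '{"user": "a", "action": "click", "ms": 95}\n'
)

# Ingest: parse, but impose no structure or requirements.
records = [json.loads(line) for line in raw_events]

# Query time: the "schema" (which fields are required) is defined here,
# and only for the data this question actually uses.
clicks = [r["ms"] for r in records if r.get("action") == "click" and "ms" in r]
avg_ms = sum(clicks) / len(clicks)
print(avg_ms)  # 107.5
```

In a warehouse, the second record would have been rejected or transformed at load; here it simply sits unused until some query needs it.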
Isn’t a data lake the same as a data warehouse?
Not really. Each is optimized for a different purpose.
For example, a hierarchical data warehouse stores data in files or folders, while a data lake uses a flat architecture. In a data lake, each data element is assigned a unique identifier and tagged with a set of extended metadata tags, so that when a question arises, the lake can be queried for relevant data and that smaller set of data can then be analyzed to find an answer.
Can a lake do things a warehouse can’t?
Because a lake stores data in its native format, it can hold all kinds of data more cheaply and ingest it more quickly. The major overhead – defining data structure and requirements – comes only when the data is needed, and only for the data that is used.
The types of data that can be stored in a lake are structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).
Why is this important for developers?
Because, by many measurements, only 20 percent of the data we deal with today is structured, and as more types of devices join the Internet of Things, more types of data pop into existence. That means that, to create a modern application, the developer must not only accommodate the different types of data but also use them together to create a coherent picture for the user of the app.
Being able to access a data lake effectively means your application has access to vast amounts of raw data, giving you a sandbox perfect for developing and testing your app’s analytical model and then moving it quickly into production. It provides a broader range of user, system, and behavioral input variables to refine and improve your software over time. It also provides an almost endless library of data transformations and queries that have been tested and monitored – in other words, a list of what worked, what didn’t, and why.
This expanded, self-service access to data can accelerate analytics functions which means your data-dependent application moves faster.
Let’s look at the videos.
Meet this aquatic wonder
Mandy overviews the content contained in this series of videos and explains her definition of a data lake.
Increasing your rate of innovation
Jon explores areas of agility and innovation that open up as the result of a data lake, and the key considerations in constructing data lakes and maintaining their effectiveness. For the developer, you get the business person’s perspective on the power of the data lake, including the use of the agile process – rapid generation and testing of ideas and prototypes. For programmers and designers, this sounds familiar. You need to be able to get your hands on the right kind of data to determine whether testing and prototyping are successful.
The data governance issue
Mandy explains that governance is not just about data quality control, but also needs to incorporate the value delivered to the business.
To govern a data source, first you have to have a solid definition of what it is you’re trying to govern. For an IT person, it might be technology interaction and data movement; the business person may look at controlling the process of innovation and data usage. The way to build a data lake governance system is to look at each user, determine how to create a control system for each, and then balance all the differing control systems into a single one.
For developers whose apps access data in the lake, there will also be a system of balance and control, and they need to understand that governance in order to design software that’s optimized to access the data.
Getting an overall perspective
Lauren illustrates the value of a data lake to deliver new customer insights by combining your company data with key external sources. She discusses a simple journey between rising in the morning until you get to the office, highlighting all the digital “exhaust” you’ve put into the networked ecosphere and how so many companies are counting on this enormous volume of disparately structured data to get insights about your behavior.
“Fine for marketing people,” the typical programmer would say. “But what does that mean to me?” Simple. You make the app that this hypothetical customer uses as part of her everyday life. You want it to be faster and more responsive to each individual user’s needs. One way of doing that is to understand the type of data your user chooses the most and to be able to configure the app on the fly to be more receptive to that type of data, whether it is highly structured or totally raw. In essence, you want to make your app responsive to the structure of the data lake.
Building the correct architecture
Ron Collins outlines the key decisions to make at the start of a data lake project to ensure the outcome remains agile, intuitive, and cost-effective. Ron helps clients make the transition from data warehousing to the data lake, and one thing he’s noticed in his engagements is that getting the architecture right is key to making a smooth dive into the lake.
“We all remember the old days when we’d spend six months architecting and designing … clearly we can’t do that anymore and [what’s more,] we shouldn’t do that.”
Ron emphasizes that we need to be able to deliver a data solution at the speed of business. The key to that is to make strong architecture decisions early in the process. How many kinds of data will you have? Who governs the data? Where do the different types fit? But the main challenge we face, according to Ron, is how to build as much structure as possible up front without slowing down the project or hampering the agility the Hadoop-style data lake will give us.
Resources for you
- Explore the governed data lake approach
- Compare Big SQL vs Spark SQL at 100TB for data lakes
- Building big data analytics solutions in the cloud
- Maximize your data lake with a cloud or hybrid approach
- Don’t drown in the big data lake
- Explore the data lake at the IBM Big Data Hub blog
- Visit the BigInsights for Hadoop community