Do You Really Need a 'Data Lake' in Your Back Yard?

By Joe McKendrick, June 16, 2014, 9:46 a.m. EDT

Vendors in the data management space lately have been flinging about a new term, “data lake,” to describe the vast pools of data that are forming across enterprises.

What, exactly, is a data lake, and how does it differ from a virtualized data layer, or cloud-based storage? While it sounds an awful lot like the latest marketing buzzword, it actually offers an alternative to enterprise data warehouses. And, let's face it, a “lake” evokes much more positive imagery than that of a “warehouse.”

Data lakes are part of the Apache Hadoop ecosystem, serving as low-cost repositories for data of all types and sizes. Because data can be poured into them quickly with little fuss or muss, they are relatively inexpensive to operate, unlike data warehouses, which require extract, transform and load (ETL) processing, cleaning and normalization of data.


“Data must be converted into recognizable formats, a laborious, time-consuming process that becomes increasingly impractical as data collections grow larger,” note Mark Herman and Michael Delurey, both of Booz Allen Hamilton, in a recent paper.

Essentially, all data coming into the organization, regardless of whether it's structured or unstructured, is assembled into a single, large table. Herman and Delurey liken this centralized table to a gigantic spreadsheet with billions of rows and billions of columns. This makes all data available at once to any and all queries.
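To make the "gigantic spreadsheet" picture concrete, here is a minimal Python sketch of a single sparse table that accepts records of any shape and can be queried by column. It is an illustration only, not how Hadoop or any vendor product is implemented; the class name `WideTable`, its methods, and the sample insurance records are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


class WideTable:
    """Toy stand-in for the single large table: one row per record, and a
    column holds a value only for the rows that actually supplied one."""

    def __init__(self) -> None:
        # row id -> {column name -> value}; absent columns are simply missing
        self.rows: Dict[str, Dict[str, Any]] = defaultdict(dict)

    def ingest(self, row_id: str, record: Dict[str, Any]) -> None:
        # No upfront schema, cleaning or normalization: store fields as they arrive.
        self.rows[row_id].update(record)

    def columns(self) -> List[str]:
        # The set of columns grows as new kinds of records are poured in.
        return sorted({col for row in self.rows.values() for col in row})

    def query(self, column: str) -> Dict[str, Any]:
        # Any query can reach any row, whatever source the row came from.
        return {rid: row[column] for rid, row in self.rows.items() if column in row}


# Hypothetical records from three unrelated sources
table = WideTable()
table.ingest("claim-1", {"policy_id": "P100", "agent_note": "Roof damage after hail"})
table.ingest("trip-7", {"policy_id": "P100", "max_speed_mph": 71})
table.ingest("tweet-3", {"text": "Great claims service!"})

print(table.columns())
print(table.query("policy_id"))
```

In this sketch, the claim note and the telematics reading share a `policy_id` column even though they arrived with different fields, which is the sense in which all data becomes available to any query at once.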

“One of the main appeals of data lakes is that they incorporate data from any source, from social media to clickstream data, into a single location that empowers enterprises to capitalize on this information,” write Cesar Rojas of Teradata and Audrey Ng of Hortonworks in a new report published by The Data Warehousing Institute (TDWI).

What's the advantage of data lakes to insurance companies? Much of the data that is valuable to policyholder application and claims administration processes is unstructured: notes from agents, call center notes, photos of properties before and after damage, sensor data from telematics, geospatial data and social media data, to name a few. The ability to bring all this information together, rather than leaving it scattered across separate systems such as content management and policy administration, may enable faster access at lower cost.

In a separate report, Teradata and Hortonworks provide key steps for data lake development:

1) Get the plumbing in place. As Hadoop is rolled out, the data lake can start as a small pilot project. In the meantime, everyone learns how to make this new way of looking at data work.

2) Build transformation and analytics muscle. “The second stage involves improving the ability to transform and analyze data,” the report notes. “In this stage, companies find the tools that are most appropriate to their skillset and start acquiring more data and building applications. Capabilities from the enterprise data warehouse and the data lake are used together.”

3) Broaden the operational impact. “The third stage involves getting data and analytics into the hands of as many people as possible,” the report states.

4) Add enterprise capabilities. “In this highest stage of the data lake, enterprise capabilities are added to the data lake. Few companies have reached this level of maturity, but many will as the use of big data grows, requiring governance, compliance, security, and auditing.”

Joe McKendrick

Dig In contributor