Data Lake Webinar Recap

Last Thursday I presented the webinar “From Pointless to Profitable: Using Data Lakes for Sustainable Analytics Innovation” to about 300 attendees. While we don’t consider webinar polling results valid data for research publication (too many concerns about survey sampling), webinar polls can offer some interesting directional insight.

I asked the audience two questions. First, I asked what the data lake concept meant to them. There were some surprises:
[Figure: data lake webinar poll results, question 1]

The audience mostly expects a data lake to be a platform for self-service BI and analytics (36%), but also a staging area for downstream analytics platforms (25%). It’s not unreasonable to combine these two: the functionality of a data lake is largely the same in both cases. The users and the tools differ for each use case, but it’s still the same data lake. A realistic approach is to think of these two use cases as a continuum. Self-service users first identify new or existing data sources that support a new result. Then, those data sources are processed, staged and moved to an optimized analytics platform.

It was reassuring to see smaller groups of respondents considering a data lake for a data warehouse replacement (9%) and as a single source for all operational and analytical workloads (15%). I expected these numbers to be higher based on overall market hype.

The second polling question asked what type of data lake audience members had implemented. Before I get into the results, I have to set some context. My colleague Svetlana Sicular identified three data lake architecture styles (see “Three Architecture Styles for a Useful Data Lake”):

  1. Inflow lake: accommodates a collection of data ingested from many different sources that are disconnected outside the lake but can be used together by being colocated within a single place.
  2. Outflow lake: a landing area for freshly arrived data available for immediate access or via streaming. It employs schema-on-read for the downstream data interpretation and refinement. The outflow data lake is usually not the final destination for the data, but it may keep raw data long term to preserve the context for downstream data stores and applications.
  3. Data science lab: most suitable for data discovery and for developing new advanced analytics models — to increase the organization’s competitive advantage through new insights or innovation.
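
The schema-on-read idea mentioned for the outflow lake can be easier to picture with a small example. What follows is only a minimal sketch in Python, not how any particular data lake product works: the raw records, the `SALES_SCHEMA` mapping and the `read_with_schema` helper are all made up for illustration. The point is that raw data lands untouched, and each downstream consumer applies its own schema when it reads.

```python
import json

# Raw events land in the lake as-is; nothing enforces a schema at write time.
# (Illustrative data only, not from the webinar.)
RAW_EVENTS = [
    '{"user": "a17", "amount": "19.99", "channel": "mobile"}',
    '{"user": "b42", "amount": "5", "referrer": "newsletter"}',
]

# Hypothetical schema chosen by one downstream consumer:
# field name -> function that converts the raw value.
SALES_SCHEMA = {"user": str, "amount": float}


def read_with_schema(raw_lines, schema):
    """Parse each raw record and keep only the fields the schema names."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({field: convert(record[field]) for field, convert in schema.items()})
    return rows


# The schema is applied only now, at read time.
sales = read_with_schema(RAW_EVENTS, SALES_SCHEMA)
```

A different consumer could read the same raw records with a different schema, which is why the outflow lake can keep raw data long term without committing to one downstream interpretation.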

With that context in place, I asked the audience about their implementation:
[Figure: data lake webinar poll results, question 2]

63% of respondents have yet to implement a data lake. That’s understandable. After all, they’re listening to a foundational webinar about the concept. The outflow lake was the most common architecture style (15%), and it’s also the type clients ask about most frequently. The inflow and data science lab architecture styles tied at 11%.

The audience also asked some excellent questions. Many asked about securing and governing data lakes, a topic I’m hoping to address soon with Andrew White and Merv Adrian.
