It’s been roughly a year since I started talking about DataOps. It was an accident, something I mentioned during a presentation on data engineering. But that slip attracted the interest of several vendors using the term, and I thought I was seeing the start of the next differentiating practice in data and analytics. I’m still waiting for that.
Amazon Web Services is making a habit of disrupting smaller enterprise software vendors. At its re:Invent conference, AWS caused quite a bit of pearl-clutching in various open source communities for its managed Apache Kafka service. The company was accused of strip-mining open source while failing to contribute back to the communities it was appropriating software from.
Last week, AWS went further by announcing a new DBMS with a MongoDB-compatible API (based on the 3.6 version). MongoDB responded predictably, but the Amazon DocumentDB announcement didn’t trigger the same reaction from the OSS community. I imagine there’s far less sympathy for MongoDB after it relicensed as a proprietary product. There have been several takes about what AWS’ announcement means for open source software, but I believe those miss the point. The point isn’t about open source. The point is about delivering what customers value and what they don’t.
The majority of customers simply don’t value open source. In certain cases, customers value their relationship with the vendor, but only when the vendor is an engineering partner rather than merely a rent-seeker, and those instances are exceedingly rare. What customers don’t value is operational opacity and complexity, especially for technologies with extremely limited skills available in the market.
Amazon Web Services hasn’t capitalized on open source software. It has capitalized on customer demand for removing complexity. Kafka and MongoDB won’t be the last OSS projects to get blindsided by the cloud providers. I can think of at least two other “open core” enterprise software companies with overly complex products that end users would love to have someone else manage.
Data visualizations increasingly inform our daily decisions. Traffic visualizations inform which route to take to the office; business intelligence dashboards indicate how you’re doing on projects and key performance indicators; and data collected by fitness trackers tells you how close you are (or aren’t) to reaching your weight loss or fitness goals.
Google and Walmart have announced a partnership where Google Home users can purchase Walmart’s products using voice ordering. As Recode points out, the intent of the partnership is to blunt Amazon’s initial foray into voice-based ordering. Coming at this from the data and analytics perspective, my first question is what happens to the customer data from, potentially, millions of orders?
Google’s partnership position is clearly more advantageous than Walmart’s. For Google, the data from voice-based ordering is likely to be combined with the existing customer profile it already has and will feed its advertising efforts. Obviously Walmart also gets the order data, but who else? Can Google resell that data to other parties? These details weren’t included in the partnership announcement, but Google’s terms and conditions make it clear that it can use the data however it sees fit.
As partnerships between consumer-centric companies proliferate, questions about who owns customer data and how it is used must become prominent for both the companies involved and the affected consumers. After all, consumers provide the data that drives revenues for companies like Google.
Last Thursday I presented the webinar “From Pointless to Profitable: Using Data Lakes for Sustainable Analytics Innovation” to about 300 attendees. While we don’t consider webinar polling results valid data for research publication (too many concerns about survey sampling), webinar polls can offer some interesting directional insight.
The audience’s expectation for a data lake is as a platform to support self-service BI and analytics (36%), but also as a staging area for downstream analytics platforms (25%). It’s not unreasonable to combine these two – the functionality of a data lake is largely the same in both cases. The users and the tools for each use case differ, but it’s still the same data lake. A realistic approach is to think of these two use cases as a continuum: self-service users first identify new or existing data sources that support a new result, then those data sources are processed, staged and moved to an optimized analytics platform.
It was reassuring to see smaller groups of respondents considering a data lake for a data warehouse replacement (9%) and as a single source for all operational and analytical workloads (15%). I expected these numbers to be higher based on overall market hype.
The second polling question asked what type of data lake audience members had implemented. Before I get into the results, I have to set some context. My colleague Svetlana Sicular identified three data lake architecture styles (see “Three Architecture Styles for a Useful Data Lake”):
- Inflow lake: accommodates a collection of data ingested from many different sources that are disconnected outside the lake but can be used together by being colocated within a single place.
- Outflow lake: a landing area for freshly arrived data available for immediate access or via streaming. It employs schema-on-read for the downstream data interpretation and refinement. The outflow data lake is usually not the final destination for the data, but it may keep raw data long term to preserve the context for downstream data stores and applications.
- Data science lab: most suitable for data discovery and for developing new advanced analytics models — to increase the organization’s competitive advantage through new insights or innovation.
63% of respondents have yet to implement a data lake. That’s understandable. After all, they’re listening to a foundational webinar about the concept. The outflow lake was the most common architecture style (15%), and it’s also the type clients are asking about most frequently. The inflow and data science lab styles tied at 11%.