Understanding the value of a Data Lake House

Published On: May 17, 2023

Overview

This episode of the CTO Advisor podcast discusses IBM Watsonx.Data, a part of IBM’s AI platform and toolset. The conversation focuses on the concept of a data lake house and its role in data governance and analytics. Tony Baer, Principal of DB Insight, joins host Keith Townsend for the discussion.

Links to Tony’s Stuff

Website – https://dbinsight.io

Research – https://www.dbinsight.io/our-research

Videos & podcasts – https://www.dbinsight.io/videos-podcasts

Data Lakehouse Market Landscape Report – https://www.dbinsight.io/form-data-lakehouse-open-source-market-landscape

Data Lakehouse Market Landscape Report DEEP DIVE – https://www.dbinsight.io/form-deep-dive-data-lakehouse-open-source-market-landscape

LinkedIn – https://www.linkedin.com/in/onstrategies

Agenda

– Introduction to the concept of a data lake house

– The role of Apache Parquet and cloud object storage

– Use cases for data lake houses and raw data

– The connection between data lake houses and governance

 

Takeaways

Takeaway 1: Watsonx.Data aims to provide a more efficient, organized, and accessible data storage solution

 

IBM’s Watsonx.Data is designed to address data sprawl and the shortcomings of unmanaged data lakes by providing an efficient, organized data storage solution. With its focus on data governance and performance, Watsonx.Data aims to make data management more accessible for enterprises.

 

Tony Baer explained that Watsonx.Data is built on the lakehouse concept, which combines the flexibility and manageability of a data warehouse with the storage capabilities of a data lake. “What a data lake house does is it puts a table format that’s ACID compliant on top of data that sits in cloud object storage in basically a supported popular file format, usually Parquet,” Baer said. Because the data is managed as tables, Watsonx.Data also enables better data governance and access control, along with performance benefits.
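
To make the idea of an ACID table format layered over files more concrete, here is a minimal sketch in Python. It is a toy model, not how Watsonx.Data or any specific table format is implemented: the `ToyTable` and `Snapshot` names are hypothetical, a dictionary stands in for cloud object storage, and lists of rows stand in for Parquet files. The point it illustrates is that data files are immutable and a write only becomes visible when a new table snapshot is committed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One committed table version: an immutable list of data file names."""
    version: int
    files: tuple


@dataclass
class ToyTable:
    """Hypothetical stand-in for a lakehouse table format (not a real API)."""
    # Assumption: a dict plays the role of cloud object storage, and each
    # value plays the role of one immutable Parquet file.
    object_store: dict = field(default_factory=dict)
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, file_name: str, rows: list, expected_version: int) -> Snapshot:
        # Optimistic concurrency check: refuse the commit if another writer
        # has already moved the table past the version this writer read.
        if expected_version != self.current().version:
            raise RuntimeError("conflict: table changed since it was read")
        # The data file is written first; readers cannot see it yet because
        # no snapshot references it.
        self.object_store[file_name] = list(rows)
        # Publishing the new snapshot is the single step that makes the
        # write visible, so readers see all of it or none of it.
        new_snapshot = Snapshot(expected_version + 1, self.current().files + (file_name,))
        self.snapshots.append(new_snapshot)
        return new_snapshot

    def read(self, snapshot: Optional[Snapshot] = None) -> list:
        # Readers resolve a snapshot and then read only the files it lists.
        snap = self.current() if snapshot is None else snapshot
        return [row for name in snap.files for row in self.object_store[name]]
```

A reader that picked up the table before a commit keeps seeing the old, consistent version, while new readers see the appended rows. That behavior is what distinguishes a governed table from a loose collection of files in a bucket.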

 

Keith Townsend: If I’m a CTO or enterprise architect and I’m trying to create mass-appeal AI capabilities (training data, inferencing data, collecting the data, storing the data), where does a data lake house such as IBM Watsonx.Data fit into my strategy?

Tony Baer: Where it fits into your strategy is if you basically want to expand on the traditional analytics that you’re performing, and you may also be doing some data science there. As I said, it’s certainly not impossible to do data science within a relational table format if your data scientists are willing to work with data that’s already been structured per se. So where it fits in, what I see the lake house for primarily, is that it really extends the data warehouse to the data lake and treats the data in a data lake as a first-class citizen.

 

Takeaway 2: Data lake houses offer expanded analytics capabilities while maintaining data governance

 

Data lake houses, such as IBM Watsonx.Data, give organizations a way to extend their data warehouse capabilities while maintaining strong data governance. The lake house concept enables more advanced analytics and efficient storage, making it an attractive option for enterprise architects.

 

Tony Baer highlighted the advantages of data lake houses, explaining that they “extend the data warehouse to the data lake and treat the data in a data lake as a first-class citizen.” Lake houses such as Watsonx.Data also provide performance benefits and improved data governance, making them an appealing choice for organizations looking to optimize their data storage and analytics capabilities.

 

Tony Baer: So having a table format allows a lot of nice things concerning governance and access control, and in turn, because basically sorting through a table is a lot more efficient than doing a whole file scan, it also provides performance benefits. So, as I said, ACID support is really the key thing. But with that, since you have a table structure, it brings all these other good things.
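
One common way a table format can avoid a whole file scan is by keeping per-file statistics in its metadata and skipping files that cannot contain matching rows. The sketch below illustrates that idea only; it is an assumption about the general technique, not a description of Watsonx.Data internals, and the `FileStats` type, the `files_to_scan` helper, and the file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical metadata entry: one data file plus the min/max of a column."""
    path: str
    min_id: int
    max_id: int


def files_to_scan(manifest, lo, hi):
    """Return only the files whose [min_id, max_id] range overlaps [lo, hi]."""
    return [f.path for f in manifest if f.max_id >= lo and f.min_id <= hi]


# Illustrative manifest; the file names and ranges are made up for this example.
manifest = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# A query for ids 150 through 160 only needs to open the middle file.
needed = files_to_scan(manifest, 150, 160)
```

Without the table metadata, the same query would have to open and scan all three files to find out which ones hold relevant rows.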

 

Takeaway 3: Watsonx Governance focuses on model governance, but data governance remains an important aspect

 

IBM’s Watsonx Governance currently focuses on model governance, leaving data governance to be handled by Watsonx.Data. However, because both models and data are crucial for AI, the two need to be brought together for a more cohesive approach to governance.

 

Tony Baer emphasized the importance of integrating model governance and data governance, stating, “You need to find a way to bring these together.” While Watsonx.Data provides a solution for data governance, a unified approach to governance in AI remains an ongoing challenge for companies like IBM.

 

Tony Baer: What I’ve talked to IBM about is you need to find a way to bring these together. Which I think they’re still figuring out.

 

Insights surfaced

– A data lake house provides the flexibility, manageability, and consistency of a data warehouse while utilizing the storage of a data lake.

– Data lake houses bring ACID consistency to data, ensuring that it is current, valid, and consistent.

– The lake house extends the data warehouse to the data lake, treating the data in the data lake as a first-class citizen and expanding the scope of the data warehouse.

– Data lake houses are not the answer to everything, as they are not suitable for every type of data or modeling.

– IBM Watsonx Governance currently focuses on model governance rather than data governance, which Watsonx.Data handles.

 

Key quotes

– “What a lake house does is it puts a table format that’s ACID compliant on top of data that sits in cloud object storage in basically a supported popular file format, usually Parquet.”

– “A lot of data scientists might prefer, not that they might see that having data in a relational table structure might be a bit of a good strength for them.”

– “The data lake is essentially those raw Lego blocks. And that’s for experimentation and exploration also, again, and the data lake is going to be for data that’s not going to necessarily fit in a lake house.”

– “The reason the lake house will not replace the data lake is that the lake house, in effect, basically overlays a relational table structure, which is fine, basically, if you’re doing analytics.”

– “AI has basically two components, two ingredients. One is the model, the other is the data. You need two to tango. And right now, Watsonx Governance is all about model governance. It’s not about data governance.”
