Understanding the value of a Data Lakehouse

Transcript 3,532 words · about 24 min to read

Machine-generated from the episode audio and not hand-corrected, so names and technical terms may be imperfect. The audio is authoritative.

All right, you're listening to another episode of the CTO Advisor podcast. We're recording this fresh off of IBM Think 2023 in Orlando. With me I have a fellow principal (I'll call you a principal): Tony Baer, principal of dbInsight. I think if you add a couple more, you can be dbInsights. Tony, welcome to the show. First time you're on the show, right?

Yeah, I think so. Thanks for having me, Keith, and thanks for crediting me with plural insights, but technically, to be realistic, I basically say I have one insight, and it's all about data.

It is all about data, and I thought you were the perfect guest to talk about IBM watsonx and watsonx.data. It is part of their AI platform and toolset, and it is superseding everything that you've known about Watson before. Watson has gone; watsonx is now in vogue. So there's watsonx.ai, if you think about what Watson Studio and all that stuff was before; watsonx.data, which we'll get into; and then watsonx.governance, which is governance around data analytics.

I did a great podcast with Larry Kavala, Tim Crawford, and Maribel Lopez in which we went into some detail talking about watsonx, but I wanted to go specifically into detail around watsonx.data with you, Tony. When I heard the announcement, the way they described it was a data store to basically rule all data stores: something that makes the sprawl of data stores and data lakes disappear by adding yet another data lake. Am I getting that announcement wrong? What is watsonx.data, and where's the value?

You know, if you come from outside the data world, it's a little intimidating. If you're from inside the data world, the concept of a lakehouse may not be fully clear. In terms of the terminology, the concept was introduced three or four years ago. Well, there's been debate on this: there was a mention in a Snowflake customer blog, I think around 2017 or something like that. But in terms of really popularizing what the lakehouse is, we really have to credit that to Databricks. The short of it is that it's supposed to provide you with the flexibility, the manageability, and the consistency of a data warehouse, but with the storage of a data lake. And when I say storage of a data lake, this is basically object store. That's the high-level definition. As for the elevator pitch: why put in a data lakehouse?

What it really does, forgetting everything I just said in terms of what it is, is allow you to gain more confidence in the data that's sitting in your data lake, which is basically data sitting in probably something like a Parquet file in cloud object storage like Amazon S3. The idea here is that it's bringing ACID consistency. In other words, it gives you confidence that the data you're getting out of there is consistent, it's current, and it's valid.

So, if I think of this correctly (believe it or not, I'm just coming up to speed with the concept of Apache Parquet, which is an object file type for storing data), this is a pretty low-level file type, if I understand correctly. So for ingesting this into something like a data lakehouse, is there a fairly simple process?

Okay, let's get a little bit into the structure of what's in a data lake and in a data lakehouse. At the bottom is your physical layer, and that's your cloud object storage. Typically S3 is the default standard, but it could be something like MinIO, or Google Cloud Storage, or Azure Blob Storage or ADLS. Whatever it is, it's basically cloud object storage.

That's your physical layer. The advantage of cloud object storage (I should say object storage, because it doesn't have to be in the cloud) is that it's a very economical way to store data, and its durability is proven. So if you have lots of data and you want to keep it for a while, object storage is going to be the place to put it. You're not going to put it in, let's say, block storage, which is a lot more expensive. Now, within object storage you're storing files, and there are many different types of files that could go in there. You could have JSON data; you could have CSV files. Parquet happens to be a columnar format for files, which is kind of equivalent to the way that analytic databases store data. They tend to store in columnar formats. The idea is you're sorting by field rather than by row, which is the actual line item, because in analytics you don't care about individual line items.
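To make the by-field versus by-row idea concrete, here is a minimal Python sketch. It is not how Parquet itself is implemented; the sample records and the helper names (`to_columns`, `summarize`) are made up for illustration. It pivots row-oriented records into one list per field, so that an aggregate only has to touch the one column it needs.

```python
from statistics import mean

# Hypothetical order records, stored row by row (one dict per line item).
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 75.5},
    {"order_id": 3, "region": "east", "amount": 210.0},
    {"order_id": 4, "region": "west", "amount": 42.5},
]


def to_columns(records):
    """Pivot row-oriented records into one list per field (columnar layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def summarize(column):
    """Return the min, mean, and max of a single numeric column."""
    return {"min": min(column), "mean": mean(column), "max": max(column)}


columns = to_columns(rows)
# Only the "amount" column is read; the other fields are never touched.
amount_stats = summarize(columns["amount"])
print(amount_stats)
```

In a real columnar file the values of each field are also stored contiguously on disk, which is what lets an analytic engine skip the columns a query does not ask for.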

What you care about is: what is the minimum, what's the mean value in this field, or what's the max? So you're basically looking at what the typical values are in a particular column or field. So Parquet is a columnar file format. It's not the only one out there; there's also ORC, for instance. But Parquet has become, I would say, the de facto standard, an open source standard, for storing columnar data in cloud object storage, with Amazon S3 happening to be the de facto standard there.

So that's fine in terms of being able to analyze data. But with Parquet, you're not going to have anything like transactions. And by transactions, we're not talking about an online transaction system like an ATM system. We're talking about transactions where I do an update to data and I want to make sure that update did not get corrupted, so I don't have data that was partially updated here, fully updated there, and not updated over there. With ACID, basically, it either commits or it doesn't. As I said, that's really the elevator pitch, and what a lakehouse does is put a table format on top of that data.
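The all-or-nothing behavior can be sketched with a toy table format in Python. This is an illustration under simplifying assumptions, not the design of any particular lakehouse product: the class `ToyTable`, the `_log` directory, and the JSON data files are all hypothetical, a local temp directory stands in for object storage, and only a single writer is assumed. Data files are written first, and nothing becomes visible to readers until one atomic rename publishes a new snapshot.

```python
import json
import os
import tempfile


class ToyTable:
    """Toy table format: immutable data files plus a log of committed snapshots."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _latest_files(self):
        """Return (version, data files) for the newest committed snapshot."""
        versions = sorted(
            int(name[: -len(".json")])
            for name in os.listdir(self.log_dir)
            if name.endswith(".json")
        )
        if not versions:
            return -1, []
        with open(os.path.join(self.log_dir, f"{versions[-1]}.json")) as f:
            return versions[-1], json.load(f)["files"]

    def commit(self, batches, fail_midway=False):
        """Write each batch as a data file, then publish them all in one step."""
        version, files = self._latest_files()
        new_version = version + 1
        new_files = []
        for i, batch in enumerate(batches):
            name = f"data-{new_version}-{i}.json"
            with open(os.path.join(self.root, name), "w") as f:
                json.dump(batch, f)
            new_files.append(name)
            if fail_midway:
                # Simulate a writer crashing after a partial write.
                raise RuntimeError("writer crashed before commit")
        # Commit point: a single atomic rename makes the whole update visible.
        tmp_path = os.path.join(self.log_dir, f"{new_version}.json.tmp")
        with open(tmp_path, "w") as f:
            json.dump({"files": files + new_files}, f)
        os.replace(tmp_path, os.path.join(self.log_dir, f"{new_version}.json"))

    def read(self):
        """Return rows from the latest committed snapshot only."""
        _, files = self._latest_files()
        rows = []
        for name in files:
            with open(os.path.join(self.root, name)) as f:
                rows.extend(json.load(f))
        return rows


with tempfile.TemporaryDirectory() as root:
    table = ToyTable(root)
    table.commit([[{"id": 1, "qty": 5}], [{"id": 2, "qty": 7}]])
    try:
        table.commit([[{"id": 3, "qty": 9}], [{"id": 4, "qty": 1}]], fail_midway=True)
    except RuntimeError:
        pass  # the partial write was never published
    visible_ids = [row["id"] for row in table.read()]

print(visible_ids)
```

Because the second commit fails before its snapshot is published, readers still see only the rows from the first commit, and the orphaned data file is simply ignored. Production table formats layer concurrency control, schema handling, and cleanup on top of this basic idea.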

It's ACID compliant, on top of data that sits in cloud object storage in a popular file format, usually Parquet; on top of that it puts this table structure. As I said, ACID is really the most important thing; that's the raison d'etre. But there are a lot of other cool things you can do once you have that table format, because then you can start to get a lot more granular in how you manage permissions. In other words: can I access this data, or in what type of form can I access this data? Should I encrypt or mask this data? When you have a table format, you can be very selective, whereas if I just throw in a Parquet file, I would have to take all the data: I could either get access to that whole file or not. So having a table format allows a lot of nice things with regard to governance and access control. And in turn, sorting through a table is a lot more efficient than doing a whole file scan.
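The difference between whole-file access and selective, column-level access can be sketched in a few lines of Python. This is a hypothetical illustration only: the `TABLE` rows, the `POLICY` mapping, and the role names are invented, and a real lakehouse would enforce such rules in its catalog or query engine rather than in application code.

```python
# Hypothetical table and per-role column policy (names are illustrative only).
TABLE = [
    {"customer_id": 101, "region": "east", "email": "ana@example.com", "ssn": "123-45-6789"},
    {"customer_id": 102, "region": "west", "email": "raj@example.com", "ssn": "987-65-4321"},
]

POLICY = {
    "analyst": {"customer_id": "clear", "region": "clear", "email": "mask"},
    "auditor": {"customer_id": "clear", "region": "clear", "email": "clear", "ssn": "mask"},
}


def mask(value):
    """Replace all but the last four characters with asterisks."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


def read_table(role):
    """Return only the columns a role may see, masking where the policy says so."""
    if role not in POLICY:
        raise PermissionError(f"role {role!r} has no access to this table")
    rules = POLICY[role]
    return [
        {col: mask(row[col]) if rule == "mask" else row[col] for col, rule in rules.items()}
        for row in TABLE
    ]


analyst_rows = read_table("analyst")
auditor_rows = read_table("auditor")
print(analyst_rows[0])
print(auditor_rows[0])
```

With only a raw file, the choice collapses to handing over every field or none; with a table schema, each role can get its own view of the same underlying data.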

It also provides performance benefits. So, as I said, ACID support is really the key thing, but since you have a table structure, it brings all these other good things.

So let me walk through some use cases so that I understand. If I have just the flat Parquet file, I can store it in my S3-compliant storage system, whether that's online or some other object storage that has an S3 interface. I get a consistent API interface to cheap and deep storage, and the Parquet format is generic and open enough that I can store and share it using basic cloud technologies. So this is very useful for exporting and sharing data on known infrastructure. Now, if I want to do something more interesting with that data, say run advanced analytics against it, I need some layer of translation or extraction. So I have ACID, and this gives me a table structure.

I can do, again, as you mentioned, governance, et cetera. So there are use cases for both. If I just have raw data and I want somebody to do some type of ETL and translation on that data down the line, probably the most efficient way to get it to them is Parquet, because it's an open standard. It's easy to transfer and easy to store; maybe not as easy to work with, but easy to store. If I want to do analysis against that data,

I want some additional layer. Am I following this correctly?

Well, put it this way: pretty much. The one difference here, and I think you're really kind of hinting towards it, and this is also the reason why I say that the lakehouse will not replace the data lake, is that the lakehouse in effect overlays a relational table structure. That is fine if you're used to doing analytics. But if you're a data scientist and you want to work with this data raw, or you're building some sort of machine learning model, you may not want to be bound by the restrictions of all those tables. As I said, the downside will be that the data may not be current or transactionally correct. But if you're training a model and you're doing it with large reams of data, you're probably okay, because this is not going to be transactional data.

It's not going to get refreshed. I might be working with some geospatial data from NASA or something like this. It's very static.

Exactly, and that's not going to readily fit into a basic relational table structure per se. That type of data, for instance, may be very well suited for JSON. Number one, it probably comes naturally in JSON format; it tends to be nested files. There are certainly ways to work with JSON relationally and vice versa; Oracle has actually done some fairly cool things there. But it depends on how you're looking to use the data. If you want to use it in a form that you can access relationally, a lakehouse is great. Now, just to compound matters a little bit:

You talk to the Snowflakes of the world, and they want to appeal to the data scientists as well. So they're saying, hey, through Snowpark you can get at this data, and you can use Python routines that are implemented as user-defined functions. They're working very closely with Anaconda there, and they're doing some interesting work. I'm looking forward to getting together with them at the end of June to learn where they've gone with that. I did have some good discussions with them last year, when they really kicked off the relationship. So it's not to say that once you go into a lakehouse you can't do data science. It's just to say that a lot of data scientists might prefer not to; they might see having data in a relational table structure as a bit of a constraint for them.

So if I'm a CTO or enterprise architect and I'm trying to create mass-appeal AI capabilities (training data, inferencing data, collecting the data, storing the data), where does a data lakehouse such as IBM watsonx.data fit into my strategy?

Where it fits into your strategy is if you want to expand on the traditional analytics that you're performing, and you may also be doing some data science there. As I said, it certainly is not impossible to do data science within a relational table format, if your data scientists are willing to work with data that's already been structured per se. So where it fits in, primarily, for me, is that the lakehouse really extends the data warehouse to the data lake and treats the data in a data lake as a first-class citizen. The reason why I say that is that before this, you could do federated query from a data warehouse to data sitting in a cloud object store, but you wouldn't get that transactional consistency. You wouldn't be able to enforce all the goodness of governance and get the performance advantages. So what this does is bring that data that's in the data lake, in S3 storage or similar, and make it a first-class citizen in your data warehouse. It basically expands the scope of your data warehouse, and with the economics of storage and the economics of cloud compute, you can now really stretch that data warehouse.

That's where I see it fitting.

So let's talk about this from a practical sense as I'm working through it. The lower down the data set we go, the less structured the data; and the less structured the data, the more power I have to look at the raw data and reimagine it, right? But that's like getting a box of Lego. Okay, I have a box of Lego and I want to create a Millennium Falcon. Wow, I need a lot of skill to get from the box of Lego to the Millennium Falcon. I guess if the end objective is a Millennium Falcon, but I want some customization, then I can buy a kit, and the kit comes with instructions. But I'm not going to end up building a Star Trek Enterprise from that kit. I'm going to build variations of a Millennium Falcon. It might be somewhat customized, but that's it. So as the enterprise architect, if you're servicing a broad set of people who have business objectives, those business objectives have to be aligned if you're using a data lakehouse. Some people are going out doing geospatial-type models, another group of people are doing large language models, and yet another group of people are doing models with molecules. One, that data isn't going to be the same, and two, the data lakehouses are going to be independently different for each one of those use cases, based on the desired result.

Am I tracking this?

No, I think you're on to something there, because, again, as I was saying before, the lakehouse is not going to be the answer to everything. I think what you're going to see is an extension of the levels where you kind of graduate, just like you were saying before. Starting out, you're just doing a little prototype, so you're just going to take Lego blocks and assemble them to get a primitive idea. Once you have an idea, then you're going to start to get kits that give you pieces that are more preconfigured for the task. So the way I see it, the data lake is essentially those raw Lego blocks, and that's for experimentation and exploration. Also, again, the data lake is going to be for data that's not necessarily going to fit in a lakehouse. And that includes large language models, which are not about structured data. They're about language.

They're about language Well as they are same with geospatial now, there is relational data You can extract out of all this but it would be for different purposes So on one end you are definitely still going to have data lakes that are going to be special They'll be special purpose whether it be for exploration or for the types of modeling that you cannot do with relational data Atop that you'll then have the lake house. We're doing more of the extended, you know analytics That you know that you know that you perform And then there still will be a role for the you know For the for the data mark or the specialized data warehouse because you're not necessarily going to have like You're not always gonna necessarily need petabytes of data to find out which sales rep did which did best in which territory last quarter?

Let's kind of wrap this up into a bow with watsonx.governance. As I'm looking at all the challenges around governance in data, and having my data in a data lakehouse, or the capability provided by a data lakehouse, how does that connect to governance?

Oh, that's a good question, and this is basically the beginning of an ongoing discussion I've been having with IBM. Because when you look at governance in AI, AI has basically two components, two ingredients: one is the model, the other is the data. You need two to tango. And right now, and I think this is going to be subject to change, watsonx.governance is all about model governance.

It's not about data governance. Data governance would, at least at this point, be handled by watsonx.data. What I've talked to IBM about is that you need to find a way to bring these together, which I think they're still figuring out.

So, Tony, where can folks find your musings? Where can they find your dbInsight?

Okay. Well, all the insight that I have from dbInsight is on dbinsight.io. DB is short for database, so D as in David, B as in boy, insight dot io. That's where you can find me. On my site I have links to all of my public research, and I also have links to all of my videos and podcasts, such as what we're talking about today.

All right. If you want to find out more about the CTO Advisor, you can follow us on the web at thectoadvisor.com. It's funny you mentioned that some people like to add an s to the end of dbInsight; some people like to add an s to the end of CTO Advisor.

I did that at IBM. It's only the primary CTO advisor at the CTO Advisor. You can reach me on Twitter @CTOAdvisor. Talk to you next CTO Advisor podcast. Thanks, Tony.