How to Scale Data Governance Across your Modern Data Stack
Organizations that have to address regulatory needs have to answer so many data governance questions. This can’t be a manual process anymore.
This week, Juan Sequeda and Tim Gasper will be joined by Ben Clinch, Principal Enterprise Architect at BT to discuss automating data governance.
Speaker 1: This is Catalog & Cocktails. Presented by data.world.
Juan Sequeda: All right.
Tim Gasper: All right. Hello everyone. Welcome, welcome to Catalog & Cocktails. It's your honest, no BS, non-salesy conversation about enterprise data management with tasty beverages in hand, presented by data.world. I'm Tim Gasper, longtime data nerd, product guy, customer guy at data.world, joined by Juan Sequeda.
Juan Sequeda: Hi, Tim. And I'm Juan Sequeda, I'm the principal scientist at data.world and, as always, it's a pleasure to take a break, middle of the week, end of the day. And the honest, no-BS thing here right now is that we are live in London, but you're actually seeing a recording, because we were going to be streaming this from the place where we are, but the wifi sucks. But this is really just to show you that we really are on the road. We're doing this, and this is the first time it's ever happened. But we're still going to be streaming this live, prerecorded, anyways. But we are in London. Why are we in London? Because we were at Gartner Data & Analytics, and it's been a while since we've actually done a show together in person with the guest. And our guest today is Ben Clinch from BT Group.
Tim Gasper: Welcome Ben. How are you doing?
Ben Clinch: I am having a great day. Thank you.
Juan Sequeda: Well, thank you for joining us here.
Tim Gasper: Great to have you here.
Juan Sequeda: No. So, today, I mean, coming from Gartner, a bunch of stuff going on right now. But hey, first of all, let's tell and toast. What are we toasting and what are we drinking here today?
Tim Gasper: Yeah. So we're at the Cinnamon Club in Westminster and we are drinking some delicious cocktails here from Tejas and I'll tell you what they are. So, first of all, we got the Westminster gin and tonic. That's the one that you're drinking. It's got Monkey 47, rosemary, and black olive tonic in it, so very interesting there. We have a delicious dark and stormy. Always a good classic. And then, I've got a lime leaf Collins with Bombay Sapphire gin, lime leaf, lime, and soda. So we got some fancy cocktails going on here.
Juan Sequeda: Today is truly a Catalog & Cocktails [inaudible].
Tim Gasper: Truly, Catalog & Cocktails.
Juan Sequeda: All right, well, cheers.
Tim Gasper: Cheers.
Juan Sequeda: We're doing this in person finally and having more of this stuff. So warmup question we got today from producer, what's your favorite London trend food that is hard to get in the United States?
Ben Clinch: So I'm thinking steak and kidney pie. Did you get that?
Juan Sequeda: Yeah, kidney pie? No, that's a hard one to [inaudible]-
Tim Gasper: I think it's pretty tough to get in the United States.
Ben Clinch: It's fantastic.
Juan Sequeda: It is really good.
Ben Clinch: Some Guinness in it.
Juan Sequeda: That Ismail. We're having a nice dinner after this, but that sounds really great.
Ben Clinch: That sounds really great.
Tim Gasper: Oh, that's a hard question to answer because I honestly haven't explored enough British food. I don't know if that's a good thing or a bad thing, but I wish there was fish and chips a little closer to where I live. There's nothing nearby there. So that's a relatively speaking difficult thing for me to get.
Juan Sequeda: Well, it is just not a London tours or the European, it's like just you can go out and walk around the water and just sit down and have a play. That's something I really enjoy. Not only here in London but also in other places in Europe, which is not always an easy thing to do in the United States. But all right, kick it off, Ben. Honest, no BS: automating data governance. Is it really possible or not?
Ben Clinch: Absolutely and increasingly so. So huge amounts of opportunities around active metadata and semantic discovery of data which are progressing at pace. So each week there seems to be more opportunity to be able to automate this further, but I think, for most organizations, it's a gradual thing that you need to adopt, iteratively.
Juan Sequeda: So let's go back into... Data governance is something that we've been talking about. I mean now it's such a big thing, but when does it all start? Your background, I mean, you're in telecommunications now, but your background has been finance. Give us your drive through history here.
Ben Clinch: Fantastic. So obviously data's been very important in the technology industry for a very, very long time. But it really reached maturity after the subprime issues.
Juan Sequeda: The 2008?
Ben Clinch: 2008, the collapse of Lehman's, and some of the regulators were saying to the banks, "We need to have greater confidence in the data that you're sending us," and also measuring exposure to other organizations and in this case, Lehman's. And there was a creation of regulation called BCBS 239, from the Basel Committee on Banking Supervision. They put together this regulation effectively requiring the financial services to be able to demonstrate the data quality of their data, the provenance, where it came from within the organization and how it could be relied upon. And that really was an incredible piece of regulation because it really forced a level of thinking and rigor around that data that hitherto had not really had a proper discipline around it, and some amazing advances in that space during that time. So much so that the financial services have really sort of set a bar associated with some of the best practices.
Tim Gasper: So really the financial institutions, when BCBS 239, right?
Ben Clinch: Yep.
Tim Gasper: When that came into place, all of a sudden innovation just had to accelerate massively in terms of how are we going to keep track of this? Are we on data infrastructure that's going to let us do these things? Are we auditing everything, et cetera, et cetera?
Ben Clinch: Absolutely. Well, interestingly, innovation didn't necessarily catch up as quickly as it could have. So there was an awful lot of-
Juan Sequeda: This is an interesting point. It's like, " Well, that means that we should have been automating this 15 years ago, whatever, but absolutely we're not there yet. And now is the moment."
Tim Gasper: So was the initial response, very manual in nature?
Ben Clinch: Very manual. And sometimes that's how you have to start. So to be able to automate something, you have to understand what you're trying to achieve. And so, a lot of the data management principles that were established then around good stewardship or ownership of data, where it was sourced from, how it was described, were often highly manual and what we would call very steward-led. And that was okay to some degree because, particularly in the financial services, by definition banks have a lot of money and so they could throw bodies at problems. And actually, also the scope of the data wasn't always that huge because it was very much focused on the regulatory reporting data that was being shared with the regulators. So people would start with that. And so, that meant that the scope of the data was manageable. But actually, increasingly what we're seeing is, as that best practice is going out to other industries who maybe aren't as willing to do these things as manually as the banks previously had, there's a much bigger push towards how can we do this with technology, how can we really automate this and innovate, as you say. And that's really accelerating. The banks are now becoming great adopters of that technology as well, but it's really moving at pace in terms of the automation. There's some exciting things that I'm sure we'll discuss.
Tim Gasper: Yeah, no, that's super interesting. So before we leave the history of things. So you mentioned about some of the regulation that came into the financial industry. Is there anything else that you would call out from a historical standpoint that you think has really driven a lot of the push for governance and then now the push for automation?
Ben Clinch: Well, absolutely. So one, I think accountability was really important. So understanding who actually was responsible for that data and helping organizations create a set of principles associated with that. But actually, interestingly, some of the regulators, particularly in the commodities space, have started to describe their regulation as code.
Juan Sequeda: So this is now a two-way street.
Ben Clinch: Yes. And these are, I mean, obviously regulators want to make their regulation as easy to follow as possible. I mean initially some of the guidance was fairly open-ended and open to interpretation. The more that regulators can be specific around that, the easier it is for organizations to innovate and scale. So that's a really interesting innovation.
Tim Gasper: That is super interesting. And as regulation gets more code and it becomes more explicit versus vague, that probably increases the opportunity for automation now. Because now, you're automating towards something very specific.
Ben Clinch: Yes, because otherwise each organization has to almost codify its interpretation of a rule, often in plain English, and where interpretation is open-ended or ambiguous, then there's an opportunity for people to misinterpret or misapply. So when it's very specific, then you can actually say, an example that I often use is if German citizen data is only allowed to be accessible from Germany. We would codify that from a perspective of: if this is German citizen data, then it must be stored and only accessible in Germany, according to X regulation. So you've actually got an "if this, then that, according to" statement, where the "according to" gives traceability back to the original regulation.
Juan Sequeda: Yeah. So basically, almost every rule which is codified actually has its lineage back to the reason for its own existence. And you can see all these rules, at the end it's like, "Well, I have this rule which I'm already using for some particular regulation and, by the way, it addresses some other regulations." They have either the same rule or very similar rules and so forth, and you can do that.
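The "if this, then that, according to" pattern Ben describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AccessRequest` fields, the `Rule` class, the rule ID `R-001`, and the placeholder `"X regulation"` are all hypothetical names invented for this sketch, not part of any regulator's published code or any product's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical metadata needed to evaluate the rule."""
    data_subject_country: str  # whose data it is, e.g. "DE" for German citizen data
    storage_country: str       # where the data is stored
    requester_country: str     # where the access request comes from


@dataclass(frozen=True)
class Rule:
    rule_id: str
    according_to: str                              # traceability back to the regulation
    condition: Callable[[AccessRequest], bool]     # the "if this" part
    requirement: Callable[[AccessRequest], bool]   # the "then that" part

    def is_compliant(self, request: AccessRequest) -> bool:
        # A request the rule does not apply to is trivially compliant.
        return (not self.condition(request)) or self.requirement(request)


# "X regulation" is kept as a placeholder, as in the conversation.
GERMAN_RESIDENCY = Rule(
    rule_id="R-001",
    according_to="X regulation",
    condition=lambda r: r.data_subject_country == "DE",
    requirement=lambda r: r.storage_country == "DE" and r.requester_country == "DE",
)


def check(rule: Rule, request: AccessRequest) -> str:
    """Return a verdict that always names the rule and its source regulation."""
    if rule.is_compliant(request):
        return "compliant"
    return f"violation of {rule.rule_id} (according to {rule.according_to})"
```

Calling `check(GERMAN_RESIDENCY, AccessRequest("DE", "DE", "GB"))` reports a violation that names both the rule and the regulation it traces back to, which is the traceability being described.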
Ben Clinch: That's it. And as the regulation changes, then you actually have traceability. Because regulation always evolves and that's cool, the requirements change. And so, you have that flexibility, you can trace it back, you can say, " These rules will now be impacted and updated as a result of that."
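A rough sketch of that traceability, assuming each codified rule simply records the regulations it traces back to. The rule IDs and the names "Regulation X" and "Regulation Y" are made up for illustration; a real implementation would more likely live in a metadata catalog or knowledge graph than in a dictionary.

```python
from collections import defaultdict

# Hypothetical mapping: rule id -> regulations the rule traces back to.
# One rule can address more than one regulation.
RULE_SOURCES = {
    "R-001-german-residency": ["Regulation X"],
    "R-002-retention-limit": ["Regulation X", "Regulation Y"],
    "R-003-access-logging": ["Regulation Y"],
}


def index_by_regulation(rule_sources):
    """Invert the mapping: regulation -> sorted list of rules derived from it."""
    index = defaultdict(list)
    for rule_id, regulations in rule_sources.items():
        for regulation in regulations:
            index[regulation].append(rule_id)
    return {regulation: sorted(rules) for regulation, rules in index.items()}


def impacted_rules(changed_regulation, rule_sources):
    """List the rules that need review when a regulation changes."""
    return index_by_regulation(rule_sources).get(changed_regulation, [])
```

With this shape, a change to "Regulation Y" immediately yields the two rules that have to be reviewed, and a rule that serves several regulations shows up under each of them.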
Juan Sequeda: So what's fascinating about this is that we're now at a meta level. We talk about provenance, about where data comes from, but here is like, "I want to have the provenance of where all these regulations, these rules come from." The things that we're trying to go automate saying, "Well, this exists for this reason and how can we reuse that too and so forth."
Ben Clinch: Absolutely.
Tim Gasper: Also, on this sort of regulatory front, for some of our listeners that are not as familiar with what the regulations are really driving. You mentioned about provenance, you mentioned about auditing. What are the different activities that an organization has to do to support these regulations, but then there's probably a little bit of a broader umbrella of just governance in general. What are the different activities and tasks that an organization has to do and why?
Ben Clinch: That's fantastic. So there are a number of them. I guess starting with privacy. Obviously, it's really important to treat people's data with due care and respect, ensuring that you're making the most out of that data to provide good services for them, of course, but also making sure that you are maintaining their privacy. So a key thing in Europe, for example, is GDPR, which is a key regulation. For example, one aspect is a thing called Article 30, where organizations are required to be able to demonstrate where data is coming from and going to with regards to personally identifiable information, who's processing that, in what location and for what purpose. And that that's defensible and proportionate. Also, things like retention and archival of data. So people are only keeping data for a specific, justifiable business purpose for as long as is justifiable. But also, that they're deleting that. And it's not only that it's justified, but also that when people request the right to be forgotten, their data is deleted. That's an important principle in terms of privacy, and also that people can ask what data is being stored about them, so they have subject access requests.
Juan Sequeda: So there's all these questions that, I mean basically, if we look at these regulations and what we're trying to govern the data for, they're effectively a set of questions, queries that we're trying to go answer. And those things should be automatic, right?
Ben Clinch: Absolutely, right.
Juan Sequeda: So if we think about this just as like, "Look, the policy I have to go address is really a question, and that question is a query," then if you think about it that way it's like, "We're just writing a query over data." Why would we even think this is a manual thing? Just codify it, press a button, the answer is there, we're done. So I think that's kind of the mentality we need to get to. But right now, that whole process, imagine trying to go resolve a question right now where you don't have a database that you're going to go run your query on. I mean today, we have these issues with data, I'm trying to integrate data. There's still a problem here about automating, but we've got to see the problem like this. It can't be that complicated. We can't think about it. It shouldn't be that complicated.
Ben Clinch: 100% right. And actually, as you codify this stuff, you can then say, "Well, we now know the metadata that we've got to be collecting." So it provides a framework in which you can then start to... We're taking science and engineering best practices and applying them effectively to the law in this instance. So that you can say, "Okay, we need to know that this is German citizen data. We need to know the server location, we need to know the location of anybody who's trying to access this." And then, we apply business rules over the top of it that ensure that you're enforcing compliance and detecting any potential anomalies associated with it. But I mean, we're talking about the defensive aspect there as well. I mean the same goes for the offensive data management approach, which is the value of data. If we know that certain types of data are valuable for a particular business process or business capability, then we know we can detect that and we can still codify that. If this is this type of data, then it is valuable for X business capability according to the owner of that business capability, for example. Then you can then say, "Okay, we now need to semantically discover that information." And from my perspective, the more that we can understand the value and the context of data, whether it's regulatory or privacy driven, or value driven, it's actually very similar in terms of the way you codify it and scale that.
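To make the "a policy question is a query" idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the processing metadata already sits in one table; the table name `processing_records`, its columns, and the sample rows are all invented for illustration. The question answered is loosely modelled on the record-of-processing question Ben mentions: which datasets hold personal data, who processes them, where, and for what purpose.

```python
import sqlite3


def build_metadata_db():
    """Create an in-memory metadata store with a few hypothetical records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE processing_records (
               dataset TEXT,
               contains_pii INTEGER,  -- 1 if the dataset holds personal data
               processor TEXT,
               location TEXT,
               purpose TEXT
           )"""
    )
    conn.executemany(
        "INSERT INTO processing_records VALUES (?, ?, ?, ?, ?)",
        [
            ("customer_accounts", 1, "billing-service", "DE", "invoicing"),
            ("network_telemetry", 0, "ops-analytics", "GB", "capacity planning"),
            ("support_tickets", 1, "helpdesk-vendor", "IE", "customer support"),
        ],
    )
    return conn


def pii_processing(conn):
    """The governance question, written as a query over the metadata."""
    return conn.execute(
        "SELECT dataset, processor, location, purpose "
        "FROM processing_records WHERE contains_pii = 1 ORDER BY dataset"
    ).fetchall()
```

The hard part in practice is the one Juan points at: getting the metadata collected and integrated so that there is something to run the query against. Once that exists, answering the question is a single call such as `pii_processing(build_metadata_db())`.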
Juan Sequeda: The key word for me here has been codify. That's what we need to, I mean, we can't automate what we can't read basically from a machine perspective. So we need to be able to codify.
Tim Gasper: So we can't just trust that ChatGPT will solve it for us.
Juan Sequeda: It will probably help, but no.
Tim Gasper: It won't solve it on its own. And the codification and the metadata seem like they're both key aspects there.
Ben Clinch: Absolutely. And look, I don't want to underplay, large language models can do a huge amount to help with this, but actually for me, the core is first of all, organizing your metadata into an information model. And so, we talked a little bit about the metadata as an example there. I see this as a continuum that is described as a knowledge graph effectively. So this is how you structure your data in such a way that is scalable, and then the business rules help you leverage that information.
Juan Sequeda: Yeah. So this is definitely, I mean you're singing here to the choir, you're preaching to the choir here. I mean, at least to me. But I think the codification is really about, let's put this in a structured way that the machines can understand it. And then, it's all about linking them together. We're just like, " Well, I have this rule, this policy, it's related to some other policy, to some regulation." Now, you start making those relationships, " Hey, this looks like a graph." So effectively, what we're seeing right now is that knowledge graphs are the way of being able to codify the metadata, these policies altogether. And we were talking on our way over here, on our ride over here it's like we were having another podcast that we should have been recording that conversation, but I was telling you-
Tim Gasper: Uber sessions, right?
Juan Sequeda: Yeah.
Tim Gasper: The VIP episode.
Juan Sequeda: Yeah, we should have one of those too. Yeah. But I was telling him, I have been talking to Gartner for almost like 10 years saying, " Hey, semantics and..." Even before knowledge graph was the thing in doing that, and it's now getting there and it's really, really cool. Cool is the word, I mean. When I go talk to people about knowledge graphs, there's less amount of eyes staring at me like, " What do you mean?" No. People either now get it or they're like, " Yeah, I've been hearing it all the time. I need to get this more, explaining more." And there's still people like, "I don't know," but it's really been decreasing, and this is really exciting. I mean that's my perspective, but-
Ben Clinch: I absolutely agree. And actually, I mean instantly. So I was a relational data guy a while back and as soon as I was educated around knowledge graph, I instantly saw the value. But that's because I've lived and breathed data for a very long time. Now, going to Gartner this week, there has been so much talk about the knowledge graph being the center of the data fabric and it's so refreshing. And actually, having said that though, I still think there's a lot of opportunity for better training and better access to this information. I'm a big proponent of a number of different courses and books that can get people up to speed. But actually, I set up a graph guild at BT, where I currently work, 200 members now. So we meet regularly, we talk to semantic experts and we explore different graph technologies. We use graph quite a lot because, effectively, a network, which is a major part of our proposition is a physical graph. It's literally a graph instantiated by nodes and edges that you can literally touch.
Tim Gasper: Yeah. You have graph operating at a few different levels, right?
Ben Clinch: Yes.
Tim Gasper: There's sort of the network, the graph of the actual infrastructure and things like that. And then, you have more of your conceptual graph and your governance graph and things like that.
Ben Clinch: Absolutely. That's exactly right. And it was more successful than even I expected, to be honest, the people are hungry to learn about this. And actually, some of the things that we've been exploring is how you explain graph to people in a way that is instantly accessible. And so, there's a couple of different examples. One is everybody gets mind maps, and in many ways a graph is a kind of more rigorous mind map in that sense. The other thing is trying to explain to people the importance of an information model. So actually, the ontology within a graph, and the taxonomy, is effectively a data model. And for a long time, the vendors around data lakes and other sort of NoSQL kind of structures were saying, "You don't need a data model anymore." And what that really resulted in was people putting data into lakes that they couldn't understand, or that data scientists, really skilled data scientists, were spending all of their time effectively becoming data janitors, trying to work out and clean the data and reorganize it, when it could have been done once using a schema or an... So I'm sure you are familiar with this.
Juan Sequeda: I think by now people are realizing that-
Tim Gasper: The pendulum is swinging back there. Yeah.
Juan Sequeda: The pendulum is swinging. I always talk about the two types of truths, the ordinary truths and the profound truths. And an ordinary truth is one whose opposite is a falsehood. So, database systems need to be faster. Well, of course, because if I negate it, database systems need to be slower. That's dumb-
Tim Gasper: That doesn't make sense. Yeah.
Juan Sequeda: ... so it's false. So it's an ordinary truth. But a profound truth is one whose opposite is also a profound truth. So you could say database systems need to have very strong, well-defined schemas. Well, the opposite is database systems do not need to have well-defined schemas. Well, that's what we've been doing for the last... We had a pendulum on one side, that was the whole NoSQL movement and that was good. And now we're realizing, so I think the whole schemas and semantics around data is a profound truth that we can have this discussion.
Ben Clinch: I 100% agree, and actually, we don't have to make it hard. As we say, there's a lot of technologies that we mentioned earlier that can actually speed this stuff up. But I mean even now talking across the industry to people, some of them are saying, "We don't need a data model." So there's an analogy that I use a lot and actually, I spoke about it at Gartner yesterday, which was when trying to explain the importance of a data model to people who are unfamiliar with why that's important. I use the example of the org chart. So, of course, people are a very important asset in any business. I mean that's apparent to everyone, from the CEO right the way down. Everybody knows where they sit in the org chart. It's the first thing you look up when you join an organization-
Tim Gasper: Who's my boss and who's my boss's boss, right? And...
Ben Clinch: Absolutely. And then, how are we organized and how do I contact, or how do I locate other people within the organization? What kind of title do they have? What seniority do they have? Who pays for them? The cost [inaudible]-
Tim Gasper: There's a lot of context that is embedded in an org chart structure, right?
Ben Clinch: Absolutely. And would you ever hear somebody say, "Well, we don't need an org chart. We don't need to organize our people, or what's the return on investment on an org chart?" I mean, I note these. Well, I [inaudible]-
Tim Gasper: Why don't we wait until we get big enough and then maybe we'll establish an org chart.
Ben Clinch: Well, yes. Actually, that, in a very small company, might be justifiable.
Juan Sequeda: Exactly. But that's when you realize there is none. But then it's a pointed flip. Yeah.
Tim Gasper: But once you have more than 20 people or whatever it is. Yeah, yeah.
Juan Sequeda: So I love this analogy you're doing, but let's go into the work that you've been doing with so many people around the EDM Council and CDMC. This is something where so many people in the world, companies, have come together to develop all these data models so we don't reinvent the wheel. And so, please tell us more about this.
Ben Clinch: Yeah, that's about right. So one of the things that's really exciting, and I volunteer at the EDM Council, I'm a certified trainer in DCAM and CDMC, so it's just a...
Juan Sequeda: No, I know we're a non-salesy podcast, but anything about education, please go sell it.
Ben Clinch: And it is non-profit.
Juan Sequeda: Yes, so please go sell, sell, sell, sell.
Ben Clinch: So DCAM, which stands for Data Management Capability Assessment Model, is a framework that was developed by the EDM Council primarily in response to BCBS 239. It was heavily informed by the banks initially and is now widely adopted by many regulators and many industries. There's a fantastic free benchmark report where you can see different sectors around the world and their maturity. Because against this, it's very objective. You can measure how mature you are as an organization against objective measures. And it's very much based on what you do, not how you do it, recognizing that technology and processes and approaches change, but actually the fundamentals of what you need to do or achieve can be measured. And so, we use that and, as I said, I'm a trainer in that. It's on version two at the moment, and there is actually going to be a version three coming next. So it's going to be developed later in the year. There's 360 members of the EDM Council and the huge multinationals, often including all the hyperscalers as well. So one of the other developments that, actually, I was involved in, I led a couple of the sub-capabilities, this is a thing called Cloud Data Management Capabilities or CDMC. And this was established by, well, I mean just by sheer coincidence, I already knew a gentleman called Oli Bage at the London Stock Exchange Group and Richard Paris, who I've worked with previously at UBS. And when they were working together, they were saying, "We really find that when we're interacting with hyperscalers about what our requirements are, there's not a standard. So we're having to say the same kind of requirements separately to each hyperscaler. Why don't we explore the opportunity to be able to create the standard that everybody's, and that's going to help the hyperscalers serve us and actually help people get value and manage down the risk of cloud."
So they approached the EDM Council, Mike Meriton and John Bottega, great leaders, and they basically said, "This is a great idea, let's explore this." Now, it became a runaway success. There were 100 companies involved, 300 businesses-
Juan Sequeda: So this is basically defining all the metadata about all the types of data sources, things that are going to be in the cloud.
Ben Clinch: Well, yes, it does include that actually. And yeah, it's a very good point. But it's all data management practices and that you can objectively measure and the automation thereof. That means that the scale in [inaudible]-
Juan Sequeda: So this goes back to what we were talking about earlier, this is now all these regulations or these rules, they'll codify that you know what this is and we're having [inaudible]-
Ben Clinch: That's it. And when you're at the hyperscaler kind of size. I mean over a petabyte of data, you really can't throw bodies at this stuff. I mean it's a bit of a simplification because it's probably complexity more than scale, but there's sort of a general rule that that can apply. And actually, you hit on a really important point. So as part of the CDMC development, we built this information model, and it was soft launched.
Juan Sequeda: So what does the information model consist of? What are the main concepts or something?
Ben Clinch: Great. So it's described in RDF, the Resource Description Framework, as a true knowledge graph. And it brings together... Well, it started from the CDMC. So we took the entities out of the specification for the framework and we started to map those as a graph. And then, we mapped them against existing W3C standards like Egeria, DCAT, DQV, PROV-O. I recommend looking these up. For the listeners who aren't familiar, these are established W3C standards that are used around the world but not always together. And this actually stitched those towards a really beautiful metamodel, which is free to use and is scalable and actually has documentation that describes how all of these things relate back to the original controls from the CDMC. Now to me, that's an incredibly powerful tool because I've talked to lots of people about using flexible data catalogs and flexibility is really important, I think. So I'm not necessarily saying people should have to be constrained by the information model-
Juan Sequeda: They should be able all go [inaudible]-
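For listeners who have not seen RDF, a graph is just a set of subject-predicate-object triples. The sketch below is not the CDMC information model itself; it only assumes a toy graph held as plain Python tuples, reusing a few real vocabulary terms (`dcat:Dataset` from DCAT, `prov:wasDerivedFrom` from PROV-O, `dqv:hasQualityMeasurement` from DQV). Everything under the `ex:` prefix is a hypothetical example resource.

```python
# A toy metadata graph as (subject, predicate, object) triples.
TRIPLES = {
    ("ex:billing_report", "rdf:type", "dcat:Dataset"),
    ("ex:billing_staging", "rdf:type", "dcat:Dataset"),
    ("ex:billing_raw", "rdf:type", "dcat:Dataset"),
    ("ex:billing_report", "prov:wasDerivedFrom", "ex:billing_staging"),
    ("ex:billing_staging", "prov:wasDerivedFrom", "ex:billing_raw"),
    ("ex:billing_report", "dqv:hasQualityMeasurement", "ex:completeness_check_1"),
}


def objects(subject, predicate, triples):
    """All objects linked from `subject` by `predicate`, in sorted order."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def lineage(dataset, triples):
    """Every upstream dataset reachable by following prov:wasDerivedFrom."""
    seen = []
    frontier = [dataset]
    while frontier:
        current = frontier.pop()
        for parent in objects(current, "prov:wasDerivedFrom", triples):
            if parent not in seen:
                seen.append(parent)
                frontier.append(parent)
    return seen
```

Because catalog, provenance, and quality statements share one graph, a single traversal can answer "where did this report come from?" while the same subject also carries its quality measurements. In a real project a dedicated RDF library and SPARQL would replace these helper functions.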
Ben Clinch: ...because they don't know where to start.
Juan Sequeda: Yeah. This is our starting point, don't reinvent the stuff that we've done. So Egeria has been one of those standards around governance. And then, DCAT is a data catalog vocabulary that takes all the data sets and stuff. PROV-O is about provenance, where things were derived from. So you'll be able to combine this to go define, it's basically a metadata model that is the underpinning for any type of, anything you need, any resource that you need to catalog, and that's the governance over it.
Ben Clinch: Absolutely. It's almost a blueprint for a digital twin in some regards. So, for those unfamiliar, it's not going to map out the domains. So if you have a billing domain, it doesn't have domain-specific knowledge. This is kind of the framework that everything hangs off. And so, you still need domain experts to be able to articulate what you need to do for billing. The great thing is there are lots of schemas out there as well that you can start from when you're developing those. Or, as I say with those, start from them; don't constrain yourself to somebody else's view of how your business works.
Tim Gasper: Yeah. I feel like a lot of folks aren't familiar with the amount of prior art and pre-established schemas that are out there for things like governance and things like cloud infrastructure and things like that. How do you go about, whether it's through the EDM Council or through other things, how do you go about educating folks about what's out there?
Ben Clinch: In terms of the kind of established structures?
Tim Gasper: Yes. Yeah.
Ben Clinch: So the first thing really, honestly, is to get people trained. I think first of all, one is get them trained in the particular cloud vendor that they're using, the hyperscaler, so they understand what kind of databases they have and what kind of capabilities they have. I think also getting people to think in terms of cloud-agnostic solutions. So I hear a lot about cloud-native solutions and I'm all for cloud-native solutions, but what that is often confused with is cloud-specific solutions. So if you have a portability requirement, like many of the financial services do, that you need to be able to demonstrate you can move from one cloud to another, and you have a cloud-specific solution, that's going to be an immediate constraint. So now that's not to say that something that is developed by one hyperscaler has to only work on their system. Some of them are designed to actually work in such a way that they can harvest metadata across different environments, including on-prem. And so, I think one of the things I really encourage people to look at is the interoperability of the tools that they're using. I mean you can probably see a theme here, the interoperability or the flexibility associated with regulation changing, the flexibility of the data model changing. Because like an org chart, nobody says, "When is the org chart done?" Never going to finish. It's the same with a data model. It's a living, breathing thing that evolves with you.
Tim Gasper: Yeah. It's not a project, it's an ongoing capability, it's an ongoing asset.
Ben Clinch: That's it. And so, in the same way, your architecture should be interoperable. And that's why I embrace standards like RDF, where it actually allows a lot of different components to interact in a very consistent manner against an established standard. Yeah.
Juan Sequeda: So one of the questions I'm thinking is when should I start? Do I have to get to some state to be thinking about this or... Because this is the things I always talk about, the balance of being efficient and resilient. I'm like, " I need to do these things fast, so I'm not going to go invest in doing all this stuff." But then later on I am scaling and I'm like, "Oh, I have all this set." So what are your-
Ben Clinch: What a fantastic question. So first of all, start now wherever you are on your journey, and how you go about that is, first of all, establishing what's important. So one of the things that's not often underlined enough is understanding your business architecture. So what are you trying to achieve? And codifying that, again, preferably in machine-readable diagrams, for example. So having been an architect, we often joke about drawing lines and boxes. But if those lines and boxes aren't described as code, then you can't measure your organization against that. So first of all, what are you trying to achieve as a business? What's your business strategy, your business outcomes? What business capabilities and processes support that? And therefore, the context of the data that you require. So which business capabilities are producing data and which ones are consuming them? And if you can articulate that, again by what we call declaring critical data, it's critical for a purpose. And that purpose could be regulation or regulatory reporting in the case of BCBS 239. But it equally could be for great customer service or effective billing or revenue assurance and these types of activities. So start by understanding your business and the purpose for the data and then you can prioritize. Because each of those business capabilities has a value that you can put money against. It might be defending revenue, it might be generating revenue, it might be saving money, it might be managing risk, and that could be regulatory risk, it could be reputational risk, et cetera. You can quantify these things. And then, you can say this is the priority for where we're going to start because you start with the thing that's most important, and you build out from there.
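As a rough illustration of what describing those "lines and boxes" as code might look like, here is a minimal Python sketch. It was not discussed on the show: the Capability class, the field names, and every figure are hypothetical. It assumes each business capability carries a monetary value and declares the data it produces and consumes, so critical data elements can be ranked by the value that depends on them.

```python
from dataclasses import dataclass, field


@dataclass
class Capability:
    """A business capability and the data it produces and consumes (hypothetical model)."""
    name: str
    purpose: str          # why the data is critical, e.g. "regulatory reporting"
    annual_value: float   # illustrative estimate of the value at stake
    produces: set = field(default_factory=set)
    consumes: set = field(default_factory=set)


def critical_data_priority(capabilities):
    """Rank data elements by the total value of the capabilities that depend on them."""
    totals = {}
    for cap in capabilities:
        for element in cap.produces | cap.consumes:
            totals[element] = totals.get(element, 0.0) + cap.annual_value
    # Highest value first; ties broken alphabetically for a stable order.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up capabilities and values, purely for illustration.
capabilities = [
    Capability("Billing", "effective billing", 5_000_000,
               produces={"invoice"}, consumes={"customer_address"}),
    Capability("Risk reporting", "regulatory reporting (BCBS 239)", 8_000_000,
               consumes={"exposure", "customer_address"}),
]

ranking = critical_data_priority(capabilities)
for element, value in ranking:
    print(f"{element}: {value:,.0f}")
```

In this toy data, the element shared by both capabilities ranks first, which is one simple way to decide where to start.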
Juan Sequeda: Hold on. I love how you... I mean, we always say like, "What is the business value?" So start with the business, understand the business, but what you should be doing too is that when you're doing that process, codify that understanding of it. And that maybe is, right now, we're focused on regulation. Well, codify that regulation. Those policies need to go do. I love this. Codify is a key word today. You wanted to follow up.
Tim Gasper: So before we start to move into our coveted lightning round, I do want to ask you one more thing, and I want to bring it back to where we started this whole conversation around automating governance. And I want to ask you, as you look at what we need to accomplish from a governance standpoint, and you look at business organizational capability and you also look at the technology that's available, what can you automate now? This is sort of a two-part question. What can you automate now that's worthwhile? And can you also tell me, what should you not be automating? What do people think maybe you can automate, but that's like, "We're not there yet."
Ben Clinch: Yes. So that's a really good question. I think some of the things you can automate now are the rules and the business logic and actually starting to detect metadata in a more scalable manner. So there are capabilities out there, technology capabilities that can help you identify some of that data that we defined as important for those business purposes. So you can say, "We'll program it once, preferably." I mean, I'm a big fan of using integration pipelines as a means of actually sampling and tagging that data with the semantic meaning of that data. That's metadata, so to speak. And there are so many different ways you can do that. You can start with something as simple as regex, which is the shape of the data. And that gives you a hint of-
Juan Sequeda: Well, just start simple. You don't need to do any sophisticated stuff inaudible.
Ben Clinch: But then you can do something as complicated as fingerprinting, which is effectively, I've seen this many data values before in a different data set that was already tagged, and therefore I can infer with a certain level of probability that it's most likely the same semantic meaning that I can tag it with. Now, those tags all become perfect metadata to drive these business rules. But also, I mean increasingly the ability, and we touched on it earlier, large language models are great at doing some of this stuff, but it's an early stage of that and the ability to be able to make sure that those are monitored and are giving the right values that's...
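To make the two detection techniques concrete, here is a minimal Python sketch. It is an illustration rather than anything presented on the show: the patterns, thresholds, tag names, and function names are all assumptions. It tags a sampled column by the "shape" of its values with regex, and separately infers a tag from overlap with values already seen in tagged data sets, returning a score so a person can review low-confidence guesses.

```python
import re

# Hypothetical patterns: the "shape" of the data hints at its meaning.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "uk_postcode": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"),
}


def tag_by_regex(values, threshold=0.8):
    """Return tags whose pattern fully matches at least `threshold` of the sample."""
    values = [v for v in values if v]
    if not values:
        return []
    tags = []
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= threshold:
            tags.append(tag)
    return tags


def tag_by_fingerprint(values, tagged_columns, threshold=0.5):
    """Infer a tag from overlap with values of columns that were already tagged.

    Returns (tag, score) for the best overlap at or above `threshold`, else None.
    The score is the share of distinct sampled values seen before under that tag.
    """
    sample = set(values)
    if not sample:
        return None
    best = None
    for tag, known_values in tagged_columns.items():
        score = len(sample & known_values) / len(sample)
        if score >= threshold and (best is None or score > best[1]):
            best = (tag, score)
    return best


# Illustrative samples, as an integration pipeline might collect them.
email_sample = ["a@example.com", "b@example.org", "not-an-email",
                "c@example.net", "d@example.com"]
postcode_sample = ["SW1A 1AA", "EC1A 1BB", "M1 1AE"]
known = {"country_code": {"GB", "DE", "FR", "ES"},
         "currency_code": {"GBP", "EUR", "USD"}}

print(tag_by_regex(email_sample))
print(tag_by_regex(postcode_sample))
print(tag_by_fingerprint(["DE", "FR", "GB", "XX"], known))
```

Keeping the score alongside the inferred tag leaves room for the human review that the conversation goes on to recommend.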
Juan Sequeda: Yeah. So then to the other side of the question, what should you not be automating yet?
Tim Gasper: What should we not be automating?
Ben Clinch: Well, for example, unattended semantic discovery I think is something that's still early days. I think you want to have a real sense of certainty associated with probability of that. I think also-
Juan Sequeda: So I mean, that's something that we still need to go talk to people, figure out what this is.
Ben Clinch: Exactly, and automating data models. Now, we can augment and speed up data modeling using some really, I mean, large language models are perfect for that kind of stuff. But you want to make sure that that's only accelerating you because you don't want to outsource your understanding of your organization to a third party, in my view.
Tim Gasper: I like that quote there. You don't want to outsource your understanding of your own organization.
Ben Clinch: It would be like outsourcing your org chart. Can you imagine somebody saying, "I always-"
Tim Gasper: Unfortunately, some companies do that.
Ben Clinch: "...come back to the org chart." So that might, again-
Tim Gasper: But we won't talk about consultancies inaudible.
Ben Clinch: I'm always operating at very huge scales. But yes, and you can get advice from other people.
Tim Gasper: inaudible. There's a possibility that this would be serious.
Juan Sequeda: These large language model is like productivity. It's also advice because the input that you made inaudible-
Ben Clinch: Advice is great.
Juan Sequeda: So this is a good segue into our AI minute. So one minute to rant about AI, whatever you want. Ready, set, go.
Ben Clinch: Fantastic. So AI is something I've really always been passionate about. So I did that at university, wrote back propagation, neural networks, and algorithms. And to some degree, I wish I'd done more of it between now and then because it's really taken off. What I would say is that these things are incredibly powerful and augmenting what we do. But they are not something yet that we need to be overly scared of in terms of completely replacing people. Or if we do, we need to be ready for the consequences associated with it, which is that it's incredibly powerful stuff. We need to be harnessing this for our benefit but be mindful that it needs to be fed with good data. And if it's not fed with good data, which is all about the cataloging we're talking about in metadata, then don't be surprised if it doesn't give you the right answers. We need to empower the LLMs with the knowledge and the context that we've been discussing today.
Juan Sequeda: Perfect. Beautiful timing.
Tim Gasper: That was great. Garbage in, garbage out.
Ben Clinch: That's it.
Juan Sequeda: That's it.
Ben Clinch: It's still-
Tim Gasper: So true, right?
Ben Clinch: Yep.
Juan Sequeda: So let's head to our lightning round presented by data.world, and I'm going to kick it off first. So just quick, yes or no, a little bit of context needed. So does your metadata model have to be fully designed before you can start automating governance?
Ben Clinch: No, but the more structure that you have in the metamodel, the better. So no.
Tim Gasper: All right. So next question, would you say that as regulations are codified and as things like the EDM Council, the work streams and the standards become more popularized, governance is now becoming mostly cookie cutter? Is it over 50% cookie cutter?
Ben Clinch: No. And should I qualify that?
Juan Sequeda: Yes. Well, you can.
Tim Gasper: Yes. Yeah.
Ben Clinch: It's moving in that direction for the mundane, but actually the real opportunity is in the offensive data marketing, which is always linked to your business purpose, which is always unique to your organization.
Tim Gasper: Yeah. The purpose is always unique.
Juan Sequeda: The purpose. Yeah, that's unique. All right, next question. Is governance metadata a good place to start for your first knowledge graph?
Ben Clinch: Yes.
Juan Sequeda: A lot of people are doing knowledge graph. But the thing about your people, about the...
Ben Clinch: Okay, so I'm going to qualify that 'cause I'm very into data management, but actually knowledge graphs are incredible for fraud management, for IT asset management, for network resilience. So all of those are beautiful. Data management is also a great place to start. What I would say is all of those use cases should be coordinated in such a way that they're building out your enterprise knowledge model. So let's do it with RDF or labeled property graphs. But it's something that you know are going to-
Juan Sequeda: Don't do it in silos.
Ben Clinch: Don't do it in silos 'cause then you're building out your own.
Juan Sequeda: More applications that happen to be in a graph. Yeah.
Tim Gasper: Ideally, your knowledge graph is cohesive, and these things are coming together.
Ben Clinch: Exactly, right. People are kind of painting the picture for you bit by bit. And while making money and making business benefits from their individual use cases built on graph.
Juan Sequeda: Final question, Tim.
Tim Gasper: Should all data organizations aim to automate their governance?
Ben Clinch: Yes, but well, augment. I think full automation, I'm a big fan of the concept of metadata ops. So this is some people who have the responsibility for ensuring that the automation is actually serving the purpose of the business. But yes, everyone should be striving for automation around this to the extent that it is.
Tim Gasper: Can I interpret that as, so when you say augment, keep humans in the loop?
Ben Clinch: Yes.
Tim Gasper: But automate as much as you can with that caveat.
Ben Clinch: 100%. Work smart with technology.
Juan Sequeda: All right. Tim, take away time.
Tim Gasper: This is an awesome conversation. I knew it would be. So some quick takeaways 'cause I know we've got some delicious dinner coming up here shortly. You can automate your governance and you should. And this is the direction that we're moving in as data organizations, as governance in general. And you mentioned that really semantic metadata is progressing, and this is a gradual process, but it's getting to a point where now we can automate a lot, and more on automation later in the takeaways. You started with the history. You said that in 2008, the collapse of Lehman's and the regulations that happened in the States, that happened here in the UK and across the world, it really was kind of a trigger point to say, "Look, we can't just hope that you're going to do the right things around data. We're going to mandate it." And the initial response by organizations, especially in the financial industry, 'cause that's where it applied more, was pretty manual in nature. Stewards going in and doing a lot of work in spreadsheets or databases and things like that. Of course, now we know there's technology that's emerged to help solve a lot of this, but the scope initially was semi-manageable because it was a very focused set of processes and data. But data has gotten more complicated, governance has gotten more complicated, and now not just financial institutions want to do governance. So it's becoming a bigger thing. To automate, you probably need to start manually, figure out what you need to achieve, but you want to try to automate from there. You mentioned accountability's important. You mentioned also that regulators are really starting to push more towards regulation as code. And this is a good thing for everyone because as we codify, it means it's going to be really easy to interpret. It becomes more explicit. If this data is related to a German citizen, then yada, yada, yada. And that's going to make things a lot clearer for everyone.
A lot less vague, a lot less open to interpretation. And you mentioned that especially things like data privacy, GDPR. These are a lot of the emphasis of how kind of governance is focused these days. And we talked through different aspects there. And before I pass it to Juan, you started to talk a little bit about the path to automation, which is, you have to see automation as answering a query. As things become more codified, then we can use the metadata to actually drive the automations around governance and regulation. And you should really think about organizing your metadata into one information model. And that information model should be a knowledge graph, should be RDF. With that, over to you, Juan.
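A small Python sketch of what "automation as answering a query" could look like once metadata lives in one model. This is an illustration only: plain tuples stand in for RDF triples (a real deployment would more likely use an RDF store and SPARQL), and the data set names, predicates, and the residency rule itself are invented for the example rather than taken from any actual regulation.

```python
# Metadata as subject-predicate-object triples (a simplified stand-in for RDF).
# Every name and value below is hypothetical.
TRIPLES = {
    ("crm.customers.email", "hasTag", "personal_data"),
    ("crm.customers.email", "subjectCountry", "DE"),
    ("crm.customers.email", "storedIn", "us-east"),
    ("billing.invoices.total", "hasTag", "financial"),
    ("billing.invoices.total", "storedIn", "eu-west"),
    ("hr.staff.email", "hasTag", "personal_data"),
    ("hr.staff.email", "subjectCountry", "DE"),
    ("hr.staff.email", "storedIn", "eu-west"),
}

ALLOWED_REGIONS = {"eu-west", "eu-central"}  # assumed policy parameter


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def residency_violations(triples):
    """Invented rule as code: personal data about German subjects
    must be stored in an allowed region."""
    violations = []
    for subject, _, _ in match(triples, p="hasTag", o="personal_data"):
        if not match(triples, s=subject, p="subjectCountry", o="DE"):
            continue  # the rule only applies to data about German subjects
        for _, _, region in match(triples, s=subject, p="storedIn"):
            if region not in ALLOWED_REGIONS:
                violations.append((subject, region))
    return sorted(violations)


print(residency_violations(TRIPLES))
```

Because the rule is just a query over metadata, a change in regulation becomes a change to the rule, not a new manual review of every data set.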
Juan Sequeda: Well, I think the keyword here, and I've said it before, it's codify. We need to have that on a T-shirt. Like codify, right? Codify. And then, you're coding this and you're coding that. Those things are linked together into a graph, and then that's where the knowledge graphs come in. And I love how you say like, "How do we explain the knowledge graphs to everybody? Mind maps is one way to think about it." And then, we start seeing that, first of all, we've all had schemas, we've all thought about this stuff, but then the NoSQL movement took the pendulum to one side and now we're bringing it back. And I love your analogy, your examples like data models are important. And one way to explain is just look at your org chart. Do you ever think about like, "No, that org chart is not important. That org chart does not provide ROI."
Tim Gasper: What is the ROI of the org chart?
Juan Sequeda: What is ROI for that? No. Imagine you didn't even have one. You need to know who you report to, who reports and who to contact, who's related. There's so much context in there that you almost take it for granted. This is an important thing. You can't even imagine a world where you wouldn't have that. And then, we really dove into the EDM Council. So, this is incredibly valuable work. It's a non-profit. So many different large organizations in the world have gotten together to develop DCAM, which is a big framework that has been heavily informed by all the bank regulations. And then, the CDMC, this has created a standard to repeat all these requirements from all the hyperscalers around, have this information model that really has combined all these existing standards from data cataloging governance like Egeria, DCAT, the data catalog vocabulary from the W3C, provenance. And all of this is in RDF and open standards, which effectively is this blueprint of your digital twin that you can go do that provides all that metadata. But all this infrastructure that you have within the organization. How do we become aware of these things out there right now? So your recommendation, let's get trained on cloud solutions, but also realize it's not just being specific on a cloud, but we also think about you need to be portable. People need to go move around things. So again, more keywords here is think about interoperability and flexibility and all this stuff is changing. Regulation, your data's changing, your regulations are changing. And when do you start? Start now, there's nothing stopping you from doing this. And the way to start is to really understand your business strategy, understand the business outcomes, your architecture. And this is going to help you understand what is the purpose of all of this. Is it for regulation? Is it for effective billing, for revenue assurance? Understand that purpose.
And the moment you're understanding that business, codify that. This is the beautiful moment to truly have it in code such that you can start automating more of this. So to wrap up, we said, "What should you automate now? What shouldn't you?" What to automate now is all the rules, the business logic, all these detections that can be done, and you can start with simple things like regex, and then later get complicated with fingerprinting. And what you shouldn't be automating is really the semantics, extracting the knowledge. You need to go talk to people and you need to go figure out what those data models are in people's heads, how the business is working. Yes, there are tools to augment you, to make you more productive, but don't outsource that. You do not want to outsource your understanding of your own organization.
Ben Clinch: Fantastic.
Juan Sequeda: How did we do? Anything we missed?
Ben Clinch: That's an amazing summary. I think you got all that.
Juan Sequeda: That was all you. All right. So wrap it up because we got dinner coming soon, is three questions. What's your advice? Who should we invite next and what resources do you follow?
Ben Clinch: Well, first of all, yeah, my advice is very much in line with this. Get familiar with Knowledge graph. Get familiar with the EDM Council.
Juan Sequeda: Love it. Knowledge graph, EDM council. Who should we invite next?
Ben Clinch: Piethein Strengholt.
Juan Sequeda: Who?
Ben Clinch: Piethein Strengholt.
Juan Sequeda: Okay.
Ben Clinch: Absolutely. I'll send you the spelling.
Juan Sequeda: Love it.
Tim Gasper: Okay.
Ben Clinch: And what I would recommend is his book Data Management at Scale, the second edition that has just recently come out. I think it really underlines a lot of the stuff we've discussed.
Juan Sequeda: And to wrap up, what resources do you follow? Books, podcasts, newsletters, people?
Ben Clinch: Well, I'm a big fan of Piethein. I'm a big fan of semantic web training as well. So these kind of resources, I can send you a list to add to the show.
Juan Sequeda: All right. Well, Ben, thank you so much. Just quick, next week, I'm going to be on vacation. I need to take a break, but we will actually record. We'll have a Tim and Juan rant on what's been going on the last day.
Tim Gasper: We'll do a little bonus episode.
Juan Sequeda: We'll do a bonus episode and all that. But with that, Ben.
Ben Clinch: I look forward to it.
Juan Sequeda: Thank you so much.
Tim Gasper: Cheers.
Juan Sequeda: Appreciate it. Cheers.
Speaker 1: This is Catalog & Cocktails. A special thanks to data.world for supporting the show. Karli Burghoff for producing. John Williams and Brian Jacob for the show music. And thank you to the entire Catalog & Cocktails fanbase. Don't forget to subscribe, rate and review wherever you listen to your podcast.