With the hype of graph databases and knowledge graphs, a common (mis)practice is to quickly migrate your existing siloed data into a graph database. But be careful! You may just be bringing the complexity of your silos into the graph.

Tim and Juan were joined on the Catalog & Cocktails podcast by Jans Aasman from Franz Inc, the makers of AllegroGraph, for a conversation on why your graph-based machine learning and 360 projects should start with data modeling. Below are a few questions excerpted and lightly edited from the show.

Juan Sequeda:

Honest, no-BS question: why don’t we talk about data modeling?

Jans Aasman:

When we do projects with customers, we always start with data modeling, but let me give you another answer. I’m a psychologist. I accidentally got into technology and became a CEO, but at heart I’m still a psychologist. You have to imagine that the very intelligent people who put together schemas for relational databases don’t really care whether other people can read the schema. So they use abbreviations. And now people want a very easy method to untangle the craziness of the schemas they had in relational databases.

But the problem is there was so much human intelligence, that is most of the time undocumented, that went into making that schema. And now people hope that there’s an easy tool to untangle that again. But you need the same amount of human intelligence to do that, maybe even more. It’s like reverse engineering sometimes, especially if you don’t know that new enterprise data warehouse that you suddenly have to get data out of. So data modeling is very complicated and you can’t make it systematic. 

Then I taught data modeling. I mean, I taught object-oriented software engineering when I was teaching at the university in Delft. And that’s actually the most important modeling technology I ever learned: starting with the models, the stakeholders, what the use case is. It’s the interaction model together with the analytical model.

And even now when I help people who want to do modeling and take the data they have in their silos and put it into a knowledge graph, I say, “Okay, the first thing I want you to do is take a really deep breath and forget all about Protégé and TopBraid Composer and ontologies and AI and all of that.” If you are a software engineer and you did object-oriented software engineering, you’ve got everything you ever need. And the other thing I don’t do is start from the top down with very complex logical models.

I really find it terrible to see people that start building an ontology without knowing what the application is going to be, or the data model. The data model is always a function of the questions you want to answer.

Tim Gasper:

Can you go into a little more detail about what you mean when you say you prefer bottom-up modeling?


Jans:

So I’m a Lisp person. I mean, we have a Lisp company here and we sell Lisp compilers. There are some people who start with the top-level function, work out the top three steps, then go to the first step and break that into three steps, top down, really trying to build the tree. Whereas what is way more natural, if you’re in a particular domain, is to write tiny sub-functions that you think you’re going to need. You try them out in your language. You see whether the lowest-level functions actually work. You make a language that’s very specific to the domain you’re trying to solve. And then you can go back to the top level and express what you want to solve in that domain language you built.

It’s a human problem-solving process. Doing it totally top-down is something that probably only Java programmers can do.
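The bottom-up style Jans describes — write tiny domain functions first, test them, then express the top-level goal in the vocabulary you just built — can be sketched in a few lines. The domain and all function names below are invented for illustration; this is a sketch of the methodology, not anything from AllegroGraph:

```python
# Bottom-up: small, testable domain functions first...
def parse_record(line: str) -> dict:
    """Turn one raw 'name, amount' line into a dict."""
    name, amount = line.split(",")
    return {"name": name.strip(), "amount": float(amount)}

def is_large(record: dict) -> bool:
    """A domain predicate we expect to need."""
    return record["amount"] > 100.0

# ...then the top level reads like the domain language we just built.
def large_customers(lines: list[str]) -> list[str]:
    return [r["name"] for r in map(parse_record, lines) if is_large(r)]

print(large_customers(["Acme, 250", "Smallco, 10"]))  # ['Acme']
```

Each small function can be tried out interactively on its own, which is exactly the Lisp REPL workflow being contrasted with designing the whole tree top-down.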

Juan:

I guess that’s why we don’t do it, because it’s not easy. What do you think?

Jans:

Well, that’s what I started with. Modeling is human problem solving. Part of it is symbolic, part of it is based on experience, part of it could be explained by neural networks. But it’s a very complex human activity, and I have not seen any technology that could really help. All these beautiful UI-based systems that can automatically do the mappings work for the first 70%. And then programming gets involved: combinations of objects, and if this is in the object, then we want to go there.

So suddenly, with this beautiful tool that you built, you have to add programming, JavaScript or Java or whatever else, to your ETL tool. Suddenly it’s a very complicated thing and you’re back to programming anyway. So I’m radically in favor of just using programming for data modeling. Now I’m making a distinction between data modeling and ETL, but they’re closely related of course.

Juan:

Can you give us your definition or explain how [the entity event model] works?


Jans:

To begin, I think in terms of the old-fashioned AI frame-based systems, the early version of object-oriented systems, where an object is just a set of triples with the same subject. I really think in terms of objects to begin with. [An example] is healthcare. So I might have an in-patient encounter: I go to the hospital and I check in, and then, say, four hours later or 40 days later, I check out. That was one event. Now that event is an object with a start time, an end time, a type, and then a few other key-value pairs that might describe the event.

But the event also has sub-events. I went to this specialist. And then this specialist made this particular diagnosis. So the diagnosis is an event where, usually, you don’t have an end time; it’s just the time of that particular diagnosis. And then you get something prescribed, or you get a particular procedure. Again, the symptom, the procedure, or the medication order are just objects with a time (not always an end time), a type, and some key-value pairs. The shape of the objects is always the same: an object with a type, a start time, an end time, and a few other things that make that event a little bit different, but the shape is always the same, that simple object. And you also look at it as a temporal object.
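The uniform event shape Jans describes can be sketched as a small data structure: every event, whether a hospital encounter or a diagnosis, gets the same fields — a type, a start time, an optional end time, a few key-value pairs, and possibly sub-events. The class and field names below are illustrative only, not AllegroGraph’s API:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Event:
    """One uniform shape for every event: a type, a start time,
    an optional end time, descriptive key-value pairs, and sub-events."""
    type: str
    start: datetime
    end: Optional[datetime] = None          # a diagnosis often has no end time
    properties: dict = field(default_factory=dict)
    sub_events: list = field(default_factory=list)

# An in-patient encounter with a diagnosis and a medication order as sub-events.
encounter = Event(
    type="InpatientEncounter",
    start=datetime(2021, 3, 1, 9, 0),
    end=datetime(2021, 3, 5, 14, 0),
    properties={"hospital": "General"},
    sub_events=[
        Event(type="Diagnosis", start=datetime(2021, 3, 1, 11, 30),
              properties={"code": "J18.9"}),
        Event(type="MedicationOrder", start=datetime(2021, 3, 1, 12, 0),
              properties={"drug": "amoxicillin"}),
    ],
)

# The guarantee Jans emphasizes: every event has a start time,
# even when the end time is absent.
assert all(e.start is not None for e in [encounter, *encounter.sub_events])
```

In a triple store, each such object would simply be the set of triples sharing one subject, which is why the frame-based view maps so directly onto a graph.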

So you can imagine the chaos that [these events] give. In just one enterprise data warehouse, they have 250 ways to describe time. It’s fairly systematic, but a human being still has to remember 250 ways of thinking about time. With the entity event model, you are guaranteed that there’s a start time and an end time, or maybe no end time, but there’s always a begin time. That makes it really simple.

Tim:

When you think about data-centric knowledge and entity event models, are these things connected, in the sense that your data-centric foundation might be based on this type of approach?

Jans:

Well, the answer is yes, that’s easy. It’s specifically built to support many different types of use cases, although when you look at our approach, the entity event model usually covers 90% of the data. Then there’s 10% of the data that is just impossible to chart. In healthcare, it’s the 180 taxonomies and ontologies that we use. There’s no way you can chart that. So we have a model to deal with that in this approach. And for almost every application, I can come up with the 10% that cannot be charted. So you need to mix the entity event approach with something that’s more of what we traditionally call “knowledge.”

Key Takeaways:

  • It’s “terrible” to start creating an ontology without knowing the application
  • Intelligent people make the schemas… this is not easy
  • Modeling is human problem solving!

Visit Catalog & Cocktails to listen to the full episode with Jans. And check out other episodes you might have missed.