Open source software is one of the largest and fastest-growing segments within data and analytics. That has led to a proliferation of AI/ML tools and data services. What role and responsibility do companies have with respect to algorithm transparency and how data is used?
Catalog & Cocktails hosts Juan Sequeda and Tim Gasper were recently joined by Denise Gosnell, CDO of DataStax, to talk about the business of open source and how community-centric data applications are reshaping the enterprise. Below are a few questions excerpted and lightly edited from the podcast.
Honest, No B.S. question: What is the future of open source data?
Denise Gosnell: I mean, if I had to really pick one central theme, I would love for it to be much more transparent, and when I say “transparent”, I mean transparency from the full data stack. All the way from just being able to understand how you fit into these algorithms, when you're a consumer of different data services or people are using your data, to having transparency and observability into how the algorithms are performing. So, to me, the future of open source data is really going to be transparency. I think our entire system right now is very opaque in how data is used, and I would just love for it to start to be a little bit more observable.
Who are the consumers of these open source tools?
Denise: The first place I would look would be just the space of enforcing GDPR. If you are supporting customers in the EU, you better be doing this. But you have to have pathways to enable any user to request all their information and delete it, right? That's GDPR. I think one of the first ways that we could create more observability around that question would be a tool for your data teams. It would be a way to more easily track and have the ability to have that provenance information readily available so that you can answer that question much more quickly, because if I understand it now, it's mainly customized pipelines that people are making in order to answer those questions.
Tim Gasper: I love things like dbt. [They] are getting a lot more traction in the community because, obviously, dbt at its core is an open source project. As people start to build more open source componentry into their stack for data pipelining, I feel like that's going to lend itself more towards this ecosystem expanding, because it'd be nice if all these things could really work well together...
Data ecosystems are expanding...does that sound right or not?
Denise: There needs to be a new word for “data observability of systems.” [It’s going to be] essentially “traceability of data.” When I'm thinking about transparency and visibility into data lineage or data provenance, I'm kind of thinking that there would be a way to understand every component of the system that that piece of data has traced through, and that's essentially at the end of the day, the question you needed to answer for GDPR. For this customer, I need a quick map of everywhere that their data has touched.
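To make the "quick map" idea a little more concrete, here is a minimal sketch in Python of what recording and querying that traceability could look like. It is an illustration only, not something described on the podcast and not the API of any real lineage tool: the `LineageLog` class, the customer IDs, and the system names are all hypothetical, and a real deployment would persist these records rather than hold them in memory.

```python
from collections import defaultdict


class LineageLog:
    """Toy in-memory record of which systems each customer's data has passed through.

    Hypothetical helper for illustration; not part of any real library.
    """

    def __init__(self):
        # customer_id -> list of (source_system, target_system) hops, in the order recorded
        self._hops = defaultdict(list)

    def record(self, customer_id, source, target):
        """Note that data for customer_id moved from source to target."""
        self._hops[customer_id].append((source, target))

    def systems_touched(self, customer_id):
        """Return every system the customer's data has touched, sorted by name."""
        touched = set()
        for source, target in self._hops.get(customer_id, []):
            touched.add(source)
            touched.add(target)
        return sorted(touched)


# Hypothetical pipeline hops for two made-up customers.
log = LineageLog()
log.record("cust-42", "signup_form", "orders_db")
log.record("cust-42", "orders_db", "warehouse")
log.record("cust-42", "warehouse", "churn_model")
log.record("cust-7", "signup_form", "orders_db")

# "For this customer, I need a quick map of everywhere that their data has touched."
print(log.systems_touched("cust-42"))
```

With hops recorded at each pipeline step, answering a GDPR-style request becomes a lookup instead of a custom pipeline built for each question.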
Tim: It seems like maybe we've got some of our foundation now. We've got our databases and our data warehouses. We have some of our data pipelining. We've got our orchestration, with things like Airflow and Dagster. So maybe the next frontier here is really more the meta layer, the observability, the transparency, the lineage. It's understanding, and the more that enterprises and companies can understand their own data, perhaps, the better that they'll actually be able to help consumers understand what the hell is being tracked about them, because if a company can't answer the question themselves...
How are [companies] going to then actually serve the consumer in this situation?
Denise: Yeah, absolutely. To your point, I think that there is a growing motivation to make this more tractable inside businesses and eventually for consumers. I think we're getting more pull today from the consumer side, and that's where you're starting to see the explainable AI movement really come into play, with people wanting to understand what this algorithm is doing for me and how do I really fit in.
We feel that anytime we're on the internet today: it feels incredibly tense to have a conversation because a picture has been painted that there's side A versus side B. But that just happens to be a byproduct of algorithms that are optimized for click-through rates and time on page. It creates this silo and division that actually might not exist in the real world, but that's a part of our digital experience now. So yeah, Tim, I just thought that your commentary on that really fit in with where explainable AI is being pulled from us, the data practitioners and tool builders, by the consumers.
What open source tool best personifies your personality, Denise?
Denise: Yeah. Well, I mean, I can't answer “graph” because that's going to be way too obvious. So I'm going to go with Apache Airflow, and I'm going with [that] because it's connecting data sources together for managing your data, your data tasks, etc. I mean, who doesn't love connected data? I, of course, love connecting data, so I'm going with Apache Airflow. I freaking love it.
Key takeaways
- We need data observability and data traceability
- We should be pushing back on how our data is actually being used
- Figure out what empathy looks like for yourself and for others
Visit the Catalog and Cocktails page to listen to the full episode with Denise, catch up on any prior episodes you might have missed, and see upcoming guests and topics.