Why Data Architecture Matters For AI Workloads | Rob Schmit, Egen

In this episode of “An Eye on AI,” we sat down with Rob Schmit, Principal Architect at Egen, to discuss data platform requirements for AI implementations. The discussion addresses how organizations can build data foundations that support AI initiatives beyond model training.

Schmit argues that organizations using off-the-shelf AI models still need robust data platforms to evaluate model performance in their specific environments. Standard evaluation datasets like SWE-bench provide general model comparisons, but organizations need their own data to test models against their use cases and workflows.

The conversation covers several implementation approaches. Data mesh strategies work for complex organizations with multimodal data from different sources. Traditional centralized analytics teams remain effective for many use cases. The key factor is having domain expertise available within the organization rather than forcing teams to learn data characteristics from scratch.

Unified Lake House architectures provide consistent storage patterns that teams across an organization can be trained on. However, Schmit emphasizes that the approach requires more than just data storage—organizations need comprehensive enablement platforms with automated CI/CD, testing frameworks, and data catalog systems.

The discussion addresses data governance challenges, including access control across different data layers and establishing clear roles for data producers, transformers, and consumers. Organizations need policies that define acceptable usage patterns while enabling data access requests and support processes.

Technical decisions require evaluation of both implementation paths and exit strategies. Platform choices made six months ago may need revision as AI capabilities evolve from text-based RAG systems to multimodal approaches including audio, video, and multi-agent workflows.


Edited Transcript

Swapnil Bhartiya (0:00): Welcome to a brand new episode of AI on AI, and I’m your host, Swapnil Bhartiya. Today, we are diving deep into the critical intersection of data architecture and AI potential. As organizations navigate increasingly complex data landscapes, the question emerges: how can they build robust foundations that truly unlock AI value?

Joining me today is once again Rob Schmit, Principal Architect at Egen. Rob brings invaluable expertise on modern data platforms, unified Lake House approaches, and implementing governance strategies that enable rather than hinder innovation. In today’s discussion, we will explore how organizations can balance technical considerations like open standards with business needs for managed services, and also examine how AI itself is revolutionizing data management practices.

So whether you are a data leader charting your organization’s AI journey, or a practitioner working in the trenches, today’s discussion is going to offer practical insights to help you maximize your data platform’s AI potential. So without further ado, let’s go and talk to Rob.

Rob, it’s great to have you back on the show.

Rob Schmit (1:51): Great to be here, thanks for having me.

Swapnil Bhartiya (1:53): Thanks for joining me today. Before we jump into this discussion, I’d like to understand—almost every organization is embracing AI, and many are already far along in that journey—what is the role of data and data platforms when organizations look to maximize the potential of AI?

Rob Schmit: Yeah, absolutely. It’s a great question, and I think it’s something that a lot of organizations are trying to grapple with right now. You know, a lot of people think of data as the gasoline that fuels the engine of generative AI or LLMs or machine learning in general. And that is certainly true. But even for folks who aren’t necessarily going to train their own models—they’re going to use open source or off-the-shelf models, fine-tuning them or using them in their own organization—a foundational data platform is still a critical component.

Because you need to have your data in order to be able to determine how effective these models are in your environment. You know, we can all look at the various evaluation datasets—SWE-bench and Humanity’s Last Exam—that are out there that are great for stack-ranking models against each other. But in terms of your own use cases, if you don’t have those cataloged and available and accessible so that you can run your own evaluations against these models with the pipelines that you’re building and deploying for your organization, you’re going to be behind the eight ball, and you’re going to struggle to really make a lot of progress against the kinds of things you’re trying to do.

So, an example: there are a lot of workflows out there that people are looking to augment and enhance. But if you don’t know, for a particular document or a particular question you’re going to ask of that document, what the answers and their various permutations are, and you don’t have a solid data foundation to execute those tests at scale, you’re never going to know how well you’re performing or whether you’re actually improving the workflow overall. So a fundamental data platform is an incredibly important foundation for doing generative AI and LLM work successfully in an organization.
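
To make that concrete, here is a minimal sketch of the kind of internal evaluation harness Rob describes, assuming your document/question/answer cases are cataloged as JSONL. The `ask_model` function and the file layout are hypothetical stand-ins for your own pipeline, not any particular tool.

```python
# A minimal internal evaluation harness, assuming cases are cataloged
# as JSONL lines of {"document", "question", "expected_answer"}.
# `ask_model` is a hypothetical stand-in for your model or RAG pipeline.
import json
from pathlib import Path

def ask_model(model_name: str, document: str, question: str) -> str:
    """Placeholder: call your deployed model or pipeline here."""
    raise NotImplementedError

def run_eval(model_name: str, cases_path: str) -> float:
    """Return the fraction of cases answered correctly (exact match here;
    real harnesses often use fuzzy or LLM-based grading instead)."""
    lines = Path(cases_path).read_text().splitlines()
    cases = [json.loads(line) for line in lines if line.strip()]
    hits = 0
    for case in cases:
        answer = ask_model(model_name, case["document"], case["question"])
        hits += int(answer.strip().lower() == case["expected_answer"].strip().lower())
    return hits / len(cases)

# Stack-rank candidate models on the same internal dataset:
# for m in ["model-a", "model-b"]:
#     print(m, run_eval(m, "evals/contract_qa.jsonl"))
```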

Swapnil Bhartiya (3:35): Excellent, thank you. Given the importance you just explained, can you talk about what organizations are already doing—or what they should and can do—to manage data complexity and unlock value, particularly as they capitalize on AI?

Rob Schmit (3:53): We see a lot of different approaches across organizations. By far one of the most common patterns we’re seeing—it has been around for a little while, but people are just starting to get their arms around it—is the data products or data mesh approach. We’ve been working with companies using this approach for a long time, and when you get to certain levels of organizational or business complexity, it’s really the only way you’re able to truly make sense of the data in your organization, because it’s coming in from different endpoints. It’s multimodal: it’s audio, it’s unstructured text. It’s not just logs or clicks coming in.

So you often need to think about how things operate within a domain, how you can organize that better, and how you can present that data to the broader organization to make it useful. That means moving from the core production systems—the systems producing the data—into the aggregation and analytics warehouses that put the data together in a transformed, clean form, and then to the surface layer, where the real value comes from exposing it to your enterprise and your customers so they can take action on the things they’re trying to do.

We see that in a lot of places, and we also see traditional techniques in a lot of places—people with centralized data analytics teams doing this work—and that works just as well. It very much depends on where you’re at and what you’re trying to do. But what really matters, no matter what technique or approach you’re using, is making sure you have the right level of expertise in the domains you’re trying to impact, particularly if you’re using machine learning or AI. If you don’t have that expertise accessible in your organization, that’s where the struggles come from. When people have to come in and learn how the data feels and what its edge cases are, and they have to start from scratch every time around, that’s where you see organizations lose a lot of velocity when they’re trying to move fast. And that’s what those techniques enable for a lot of organizations.
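
As a rough illustration of the data-product idea, a domain team might publish a contract like the following. The fields and values are illustrative, not any particular standard or tool.

```python
# Sketch of a data product contract a domain team might publish in a
# data mesh setup. Field names and values are purely illustrative.
from dataclasses import dataclass

@dataclass
class DataProduct:
    name: str                  # e.g. "orders.daily_revenue"
    domain: str                # owning business domain
    owner: str                 # team accountable for quality and support
    source_systems: list[str]  # core production systems feeding it
    output_port: str           # where consumers read it (table, view, API)

orders_revenue = DataProduct(
    name="orders.daily_revenue",
    domain="orders",
    owner="orders-data-team@example.com",
    source_systems=["orders-service", "payments-service"],
    output_port="warehouse.analytics.daily_revenue",
)
```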

Swapnil Bhartiya (6:02): As you know, not everybody has that expertise. As a journalist, I love the innovation that is happening, even though it keeps me on my toes—and I’m sure it keeps you on yours, because every day something new is happening. That keeps us excited, but it’s not so good for practitioners, who are challenged by it. What is the role of companies like Egen in this ecosystem, so that folks can keep moving at the pace of innovation without disrupting business continuity and without taking the risk of dipping their toes into unknown waters—so that they have a trusted partner? Can you talk a bit about the role of Egen in this space?

Rob Schmit (6:49): We’ve been working with customers on both the data platform front and in the AI space for over 15 years now. We’ve built some very, very large, hundreds-of-petabytes data warehouses with our customers, and we’ve helped manage them together. And I think that’s what working with a partner like Egen brings to bear: broad experience across lots of different industries, verticals, and problem spaces that we can apply for an organization that’s trying to figure out the right way to move.

We have a lot of the scar tissue of past initiatives, mistakes, and learnings, and all of that is really important for an organization, because it’s really hard when you’re starting from scratch. And even now in the AI space, it feels like you’re starting from scratch every six months. Across our projects, we have new problems kicking off all the time, so we’re continually gathering that information and expertise—acting as a kind of vanguard for your organization, figuring out some of these problems and recognizing patterns that apply across industries so that we can bring them to bear for our clients and customers.

Swapnil Bhartiya (8:08): What is the role of a unified Lake House approach in helping organizations manage data architecture challenges? Because this is something internal to them—even a company like Egen can only help from the outside—so how should they handle it internally?

Rob Schmit (8:24): A lot of what I recommend folks do is take a source-to-destination approach, right? The unified Lake House is an important component of that, and—if you look at Databricks or some other vendors—it’s the silver layer. That’s where that stuff sits: not the source transformation layer, but the middle aggregation layer, with maybe a presentation or gold layer down the line.

Having a unified data Lake House simplifies things for a couple of reasons. One is that it gives you a consistent storage pattern. You can train people against it, and you can build repeatable structural patterns that people can carry across the organization. It helps you go faster, right? That’s the whole purpose of standards—a lot of the decisions are made for you.
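
For readers who want to picture those layers, here is a minimal sketch of the bronze-to-silver-to-gold flow on a Delta-based Lake House. The paths, table names, and columns are hypothetical, and it assumes a Spark session with the Delta Lake connector configured.

```python
# Bronze -> silver -> gold on a Delta-based Lake House. Paths, tables,
# and columns are hypothetical; assumes Spark with the Delta connector.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw events landed as-is from the source systems.
bronze = spark.read.format("delta").load("s3://lake/bronze/orders")

# Silver: deduplicated, cleaned, conformed -- the unified middle layer.
silver = (
    bronze.dropDuplicates(["order_id"])
          .filter(F.col("order_total") >= 0)
          .withColumn("order_date", F.to_date("created_at"))
)
silver.write.format("delta").mode("overwrite").save("s3://lake/silver/orders")

# Gold: presentation-ready aggregates for the surface layer.
gold = silver.groupBy("order_date").agg(F.sum("order_total").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").save("s3://lake/gold/daily_revenue")
```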

What’s important is that you don’t just say, “and now we have a unified data Lake House.” There’s a lot of other enablement, and a lot of other decisions, needed to make that effective. A lot of what I think people need to do is borrow from our friends who have traditionally been doing application development. They know that an application development platform with guardrails accelerates teams—automated CI/CD, testing, things that have long existed in the overall programming and computing space but that, on the data science or data engineering side of the house, haven’t necessarily been adopted, for a variety of reasons.
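
Borrowing that practice might look like the following: a couple of pytest-style data quality checks run in CI against the silver layer. The table and columns are hypothetical, and `load_silver_orders` stands in for a read against a test fixture or staging copy of the data.

```python
# The kind of automated checks a data team can run in CI, borrowed
# from application development practice. Table and columns are
# hypothetical; `load_silver_orders` stands in for a fixture read.
def load_silver_orders():
    """Stand-in for reading the silver orders table."""
    return [
        {"order_id": 1, "order_total": 40.0},
        {"order_id": 2, "order_total": 12.5},
    ]

def test_order_ids_are_unique():
    ids = [row["order_id"] for row in load_silver_orders()]
    assert len(ids) == len(set(ids)), "duplicate order_id in silver layer"

def test_order_totals_are_non_negative():
    assert all(row["order_total"] >= 0 for row in load_silver_orders())
```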

So it’s important to have the Lake House as a standard. But you also need to think more broadly than just where your data lives and how it’s stored. You need to think about how you enable your teams to get data in and out of it without hamstringing them with a lot of decisions and red tape and compliance and all the other things. You want that to be part of the data Lake House platform. And I think that’s a transition folks are starting to make: thinking about it a little more broadly than just, “okay, the data goes in and it lives here, and I can query it, and it’s great, it’s magical.” How do I make sure the data is coming in the right way, and going out, and being presented and aggregated and harmonized in the right way as well?

We see a lot of folks working with data catalogs. It’s not just where my data is, it’s not just the tables: how do I make it discoverable? How do I make it usable? How do I make it verifiable? How do I put SLAs around this data? Those are the things we’re helping customers with nowadays—getting to the point where, rather than a sprawl of data warehouses across the organization, they have one place where things go in, and that enables them to do things a lot faster.
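
One way to picture what a catalog entry carries beyond table locations is to sketch it as plain data. Every field here, and the `catalog.register` call, is hypothetical; real catalogs (DataHub, Dataplex, Unity Catalog, and so on) each have their own APIs.

```python
# What a catalog entry can carry beyond "where the table is", expressed
# as plain data. All fields and the register() call are hypothetical.
catalog_entry = {
    "table": "warehouse.analytics.daily_revenue",
    "description": "Daily revenue per order date, built from silver orders.",
    "owner": "orders-data-team@example.com",  # who to ask for support
    "tags": ["finance", "gold-layer"],        # supports discoverability
    "quality_checks": [                       # supports verifiability
        "order_ids_are_unique",
        "order_totals_are_non_negative",
    ],
    "sla": {"freshness": "by 06:00 UTC daily", "availability": "99.9%"},
}
# catalog.register(catalog_entry)  # hypothetical client call
```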

Swapnil Bhartiya (11:08): That brings us to the territory—since you mentioned who has access to what and where the data is moving—of data governance and security. Robust data governance, security, and quality are crucial, from the raw data ingestion stage to curated insights. How can organizations prioritize and implement these policies across their data platforms today, and what are those policies?

Rob Schmit (11:41): Obviously, governance and security are top of mind for all of this, especially with the advent of AI and the things we’re all doing with it; making sure we’re presenting data and granting access in an appropriate way to those agents is an important piece. In my opinion, a lot of it comes down to: what layer and what level do people need access to? How are we presenting that data out to the organization? And what tools do we have available to secure it?

So some folks are going to need access to raw tables—raw data coming out of a production application or a database—in order to do their job. That’s one set, right? Then you’ve got folks who need access to the harmonized data and insights, because they’re the ones building the gold-layer outputs. And then you have your surface folks, who need access to a dashboard or something like that.

So understanding how you set the rules for those datasets as they move through the system is an important piece. We see a lot of folks trying to do this via a data catalog, and that’s a great way to start. What really matters, though, is having good policies around what is acceptable and what the rules are—essentially what we’d call a data constitution for an org. Here are the roles and responsibilities for the people producing the data; here are the roles and responsibilities for the people transforming the data; and here are the roles and responsibilities for the folks consuming the data. Who owns what? Who is responsible for security? What layer can we grant access to? And for this particular data product or data warehouse, what patterns are acceptable?

And you need to make sure that’s communicated to the org through the data catalog, or whatever solution you choose—that’s really where it will center. It’s also not just about saying yes or no, this person can have access; it’s about making it possible for people to request access, and making it clear who owns these data assets or data products, so that people can get help and support, and can tell you when something’s wrong. A lot of the security and governance around this isn’t just IAM and access policies. It’s really thinking about the role of the producer and the role of the consumer, marrying those two things together, and then using either the data platform or the data catalog as your main layer of enforcement for those practices and standards.
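
A "data constitution" of the kind Rob describes could be expressed as data before any tooling enforces it. This sketch is purely illustrative, with hypothetical roles and layers; actual enforcement would live in the platform or catalog (IAM bindings, row- and column-level policies, and so on).

```python
# A "data constitution" expressed as data. Roles, layers, and duties
# are illustrative; enforcement lives in the platform or catalog.
DATA_CONSTITUTION = {
    "layers": {
        "raw":        {"read": ["producer", "platform-engineer"]},
        "harmonized": {"read": ["transformer", "analytics-engineer"]},
        "surface":    {"read": ["consumer", "business-user"]},
    },
    "responsibilities": {
        "producer":    "schema stability; timely, documented source data",
        "transformer": "cleansing, harmonization, published lineage",
        "consumer":    "approved usage patterns; reporting issues to owners",
    },
}

def can_read(role: str, layer: str) -> bool:
    """Check a role against the constitution before granting access."""
    return role in DATA_CONSTITUTION["layers"][layer]["read"]

# can_read("consumer", "raw")      -> False
# can_read("consumer", "surface")  -> True
```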

Swapnil Bhartiya (14:22): When we look at data, balancing the use of open standards and formats with the desire for managed services and reduced operational overhead is a critical decision. How should organizations approach this trade-off in their data infrastructure strategy, because they do lose some control and they lose a lot of flexibility there as well?

Rob Schmit (14:46): What we really think about, or what I advise and what Egen advises clients to think about when they’re making any kind of technology decision, regardless of whether it’s for data or anything else, is think about what your on-ramp is, and think about what your off-ramp is.

So the on-ramp is not just the technical capabilities of the underlying infrastructure, software, warehouse, or database you’re adopting. It’s: how am I going to get my folks comfortable using it? How are we going to integrate it into our systems? What is the effort, and what is the lift, to do that? And how do we make sure we maintain this software and keep it effective for the life of the organization, or for the life of the use case we have?

The next thing to figure out is how we get away from it, because, as we all know, today’s code push is tomorrow’s tech debt, and eventually the decisions we make need to be revisited and changed. Locking yourself into a particular pattern or solution might not leave you many options to get away from it. And that’s not only from a spend or cost or licensing perspective. Even in the open source world, choosing to go down, say, the Iceberg versus the Delta Lake path has ramifications for how you would have to shift if you want to move from one to the other. There are ramifications for the infrastructure, or for the cloud services needed to provide that infrastructure at scale. And then there’s the retraining aspect: your team is familiar with these technologies and knows how to use them, and now you’ve got to figure out how to enable those folks on the new ones.

So taking that into consideration, I think, is really important for most organizations as they’re going through this journey. It’s not just about going out and buying something, right? It’s about thinking holistically about the platforms and use cases and the things that are going on in your organization and how this solution is going to support them.

And in particular, we see this more on the Gen AI side, where the platform you built six months ago has to be ripped apart and redone because the models are better, newer, more capable—there’s more you’re able to achieve with them, and you’re able to just do more. Back in the day—and I’m talking 18 months ago—we were in pure chat, text, RAG-based modalities, right? Look at today: we’re throwing audio at these things, we’re throwing video at them, we’re generating images, we’re running multiple agents together as judges, computing Elo scores and all sorts of stuff like that. If we hadn’t taken a conscious look at the technology and the foundational pieces we were using, redesigning that would have been a very, very painful effort, and it would have left us unable to take advantage of the next thing coming down the pipe that might be better than what we’ve got.
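
For reference, here is the Elo update Rob mentions, as typically applied when a judge model compares two candidates head to head: the winner takes rating points from the loser in proportion to how surprising the result was. This is the generic formula, not any specific leaderboard's implementation.

```python
# Standard Elo update after one judged head-to-head comparison.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b); k controls how fast ratings move."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start at 1000; model A wins one judged comparison:
# elo_update(1000.0, 1000.0, a_won=True)  # -> (1016.0, 984.0)
```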

Swapnil Bhartiya (17:48): AI/ML capabilities are being integrated across organizations. Beyond data analysis, how do you see AI/ML capabilities being integrated into an organization’s data platform to enhance areas like data discovery, metadata management, or proactively identifying data quality issues?

Rob Schmit (18:06): There are a lot of different things coming out that we’re playing with right now. An interesting one that has come up recently is in the Google ecosystem: BigQuery now has multimodal tables, where you can index unstructured data—audio files, video files, or images—and then expose that via their agent-based technology, particularly to business users. These are people who need information on the fly, or who are trying to build a picture of, say, a particular client’s spend or their recent contracts. You see a sales rep going out on calls, using Agentspace to pull all of that together via the BigQuery integration, to get the information they need to better serve their customer, or to come to those meetings better armed with solutions or ideas to help their clients. So that’s one area where you start to see the data platform advancing to support these AI-based initiatives.
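
As a hedged sketch of the pattern Rob describes, BigQuery's object tables over Cloud Storage are one concrete way unstructured files become queryable; whether they match the exact "multimodal tables" feature he references is an assumption, and the project, dataset, connection, and bucket names below are all hypothetical.

```python
# Sketch of making unstructured files queryable in BigQuery via an
# object table over Cloud Storage. Object tables are a real BigQuery
# feature; every name below is a hypothetical placeholder.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# One-time setup: an external object table over files in a GCS bucket.
client.query("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales.client_docs
    WITH CONNECTION `us.gcs_connection`
    OPTIONS (object_metadata = 'SIMPLE',
             uris = ['gs://my-bucket/contracts/*'])
""").result()

# Analysts (or agents) can then query metadata about those documents.
rows = client.query("""
    SELECT uri, size, updated
    FROM sales.client_docs
    ORDER BY updated DESC
    LIMIT 10
""").result()
for row in rows:
    print(row.uri, row.size, row.updated)
```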

Another area that is starting to become more prevalent is the evolution of the data agent itself, which can take on some of that lower-level transformation and data-cleansing grunt work. I know there’s a lot of debate—certainly a hot topic in the industry—about what these agents can actually do from a coding perspective. But historically, a lot of the fundamental grunt work in data engineering has been cleansing data, acquiring it, and making sure it’s well-formatted and usable, and I think that’s an area where data agents will be able to build domain expertise. I see it as a value-added function for a lot of the domains out there. A lot of organizations will potentially be able to train up their own data engineer for a particular domain—a virtual agent that can handle some of the low-level transactions or the day-to-day work a data engineer would normally do—and reserve the real data engineering for the hard problems, the really big models, and the really nasty problems they’re trying to solve in their organizations. I think there’s going to be a lot of growth in that area over the coming years.
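
One plausible shape for such a data agent, sketched with a hypothetical `call_llm` stand-in rather than any specific vendor API, and with a human reviewing every proposed transform before it ships:

```python
# A domain data agent handling cleansing grunt work, with human review.
# `call_llm` is a hypothetical stand-in, not a specific vendor API.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your model endpoint (cloud-hosted or local)."""
    raise NotImplementedError

def propose_cleansing(sample_rows: list[dict], issue: str) -> str:
    """Ask the agent to draft a cleansing transform for a known issue."""
    prompt = (
        "You are a data engineering assistant for this domain.\n"
        f"Known data issue: {issue}\n"
        f"Sample rows: {json.dumps(sample_rows[:5], default=str)}\n"
        "Propose a Python function clean(row) that fixes the issue. "
        "Return only code."
    )
    return call_llm(prompt)

# suggestion = propose_cleansing(rows, "dates arrive as MM/DD/YYYY strings")
# A data engineer reviews `suggestion` before it goes anywhere near prod.
```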

Swapnil Bhartiya (20:31): Rob, thank you so much for joining me today, and thanks for sharing great insights on how organizations can build the right data platform strategy. I look forward to chatting with you again.

Rob Schmit (20:43): Thank you. Great to talk to you. Take care.
