Kubernetes Makes It Easy To Deploy And Manage Massive Databases

In this episode of TFiR Let’s Talk from KubeCon + CloudNativeCon EU, Swapnil Bhartiya sits down with Sanjeev Mohan, Principal Analyst at SanjMo, to discuss the key trends he is seeing with data and Kubernetes. Mohan is amazed at how Kubernetes is enabling companies to deploy databases more effectively and efficiently.

“A few years ago, if I were to deploy a database on hundreds of nodes, it would take me days. What if a node went down and I had to reinstall it? Today, what I’m seeing is an extreme scale of databases provisioned literally within hours,” he said.

Observability was one of the hottest topics of discussion at KubeCon this year. Mohan shares his views on its challenges and why he feels organizations need to change how they see the role observability plays in the pipeline.

Key highlights from this video interview are:

Deploying a database on hundreds of nodes would have taken days just a few years ago; however, nowadays it’s much quicker utilizing Kubernetes’ deployment automation. Mohan discusses the trends he is seeing with Kubernetes adoption and how it is progressing.
Kubernetes was originally meant for stateless workloads, which was problematic for data. Mohan explains why ACID compliance has become such an integral part of stateful management using Kubernetes. He shares his insights into why data has lagged behind applications.
Mohan talks about how Kubernetes is helping not only to manage the growing amount of data, but also to deploy it faster and extract value from it. He explains the dichotomy we are seeing with large companies using Kubernetes to build their own databases while others are using serverless APIs, where the end user does not see Kubernetes.
Mohan shares his views on the three key trends he is seeing in the Kubernetes space: data observability, streaming, and semi-structured and unstructured data. He goes into detail about how they are evolving and being adopted.
Observability needs to be part of a two-pronged approach, observing and then taking action. Mohan sees a trend where many people think observability is one of the many components in the environment. However, he believes that it needs to be seen as the orchestrator of the pipeline.
Mohan talks through the challenges he sees with data and analytics architecture being micro-segmented to a point where it is specialized further and further. He explains how this is complicating the stack by having to stitch together the different pieces.
Many feel that silos are being broken with the DevOps movement; however, Mohan has a different perspective. He explains the trends in the transition of how architecture has evolved, alternating between centralized and decentralized, and what he believes will be the outcome in the future.

Connect with Sanjeev Mohan (LinkedIn, Twitter)

The summary of the show is written by Emily Nicholls.

[expander_maker]

Here is the automated and unedited transcript of the recording. Please note that the transcript has not been edited or reviewed.

Swapnil Bhartiya: Hi. This is your host, Swapnil Bhartiya, and welcome to the second day of KubeCon and CloudNativeCon here in Valencia, Spain. And today, we have with us Sanjeev Mohan, principal analyst at SanjMo. Sanjeev, first of all, it’s great to have you on the show.

Sanjeev Mohan: Thank you so much for inviting me.

Swapnil Bhartiya: I have been, of course, watching your talks on the Kube and other areas as well. So there are so many things I want to talk to you about. So first of all, it’s great that you are here so now we can talk for [inaudible 00:00:30] in person so that’s a different energy altogether. But before we get into all the weeds of Kubernetes adoption, tell us a bit about yourself and the company itself.

Sanjeev Mohan: That’s great. Thank you so much. It’s such a pleasure to be here in Valencia, Spain. I mean I’m so glad we get to travel again after a gap of over two years. Up until last summer, I was a Gartner analyst. I was there for many years. I ran the agenda for data and analytics space, and it’s such an amazing place. I learnt a lot, networked a lot. And then last year, in fall, in August, I decided to just go on my own, become independent. And now all of a sudden, my world has exploded. So now, I not only am an analyst. One day, I’m an analyst, next day, I’m organizing an analyst day, marketing event for a client. Then the following day, I could be doing sales enablement, and then I could be talking to an investor about what’s a market pulse about investing, and then I could be doing recruiting. So I get to basically work with much larger number of clients and do a variety of things, and that’s my life these days.

Swapnil Bhartiya: Excellent. That’s an excellent, incredible change from what you’re doing and then doing your own thing. Now, let’s talk about where we are, KubeCon.

Sanjeev Mohan: Right.

Swapnil Bhartiya: First of all, tell me a bit about what kind of trends you are seeing in this space, first of all, the fact is that Kubernetes adoption is growing as usual. And the difference is that in early days, of course, with a lot of technologies, the early adopters are sometimes tech behemoths who have all the tech know-how, skills, they want to play with the… They don’t mind getting their hands dirty, but now the Kubernetes adoption is also happening at a smaller scale companies also. It’s more or less like Linux’s story, everybody… The joke will be that my toaster runs on Kubernetes now so… Which also means that the kind of challenges, problems, opportunities also change because the company at larger scale has a different challenge versus smaller scale.

Sanjeev Mohan: Right.

Swapnil Bhartiya: We are seeing adoption of low-code/no code because everybody wants to get on the band. So in general, broader, what trends are you seeing in this space? So two things I want to look at. First of all, what is driving the adoption? And because of this adoption, what is happening to the rest of the industry?

Sanjeev Mohan: So Swapnil, one thing that I want to clarify right off the bat that I’m not in my home grounds. I’m at my peak when I’m at a data and analytics event. Here, I am literally in the midst of developers, infrastructure people, DevOps, SREs. So I bring a very different perspective. Even at this conference, who are the companies that I’m visiting mostly? I’m going to mostly database companies that have big, large presence here. For example, DataStax, CockroachDB, Hazelcast, the whole bunch, EnterpriseDB. So what I’m seeing them do with Kubernetes is phenomenal.

For example, if I were to deploy a database on hundreds of nodes, just a few years ago, it would take me days. What if a node went down and I had to reinstall it? Now, what I’m seeing is extreme scale of databases provisioned literally within hours. How? You have a YAML file. In the YAML file… I mean it’s advanced to a point where we can truly do hybrid multi-cloud with on-prem and multiple cloud providers. How do I do it? I can have a YAML file and I can say provision a hundred nodes in AWS, a hundred nodes in Azure or GCP, 200 nodes on premises, and I can do all of that in a YAML file and I can let it run. If a node goes down, it auto heals and it auto starts. So this is how Kubernetes is really automating our deployments, management, monitoring. All of that is becoming a cinch these days.

Swapnil Bhartiya: You talked about databases. Initially, when we were looking at Kubernetes, it was more or less for stateless workloads, but now you’re looking at stateful workloads.

Sanjeev Mohan: Yes, correct. Yes.

Swapnil Bhartiya: So how have things changed there?

Sanjeev Mohan: Yeah.

Swapnil Bhartiya: Can you talk about that as well?

Sanjeev Mohan: Yeah. So because I’m representing the data and analytics space, so it’s been an amazing journey for the data space. Data space has lagged application by years. If you look at DevOps, 10 years ago, applications adopted DevOps, domain-driven design, application, trading products as a… All of these things have been around for a long time, but data is only now catching up to it.

Same thing with Kubernetes. When Kubernetes first came out, it was meant for stateless workloads. So if a node went down, then Kubernetes would detect and it would restart that node. But if a data node goes down, that’s a huge problem because that can corrupt your data and make it so inconsistent. So ACID compliance becomes a really, really important part of doing stateful management using Kubernetes.

Now we have it. And now the database vendors have wholeheartedly adopted it. How are we doing it? Through an operator. The operator understands what the needs of the database are. The operator has been written and optimized for that database provider. And it knows what the most optimized and efficient way of delivering that service is.
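The self-healing behavior described above, where a declared node count is continually compared with what is actually running, can be sketched as a reconciliation loop. The following is a minimal illustration in plain Python, not a real Kubernetes operator: `DesiredState`, `Cluster`, and `reconcile` are hypothetical names, and the in-memory cluster stands in for whatever a database-specific operator would actually manage.

```python
from dataclasses import dataclass, field


@dataclass
class DesiredState:
    """What the declarative spec asks for (hypothetical field)."""
    replicas: int


@dataclass
class Cluster:
    """Toy in-memory stand-in for a database cluster; not a Kubernetes API."""
    healthy: set = field(default_factory=set)
    next_id: int = 0

    def add_node(self) -> str:
        # Create a new node with a fresh name and mark it healthy.
        name = f"db-{self.next_id}"
        self.next_id += 1
        self.healthy.add(name)
        return name


def reconcile(desired: DesiredState, cluster: Cluster) -> list:
    """Drive the observed state toward the desired state; return actions taken."""
    actions = []
    # Too few healthy nodes: start replacements until the count matches.
    while len(cluster.healthy) < desired.replicas:
        actions.append(f"start {cluster.add_node()}")
    # Too many nodes: stop the lexicographically last one until the count matches.
    while len(cluster.healthy) > desired.replicas:
        victim = sorted(cluster.healthy)[-1]
        cluster.healthy.remove(victim)
        actions.append(f"stop {victim}")
    return actions


if __name__ == "__main__":
    cluster = Cluster()
    spec = DesiredState(replicas=3)
    print(reconcile(spec, cluster))   # initial provisioning
    cluster.healthy.discard("db-1")   # simulate a node failure
    print(reconcile(spec, cluster))   # the loop starts a replacement
```

A real operator for a stateful database would do considerably more than count nodes, for example rejoining a replacement node safely before it serves traffic; this sketch only shows the compare-and-correct pattern.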

So just one more thing I want to say is that there’s a reason why data is behind applications because applications don’t change very frequently. You write an application, you deploy it. Data changes every second. I could have data drift. I could have schema drift. My machine learning models could have a drift. So it’s very key to understand what’s going on in the data space as opposed to the application space. So data is much more dynamic. People’s authorizations change. Compliance laws change and so we have to do data residency. There is a lot of extra rigor that data needs which applications don’t or the infrastructure.

Swapnil Bhartiya: Not only is the data changing, but the size also grows. Application size will remain the same.

Sanjeev Mohan: Correct.

Swapnil Bhartiya: And then also, you will be putting data on a data lake or data warehouse. There is a whole debate going on, which one is better.

Sanjeev Mohan: Correct.

Swapnil Bhartiya: Plus data itself has no value. You have to extract value from it.

Sanjeev Mohan: That is correct.

Swapnil Bhartiya: So, can you also talk about, first of all… I mean once again, it can be so many questions bundled together so I’ll throw it one by one. One is that where to put your data? I mean I don’t want to get into a debate on data warehouses versus data lakes versus whatever it is, but petabytes of data is there, but you have to analyze it and then where do we analyze it also matters. So how is Kubernetes kind of helping with not only as the data is growing, but you should be able to deploy it faster and also most importantly extract value from it?

Sanjeev Mohan: That’s an excellent question. You know why? Not because of this debate, data warehouse versus data lake versus lakehouse, all that is for a different time, but it’s interesting because Kubernetes plays a role in bringing up these databases, but there’s an alternate thing that’s happening these days and that’s called serverless. And in the serverless world, there is no Kubernetes. There is Kubernetes a lot, but it’s abstracted from the end user. So the end user doesn’t even care whether it’s a data warehouse or a data lake and which cloud it is in. Data is now being accessed as an API, actually as a product.

So I have an API that says insert the data into a database and the API goes somewhere into the ether and magically, it’s there. Now when I want to retrieve that data, I can use SQL or I can use Ruby, C, C++, C#, Python, whatever I want, and I can extract that data through an API. I don’t even know where it’s installed, that database is. I don’t care whether Kubernetes is being used.

What is interesting in this whole scenario, the provider is using Kubernetes to stand up my container and serve me that data. I, as an end user, don’t see Kubernetes. It’s hidden away from me. So we are seeing this like two, a dichotomy. There are large companies that are using Kubernetes to build their own databases, and then there are these companies that are using serverless APIs and they don’t even care as to what’s the underlying engine.

Swapnil Bhartiya: We are like almost in the midst of 2021 or 2022, it’s hard to say which year that is. Now we’re getting out of COVID so it’s easier to know which year we are in. What kind of trends are you expecting to see that will grow with or you’re like hey, [inaudible 00:10:24], we are hitting the peak and these are the challenges that the community will face?

Sanjeev Mohan: Very good question. So there, I’m seeing a lot of very interesting trends. One of them is in the space of observability. Observability is huge at KubeCon this week. But again, I present a different view. So the observability that’s big down in the exhibit hall… I know you’ve been busy recording these videos, but when you get a chance to go down, you’ll see there are all of these like Datadog, New Relic, all of these that do what’s called observability data. What they’re doing is they’re doing log analytics. They’re taking this data, Elastic is big in this space, taking that data and analyzing the logs.

What I see as a big trend is the opposite, which is data observability. Data observability is understanding the life cycle of data. It is understanding things that I just mentioned like data drift, but also data quality. I expected X amount of data, but I got half of it, the context of data. Performance of the resources, how many compute, how many Spark executors, worker nodes did I use and why, understanding the cost. So that data pipeline, that understanding, monitoring and predicting the performance of the pipeline is what we call data observability. So it is the opposite of observability data, which is more of log.
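The "I expected X amount of data, but I got half of it" case can be illustrated with a simple volume check. This is a minimal sketch in Python under stated assumptions: the function name `check_volume`, the 10% default tolerance, and the shape of the returned dictionary are all hypothetical and are not taken from any data observability product mentioned in the conversation.

```python
def check_volume(expected_rows: int, actual_rows: int, tolerance: float = 0.1) -> dict:
    """Compare the row count a pipeline delivered with the count it was expected to deliver.

    The tolerance is the allowed relative deviation (0.1 means plus or minus 10%).
    """
    if expected_rows <= 0:
        raise ValueError("expected_rows must be positive")
    # Relative deviation: negative means less data arrived than expected.
    deviation = (actual_rows - expected_rows) / expected_rows
    return {
        "expected": expected_rows,
        "actual": actual_rows,
        "deviation": deviation,
        "ok": abs(deviation) <= tolerance,
    }


if __name__ == "__main__":
    # Half of the expected data arrived, so the check fails.
    print(check_volume(expected_rows=1000, actual_rows=500))
```

A fuller data observability setup would track many more signals, such as schema drift, freshness, and resource cost; volume is only the simplest one to show.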

The joke I have is that the data observability, companies that I spent time with, they all want to be the Datadogs of the world. So Datadog is like this, the marquee name, and the funny thing is Datadog does everything, but data, even though data is in their name. And now, of course, when they listen to this, they’ll be like, “How dare you. Data is what we do.” But what I mean is it’s not like the data that I’m talking about, which is what is the meaning of customer ID versus quantity of goods sold and stuff like that. They’re doing data, which is logs and data… So that’s a very big trend.

The second big trend is in streaming. So streaming, either ingesting streaming data and doing anomaly detection or all kinds of analysis on streaming data directly, like Confluent, ksql, some of these things do. Now those capabilities are being adopted by everybody. So that’s streaming ingestion or streaming data. Also, streaming data as an egress, what we call change data capture or CDC. So taking the data as it comes in and moving it into an analytical environment like a data lake. So streaming data is huge, data observability.

The third thing is unstructured data, semi-structured and unstructured data. For the longest time, we built our catalogs. We’ve done a lot of development on structured data, but most of the data within an organization is not structured. It’s in PDF files. It’s in images and videos and all kind of stuff. So now, we are getting better at understanding, extracting that meaning, the entities out of this unstructured data, cataloging it, making it discoverable and doing analysis on it.

Swapnil Bhartiya: Excellent. Since you’ve brought up observability, I have a question for you.

Sanjeev Mohan: Sure.

Swapnil Bhartiya: And which is like… I’m an [inaudible 00:14:03] developer or DevOps. I’m an observer from outside.

Sanjeev Mohan: Yes.

Swapnil Bhartiya: No pun intended. So as somebody who’s observed from [inaudible 00:14:10] and I look at observability, it’s more about to know what happened with the system and then analyze what happened. But what happened to the action part of it, to take an action? Learning what is happening is only half a problem, but-

Sanjeev Mohan: Yes, it’s definitely my favorite topic. I have been saying this in some other forums that understanding what went wrong is like, okay, great. I mean I know there was a breach and… But what do I do with it? So I have a point of view that observability should become the master or the leader that orchestrates actions downstream. So by observing the data, I should be able to say that there’s a data quality problem. Mr. Data Quality Product, go remediate it, go fix the data quality problem. Then I send it to Airflow for orchestration. Then I say, okay, the next step is reporting or notification or whatever it may be. So observability space needs to become the master orchestrator of the pipeline. And that’s a very opposite way of thinking because people just think observability is one of the many components in my environment, but it actually becomes the orchestrator.
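The idea of observability acting as the orchestrator, rather than as one passive component, can be sketched as a dispatcher that routes each finding to downstream actions. This is a minimal Python illustration only: the finding kinds, the handler functions, and the `ROUTES` table are hypothetical, and the handlers return descriptive strings instead of calling a real data quality tool, Airflow, or a notification service.

```python
from typing import Callable, Dict, List

# A handler takes a finding and returns a description of the action it took.
Handler = Callable[[dict], str]


def remediate_quality(finding: dict) -> str:
    # Placeholder for handing the issue to a data quality product.
    return f"remediate data quality on {finding['dataset']}"


def run_pipeline(finding: dict) -> str:
    # Placeholder for triggering a workflow orchestrator run.
    return f"trigger pipeline run for {finding['dataset']}"


def notify(finding: dict) -> str:
    # Placeholder for reporting or notification.
    return f"notify owners of {finding['dataset']}"


# Hypothetical routing table: which downstream steps follow each kind of finding.
ROUTES: Dict[str, List[Handler]] = {
    "data_quality": [remediate_quality, run_pipeline, notify],
    "open_bucket": [notify],
}


def orchestrate(finding: dict) -> List[str]:
    """Route a finding to its downstream actions; unknown kinds only notify."""
    handlers = ROUTES.get(finding["kind"], [notify])
    return [handler(finding) for handler in handlers]


if __name__ == "__main__":
    for action in orchestrate({"kind": "data_quality", "dataset": "orders"}):
        print(action)
```

The design choice shown here is that the observing component owns the routing table, so detection and the ordering of follow-up actions live in one place.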

Swapnil Bhartiya: But another thing with observability when I look at it is like, of course, you look at CNCF Landscape, there are different projects, but when I look at it that if you look at a business, they are simply trying to solve a specific problem for their customers. All those things are second. Yeah, they do have to do that for that app to run successfully, but it becomes so complicated. In the end, what I want is that, hey, you know what, I developed application, I wrote application. I’m running it. I want it to run this smoothly, whether it’s security issue, whether it’s bug, whether it’s reliability so… But we have broken things into observability, there’s security, there’s reliability. You use some jargons I think, but what is your overall perspective that what should be a project? You said that observability should be the master. We should be driving everything.

Sanjeev Mohan: The problem that we are running into is that we have micro-segmented our data and analytics architecture to a point where we just keep specializing, specializing, keep slicing, Reverse ETL, that’s a separate topic, metrics layer, feature store. So the CIO or the end user is like, wait, I just need to deliver value to my business. I don’t understand how many different products do I need.

Observability is interesting because in observability, people could be only doing data quality or they could be doing performance or they could be doing cost, but the CIO also needs security and privacy. So they need to know that an S3 Bucket was left open by mistake and it needs to be fixed ASAP. So you see? Or this data belongs to EU and only people in EU should be able to see it.

So we have a problem actually where the stack is getting complicated. It’s getting disaggregated or unbundled into multiple components, and we are expecting the CIO or some end user to understand how to bring it all together and that’s not going to happen. This is actually the reason why we are seeing an emergence of bundled software, an emergence where companies are saying, I just want to buy one product, maybe that’s 80% of what I do, but it takes away all the overhead of stitching together different pieces.

Swapnil Bhartiya: Sometimes I feel that with the whole DevOps movement, cloud native movement, we were going to break old silos. But for those, silos were different, but I do feel that we are kind of creating new silos because silos is more about expertise-

Sanjeev Mohan: A data lake is a silo.

Swapnil Bhartiya: Yeah. And when I bring it up, sometimes people say, no, I don’t look at end silos. I look at federated. But even in that case, you see, they don’t… I mean they’re trying to make them talk to each other, but there is once again, specialization. That’s what we are trying to break with the whole DevOps, devs and ops and then, DevSecOps also come in. So do you think that we are still kind of moving in the same direct like fashion, then we’ll try to break these silos once again.

Sanjeev Mohan: So very interesting history of this transition of how architecture’s evolved. If you go back to the modern era of computing, you start with mainframes. Mainframes, everything was a silo. It was centralized. You had to buy a mainframe from IBM or Amdahl or some Hitachi or some of those companies at that time and everything was in there. Then we got into the era of PCs. So we decentralized everything. Everyone had a PC. Then we put Windows and we centralized it because everybody had to use Windows. Then the internet came and we decentralized it with the internet. But then Snowflake came and said, no, just be on Snowflake. We’ll solve all your problems. So we, again, centralized.

Today, we are back in the era of being decentralized. So now, we are talking about decentralized apps, decentralized architectures. Web3 is coming in, all the crypto and all of that hyperledger and all that. So we are again, decentralizing, but it’s a pendulum, it’s going to shift and at some point, we will centralize again. It’s just how our technology just shifts. So right now, we are just in the era of inventing new stuff and adding more and more components to the stack.

Swapnil Bhartiya: Sanjeev, thank you so much for taking time out today and, of course, sharing your insights. And it was an incredible discussion and I would love to have you back on the show, whether online or in person, but I really appreciate your time with-

Sanjeev Mohan: Thank you so much. I am so happy I met up with you at this conference and I’m sure our paths will cross again. So thank you so much for inviting me. Appreciate it.

[/expander_maker]
