Why Metadata Is Key In A Highly Scalable System

Our data needs have changed over the past 20 years, particularly with the growth of the Internet of Things (IoT), and so has the way metadata is managed. Existing storage engines were not built to scale, and managing metadata has become complex and difficult.

In this episode of Let’s Talk, Adi Gelvan, Co-Founder and CEO of Speedb, takes Swapnil Bhartiya through the current challenges of metadata and storage engines in the modern cloud-centric world. He discusses what motivated him along with the other founders to create the next-generation storage engine and how it is solving these problems.

When asked who is affected by the challenges of metadata, Gelvan responds, “Everybody. Everything around us is data. I don’t think I know any technology company that doesn’t rely on data. Data is the new oil. If you want to analyze fast, you need to access the data very, very fast and analyze it very fast. You cannot do it if you don’t access the metadata very fast.”

Key highlights from this video interview are:

Gelvan describes the problems the company had seen with storage engines and why the previous ones were not built to scale. He explains why this motivated him and the other founders to create Speedb and how it is solving problems with scaling.
Gelvan takes a deep dive into what a storage engine is and why the growth of IoT over the last 10 years has led to a need to create a new data structure that can handle the large volume of metadata.
Over the past 20 years, our data needs have changed dramatically and so has the metadata that contains the information about the data. Gelvan explains the evolution of metadata and its relationship to the data and why managing it is challenging nowadays.
Since every technology company relies on data, Gelvan feels that this is a problem for everyone. He explains why metadata is key to being able to analyze the data quickly and why companies need new technologies to manage metadata efficiently. He discusses why Speedb is helping to solve these problems.
The people who build data systems or have access to the architecture of the application depend on the data and on managing it. Gelvan details why they are struggling with storage engines, which can lead to performance problems.
Gelvan believes Speedb can change the way metadata is managed in the world. He goes into detail about what is in the pipeline for the company, such as going open source and their enterprise commercial version.

Connect with Adi Gelvan (LinkedIn, Twitter)

The summary of the show is written by Emily Nicholls.

[expander_maker]

Here is the automated and unedited transcript of the recording. Please note that the transcript has not been edited or reviewed.

Swapnil Bhartiya: Hi, this is your host Swapnil Bhartiya, and welcome to another episode of TFiR Let’s Talk. Today we have with us Adi Gelvan, Co-Founder and CEO of Speedb. Adi, it’s great to have you on the show.

Adi Gelvan: Thanks for having me. Pleasure being here.

Swapnil Bhartiya: This is the first time we are talking to each other, so I would love to know a bit about the company because you’re also a co-founder. So let’s start with some of the basics: what problem did you see in the market that you wanted to solve, and that led to the co-creation of this company?

Adi Gelvan: Okay. So it started with our own experience of trying to embed a storage engine, which is a software layer that actually connects any media with the applications that manage data. Every application that manages data, whether it’s application, cyber security, streaming, database storage, it runs with a software layer that determines how the data is written to the underlying media. This part is called a storage engine. Not many people know about it, but it’s always there.

My co-founders, Hilik and Mike, were the chief scientist and chief architect of a storage unicorn. They needed to embed a storage engine to manage the metadata. They took the most prevalent storage engine in the market, called RocksDB. It’s the brainchild of Facebook, and they embedded it in the storage and they found out that it actually works great with small data sizes, but not really as expected with large data sizes. They tried to do something about it. When they looked inside, they saw that the whole storage engine market is actually made of architectures that are not built to scale. So we decided to create the next generation storage engine to actually enable data scaling in applications. That’s what we’re doing at Speedb.

Swapnil Bhartiya: What is a storage engine? Look at today’s modern cloud native, cloud centric world, where data is everywhere; we are not only generating it, but also consuming it. Help us also understand the importance of storage engines.

Adi Gelvan: When you look at the data stack, you’re looking at the application or the user first. You’re talking to a certain application, it can be any given application, and the application, in order to store data, will use a database of any kind. It can be a streaming, OLTP, or analytics database, whatever database or data structure to manage the data. Underneath this data structure is a storage engine. It can be LSM tree or B tree based or whatever. Then the underlying storage that we all know can be S3, can be a file system, can be any storage vendor. So the storage engine is there and it determines how the data is going to be put on the underlying media. The reason people don’t know, or didn’t know until a couple of years ago, what a storage engine is, is that the main goal of a storage engine is to manage the metadata. Metadata is actually the data about the data and you need to access it very, very fast. It usually resides in the memory so you can actually access it very fast. It gives you the pointers or the addresses of the actual data.

What happened in the past decade is that the data we’re dealing with in the world has changed. The connected devices, the IoT, actually made data much more complex than it was before. Now, the ratio between metadata and data itself is not like one to 1,000, but more like one to one or one to two or one to 10, and sometimes the opposite: the metadata is much larger than the data itself. So now the metadata is growing exponentially and it cannot reside in the memory. So you cannot access it very fast. When you cannot access the metadata very fast, it actually creates a butterfly effect and you cannot access the data fast. So the users actually get inferior service from the applications. What we’re trying to do is build a new data structure that will enable the metadata to go out of the cache to the media, and yet still be able to provide very fast access to the data.
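To make the idea of metadata as "pointers or addresses of the actual data" concrete, here is a minimal Python sketch. It is an illustration only and not how RocksDB or Speedb is implemented: the `TinyStore` class, its method names, and the 16-byte estimate per pointer are all assumptions made for this example. Values are appended to a byte buffer that stands in for the underlying media, while an in-memory dictionary holds the metadata that locates each value.

```python
import io


class TinyStore:
    """Toy append-only store (hypothetical): values go to 'media',
    and the pointers to them (the metadata) stay in memory."""

    def __init__(self):
        self.media = io.BytesIO()  # stands in for disk or object storage
        self.index = {}            # metadata: key -> (offset, length)
        self.data_bytes = 0        # total bytes of actual data written

    def put(self, key: str, value: bytes) -> None:
        # Append the value at the end of the media and remember where it is.
        offset = self.media.seek(0, io.SEEK_END)
        self.media.write(value)
        self.index[key] = (offset, len(value))
        self.data_bytes += len(value)

    def get(self, key: str) -> bytes:
        # Every read starts with a metadata lookup, then one read of the media.
        offset, length = self.index[key]
        self.media.seek(offset)
        return self.media.read(length)

    def metadata_to_data_ratio(self) -> float:
        # Rough estimate (an assumption): key bytes plus 16 bytes
        # for each (offset, length) pair.
        if self.data_bytes == 0:
            return 0.0
        meta_bytes = sum(len(k.encode()) + 16 for k in self.index)
        return meta_bytes / self.data_bytes
```

Storing one large photo under a short key gives a ratio far below one, while storing many four-byte sensor readings under descriptive keys pushes the ratio above one. That is the shift described above: once the index no longer fits in memory, every `get` pays an extra trip to slower media before it can even find the data.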

Swapnil Bhartiya: Can you also talk a bit about the evolution of metadata alongside the evolution of data engineering and storage technologies? The fact is, when Facebook was created, we lived in a different world than we are living in today. Not only are we creating a huge amount of data, but the way we consume it, and from where, is also different, and the data we are creating is different too. Plus, what value does data have without metadata? That is also very important to understand. You’re saying that sometimes there is more metadata than data. So talk about these aspects so that we can also understand the problem area that comes with metadata.

Adi Gelvan: Great question. Okay. So let’s go 20 years back. When you looked at data, you were looking at files, pictures, videos, and the legacy data that would be in megabytes, gigabytes, or terabytes per piece. The metadata was actually the information about the data, which would normally be a couple of bytes, maybe tens of bytes, or at most kilobytes. In recent years, everything out there actually generates data: the location, the temperature, where you are, what you did. Every smartphone is creating tons of metadata and tons of data, and what you see around the world is that the data we’re dealing with now is not big pictures or big videos or big files. It’s more JSON files and small, tiny files that come from sensors, watches, phones, whatever. Any device today creates data and metadata. So, imagine that every piece of small data contains metadata.

Sometimes the metadata is the description of the data, which may be much larger than the data itself. So if 20 years ago a few bytes of metadata could actually manage terabytes of data, nowadays you may have megabytes of metadata or kilobytes of metadata managing a few bytes of data. So the whole structure of data has changed. What the industry overlooked is the technologies to manage that data. Now, if you look at data storage, file system, object, all these platforms, they can store huge amounts of data. There is no problem. There is no scalability problem. Look at Amazon S3. It really stores exabytes of data with no problem. But what about the metadata? The metadata now is growing much bigger than the data itself. The key thing about the relationship between metadata and data is that the information about the data, where it’s located, when it was generated, and the things you need to know about the data in order to access it, is there in the metadata.

So if you want to access any data, you really need to access the metadata first, and you need to do it very, very fast. Now, if there are no adequate tools to manage the metadata and it takes you more time to access the metadata, imagine what happens to the data itself. You’ll need another hop that will take you much more time and the user will suffer. So let’s talk about Facebook. They are actually the ones who developed RocksDB, which is the most prevalent storage engine in the market today, with a new technology called LSM trees. Their data was mainly pictures. So RocksDB was actually designed to manage small metadata for large data. When Facebook scaled and their data changed as well, what they did is they divided the problem into many small pieces. So now they make sure that the amount of metadata managed by the storage engine is not more than a couple of tens of gigabytes.

They shard the problem. Now, Facebook has endless resources. That’s not a problem. But normal customers who actually use RocksDB cannot afford thousands and tens of thousands of shards. So they need to store the data in a relatively concentrated place. They suffer because the metadata now is very, very big. It’s not divided into many pieces, and accessing the data now goes through the metadata, which takes time, and they suffer. So they need to pay for more resources, things simply take longer, and their users suffer. That’s their biggest problem.
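The sharding approach described above can be sketched in a few lines of Python. This is a generic hash-sharding illustration under stated assumptions, not Facebook's actual setup: the `shard_for` helper, the `ShardedIndex` class, and the choice of SHA-256 for routing are hypothetical and picked only to keep the example deterministic. Each shard holds its own small metadata index, so no single index has to grow beyond what fits in memory.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard with a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedIndex:
    """Metadata split across several independent, smaller indexes."""

    def __init__(self, num_shards: int):
        # One dictionary per shard; in a real deployment each shard would be
        # a separate storage engine instance, often on its own machine.
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, pointer: tuple) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = pointer

    def get(self, key: str) -> tuple:
        return self.shards[shard_for(key, len(self.shards))][key]

    def largest_shard(self) -> int:
        # Number of entries in the biggest shard: the figure that has to
        # stay small enough to fit in memory.
        return max(len(shard) for shard in self.shards)
```

The trade-off is the one raised in the interview: keeping each shard small means adding more shards as the data grows, and every shard costs hardware and operational effort that smaller organizations may not be able to afford.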

Swapnil Bhartiya: Once again, thanks for explaining that in detail. Now, there are a couple of things that I want to talk about. Number one, as you said, Facebook has all the resources, but other companies don’t. I would assume that any company which creates any data, through sensors or whatever it is, is the customer. That’s number one. Number two is, when we look at teams within organizations, whose problem is it to deal with metadata? These days we look at unicorn developers, and the silos are breaking down, though the fact is that there are still silos. So let’s talk about these two aspects. First of all, what kind of companies deal with it, and within those companies, what kind of teams deal with metadata? Whose problem is it?

Adi Gelvan: Yeah. So it’s like asking who is suffering from air pollution. Everybody. Everything around us is data. I don’t think I know any technology company that doesn’t rely on data. Data is the new oil. If you are able to store and analyze data and get insights, you are stronger than you were before. You can actually make smart business decisions. When you think about the leading companies in the world, their advantage over anyone else is the ability to analyze data, to know where the users are, to know what their preferences are, and to know what they do in order to offer them better service. Now, the key to this thing is metadata. If you want to analyze fast, you need to access the data very, very fast and analyze it very fast. You cannot do it if you don’t access the metadata very fast.

So the first access to any data in any system out there is to the metadata. If the metadata is very small and it resides in the memory, great. But the data is exploding. It’s growing faster and faster, and most of the systems cannot keep the data in the memory. Customers do not want to keep all the data in the memory because memory is very, very expensive. So they need new technologies to manage the metadata efficiently. That’s what we are trying to build. If you look at the market, the storage engine market is dominated by huge companies like Google, Facebook, Apple, Oracle. We want to build the next generation storage engine that will actually allow any company out there to scale their data and manage the metadata efficiently without paying a lot of money.

Swapnil Bhartiya: Excellent. Once again, thank you. But the second part of the question was: within teams, within an organization, whose problem does it become to deal with the data? Because not every company, as you said, can afford to have data scientists or those kinds of geniuses on their team. So, whose problem is that? Because that will lead to the second point, which is, how do you actually help them?

Adi Gelvan: So the guys who are building the data systems, data architects, system architects, developers who are developing applications that need to manage data, they need to embed a storage engine that will do it right. So anyone who has access to the architecture of the application or of the infrastructure, DBAs, DevOps, almost everyone who is building or working with a data system, actually depends on it. The ones who are suffering are eventually the users. So every database out there is working with a certain storage engine. That mostly is the bottleneck. It’s amazing, but in the hugest database you can think of, if the data engine, which is a small thin layer inside it, doesn’t work well, it’ll create the worst performance problems. Now, how do we help them? If you are a user, you are suffering from a problem that we are trying to solve.

We’re trying to get those programmers, those architects, those application vendors, and database vendors to use our storage engine, and to allow them to design smarter, bigger systems that will work faster and more efficiently. So we are trying to build something that will work faster, will scale better, and will do it with fewer resources. We think that we’re on the right path. We’ve built a great technology here that we think is able to change the way metadata is being managed in the world. Our next big thing is that we are going open source very soon. We realize that if we share our technology with the users and the developers, and build a huge community around it, it will give us direct conversations and a path to the developers and enable them to work with us to build smarter systems. So open source, this is our next big thing.

We really hope we can be the de facto next generation storage engine. We will have an open source version that is going to be under the Apache license, totally permissive. We are going to have an enterprise commercial version that will enable much more for the production customers. So this is actually an open core model where the open is open and valid and hopefully will be used by all the developers in the testing and environments. When you want to go into production, in our commercial version we’ll have more tools that will enable you to go into production, support services, and much more scalability for your systems. We started with the most prevalent storage engine in the market, which is RocksDB. Today, we provide all RocksDB users a drop-in replacement for RocksDB. If they change, it’s 30 minutes, sorry, 30 seconds of work: a simple drop-in replacement, not a single line of code changed. The magic happens after you change. So all RocksDB customers, we’re here for you.

Swapnil Bhartiya: Excellent. Adi, thank you so much for taking time out today, not only to talk about the work that you folks are doing, but also to talk about the larger problem that is there. I would love to have you back on the show, as I said, whenever the open source story is there. I really appreciate your time. Thank you.

Adi Gelvan: Happy to come again. Thanks for having me. It was a pleasure.

[/expander_maker]
