
StarRocks Helps Simplify Data Pipelines For Real-Time Analytics | Li Kang


StarRocks was founded in 2020 with the aim of building a world-class analytical database that could support both real-time and batch-based analytics workloads. Since its creation, about 500 companies have started using its products, and roughly 110 to 120 of them are enterprise customers.

Latency is a growing concern: companies want a shorter delay between an event happening and the moment they can use that event to make the next business decision. Information that used to be updated daily is now expected instantaneously. StarRocks combines the latest events with historical data to enable better decision making, and its real-time query engine can support over ten thousand queries per second.

In this episode of TFiR Let’s Talk, Swapnil Bhartiya sits down with Li Kang, VP of Strategy at StarRocks, to discuss the company and the problem it is solving. Kang takes a deep dive into the use cases and the trends the company is seeing with cloud and the growth of data. He also discusses competitors and what makes StarRocks stand out.

Key highlights from this video interview are:

  • Kang shares two of their use cases: Airbnb and a social media company. He explains that StarRocks is providing the data service layer in the latest implementation of Airbnb’s Minerva project and is also providing the advertisement data platform for the social media company. He discusses the problems they were facing and how StarRocks helps.
  • The pandemic was a catalyst for companies moving to the cloud. StarRocks was created during the pandemic and has seen the landscape change during that time. Kang feels that companies are managing to support business in the cloud more effectively, and this is driving the rise in analytics applications like StarRocks.
  • The growth of data is presenting new challenges that call for new capabilities and engineering breakthroughs. A key challenge for analytics applications is that they have traditionally handled data in batch mode, which adds latency and makes it hard to process updates and deletes effectively. Kang explains how StarRocks is changing that.
  • Kang discusses the competitors in the data analytics space such as ClickHouse, Apache Druid, and Apache Pinot. He explains how a breakthrough in query engine, query planning, and query optimization sets them apart from their competitors. He goes into detail about why their query performance is better and the benefits they have seen.
  • Open source lies at the heart of StarRocks and they believe it is critical to their success since it is easier for people to test it out, ask questions, and contribute. Kang explains the community they have built and how it is helping not just StarRocks but also the analytics software ecosystem as a whole.

Connect with Li Kang (LinkedIn)
Learn more about StarRocks (Twitter)

The summary of the show is written by Emily Nicholls.


Here is the automated and unedited transcript of the recording. Please note that the transcript has not been edited or reviewed. 

Swapnil Bhartiya: Hi, this is your host Swapnil Bhartiya, and welcome to another episode of TFiR Let’s Talk. And today we have with us, once again, Li Kang, but this time VP of Strategy at StarRocks. Li, it’s great to have you back on the show.

Li Kang: Good to see you again, Swap.

Swapnil Bhartiya: Yeah.

Li Kang: Thanks for having me.

Swapnil Bhartiya: It’s my pleasure, and since you joined StarRocks, and this is the first time I’m talking to StarRocks. So first of all, tell us quickly about the company. What do you folks do?

Li Kang: Sure. The founding members of StarRocks were on the Doris project before, and they had the vision of bringing Doris to the next level to make a world class analytical database. So in 2020 they started StarRocks. And today, the majority of the StarRocks code, actually more than 80% of it, is newly developed by the StarRocks team. So the goal is to build a true world class analytical database that can support real time analytics, as well as batch based analytics workloads. The project started in May 2020. And in a little bit over two years, we have gained about 500 companies using our products, and about 110 to 120 of them are enterprise customers. Some of our large flagship customers include Airbnb, Tencent, and Lenovo, the large, data driven internet companies in different industries.

Swapnil Bhartiya: You did touch upon some of the users and use cases, and those users are incredible. But I do also want to understand, if you can just talk about, either Apache Doris or database or from the StarRocks perspective, who are your users and what are the specific use cases where the project is being used?

Li Kang: Okay. So let me share a couple of use cases with you. One is a company called Airbnb. I’m sure you’re familiar with that. They built this internal Minerva project, which is an enterprise wide metric store, right? So that’s where they build this metric store layer to enable the business side to run the analytics workload without having to understand the complexity of the underlying technical data storage and access methods.

And they selected StarRocks as the driving data service layer in the latest implementation of the Minerva project. And the reason for that is they need a lot of real time integration. They were using a different product, also an open source product from an Apache project, but that project couldn’t meet their requirements in terms of flexibility and in terms of the total cost of ownership.

So they selected StarRocks because we can handle real time analytics and at the same time greatly simplify the data pipeline, reduce the cost of the overall infrastructure, and reduce the cost of developing and maintaining the data pipeline. So they actually gave a speech at the Data + AI Summit earlier this week. And that recording is also available online.

And another use case is a social media company. They have over 200 million active users on a monthly basis, and they started their journey on the big data platform, with Hive and Presto as the analytics engine. And soon they realized they couldn’t support the real time needs of their customers. And because it’s a social media company, they need to be able to make recommendations about advertisement, about product promotions and all these things. They need to do all these things in real time.

So to do a real time implementation they went with another OLAP database at that time in 2019. That was because that product, also an open source OLAP product, was recognized as having the best OLAP performance at that time. But soon they realized they couldn’t support the number of concurrent queries. The reason being that they need to support thousands or tens of thousands of merchants that are doing business on their platform. So that translates to ten thousand or more queries per second, right? So to support that kind of query load, their technology at that time just couldn’t meet that requirement.

So in 2020, 2021, they started evaluating StarRocks. And eventually in 2021, they selected StarRocks as their advertisement data platform, because we can provide the best real time query performance, and we support over ten thousand queries per second that are constantly hitting that platform 24/7. So I think those are two very typical use cases, where you need to simplify your data pipeline but still provide the best query performance and support a large number of concurrent users. Those are probably the very typical use cases for us.

Swapnil Bhartiya: Excellent. Thanks for sharing these two big use cases. Now if I ask, since the company came into existence, first of all the interesting thing is, in today’s world two years can be seen a large- It’s like almost a century or two years is like literally nothing, but because so many things change ever since the company was launched. So can you also talk about how the whole landscape has changed? Because I think the company was created just when the pandemic hit and everybody started to rush to Cloud Native and Cloud, which also means that there was a spike in user and use cases and everything is data driven in today’s world. So talk about how things have changed ever since the company was created, not only in terms of business needs, but also the volume of data that we are creating.

Li Kang: Yeah. The Cloud, right? All these businesses are moving to the Cloud and all the companies have been continually investing into this online or digital availability, right? So even in the last two or three years during the pandemic period, you saw that the internet companies or businesses have been booming. And from the technology standpoint, being able to support business in the Cloud more effectively and efficiently, and helping them to make decisions in real time or near real time, can provide them a competitive advantage because now everybody is doing business online.

So everybody has all this transaction information and user behavior information available, but how do you make use of that information to drive your business decisions? The freshness of the data and the timeliness of that decision making are becoming more and more critical for companies to gain a competitive advantage.

And that’s what we are seeing with our clients. They’re asking for a shorter delay between the time the event is happening and the time that we can use that event to help make the next business decision. That latency, people want it reduced from what used to be daily, maybe, to instantaneous, right? That’s why you are seeing the rise of analytics applications. The idea is you are combining the latest events happening in your business with all the experience you have collected in your historical data, combining them together, to make the most educated, the most effective business decision. So I think that’s happening across many different industries. And that’s why we are seeing StarRocks being adopted by more and more customers these days.

Swapnil Bhartiya: Excellent. Can you also kind of talk about, with this growth, with this adoption, not just the growth of Cloud Native, the adoption of Cloud, but also the growth of amount of data that we are creating, because as you’re talking, every company is collecting or creating some kind of data. So the volume of data is also growing. So can you also talk about what kind of new challenges that you are seeing are there now, versus when the company was created, because that also shows how the company itself is evolving to address that need and demand?

Li Kang: Right. So for example, we are continuously adding new capabilities or creating new engineering breakthroughs. The challenge that has been facing the OLAP analytics world is, we’re used to dealing with data or dealing with this information in batch mode, right? Like we analyze it on a daily basis, or even sometimes every multiple days, right? People don’t like that kind of latency anymore. But the challenge is, you have the incoming stream of data from your IoT devices, from your wearable devices, from your web events, and things are changing, right? These are more transaction based. You may have an order being placed, but then you may have the order being updated and the order being returned. How’s that going to affect the next recommendation to the consumer?

So that used to be a challenge for analytics applications, because all the analytical applications are good with new data, but have not been able to process changes very effectively or efficiently. At StarRocks we spend a lot of time on different technologies, with different engineering breakthroughs, to make sure we can handle those changes and deletes effectively. And we can reflect that in your analytics decision making process immediately. That is a breakthrough from StarRocks and that has helped our customers. For example, those in the healthcare monitoring industry are getting these updates all the time and they want to make sure that information is reflected on the user side immediately. And we were able to help them reach that goal.
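
The order example Kang gives (placed, then updated, then returned) can be sketched in a few lines of Python. This is only an illustration of why append-only processing gives the wrong answer once changes and deletes arrive; it is not how StarRocks implements change handling, and `OrderEvent`, `append_only_revenue`, and `current_state_revenue` are hypothetical names invented for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderEvent:
    """Hypothetical change event for one order."""
    order_id: int
    kind: str          # "placed", "updated", or "returned"
    amount: float = 0.0


def append_only_revenue(events):
    """Naive view: every event that carries an amount is treated as a new row."""
    return sum(e.amount for e in events if e.kind != "returned")


def current_state_revenue(events):
    """Keyed view: keep the latest amount per order_id and drop returned orders."""
    latest = {}
    for e in events:
        if e.kind == "returned":
            latest.pop(e.order_id, None)    # delete the order
        else:
            latest[e.order_id] = e.amount   # insert or update (upsert)
    return sum(latest.values())


events = [
    OrderEvent(1, "placed", 100.0),
    OrderEvent(2, "placed", 40.0),
    OrderEvent(1, "updated", 80.0),   # order 1 is changed after being placed
    OrderEvent(2, "returned"),        # order 2 is returned
]

naive = append_only_revenue(events)     # counts order 1 twice and keeps order 2
actual = current_state_revenue(events)  # reflects only the current state
print(naive, actual)
```

The naive total double counts the updated order and still includes the returned one, while the keyed version reflects only the current state of each order, which is the behavior an analytics layer needs if recommendations are to follow updates and returns immediately.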

Swapnil Bhartiya: Excellent. Now earlier as you’re talking about Airbnb, other folks who were kind of adopting your solutions, I’m also kind of curious if you can talk about who are other players in the space, who you guys directly compete with and what edge you have over others that, these players, these vendors choose you over them?

Li Kang: Sure, yeah. Real time analytics is still new. Not new, but still growing, right? Relatively young I would say. Some people don’t even realize the benefit of real time analytics, and other companies have realized it, but have been struggling because of the technology limits. There are other players in this field as well, such as ClickHouse, Apache Druid, Apache Pinot. Those are all great technologies. They’re all open source products. And we believe that we can work together and grow this business together. In terms of our advantage I would say, number one is we had this breakthrough in the query engine, query planning, and query optimization. So we address a key challenge in the OLAP world, which is well known by all the data engineers or infrastructure engineers. It’s called the de-normalized wide table problem.

The idea is when you have data stored in the relational database, in multiple tables, and you query them, it used to be very difficult to get good query performance on those multi table joins when the data volume is large and when the queries are complicated analytical queries, where you need to run a lot of aggregations and computation. So pretty much all the other products I mentioned took this approach, which is called the de-normalized wide table. Basically, instead of querying against the joined tables directly, the first step is to bring them into one table where you have all the columns flattened into this one table. Now you are breaking the normalization rule. You’re breaking the relational constraints in the RDBMS model. But what you get is the query performance on that wide de-normalized table.

And ClickHouse, Druid, and Pinot all took that approach. And our breakthrough is, through our optimization, our query engine, our parallel execution leveraging the modern hardware structure, we are able to query the multi table joins much more efficiently. So our query performance is much, much better when it comes to multi table join workloads. And because you don’t have to build a de-normalized table in probably 80% of the scenarios, that greatly simplifies the data pipeline. And it simplifies the effort to build the whole analytics platform. And that’s the main reason Airbnb selected us in the new version of their Minerva project, because before us, they had to build many, many de-normalized tables. And that data pipeline was very hard to maintain. And the delay in populating those tables was very long.

So now with StarRocks, they reduced 80% of those data pipelines, reduced 80% of those de-normalized tables. And that makes it much easier and much cheaper to maintain and to run that platform, with better query performance, and much faster to respond to any business changes. I would say that probably is the number one differentiator, and that fundamentally changes how you do real time analytics. From there we also added other capabilities. For example, our materialized view is much smarter, much more advanced. We can update that with your real time data stream. Our resource management can protect the platform from any badly planned queries. Those queries won’t take down the whole cluster. There are a lot of other enhancements as well, but fundamentally we change the way you do real time analytics by eliminating the need for de-normalized wide tables. I think that will definitely resonate with many data engineers and data infrastructure engineers.
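
The trade-off Kang describes between querying joined tables directly and maintaining a de-normalized wide table can be sketched with Python’s built-in `sqlite3` module. SQLite is used here only as a stand-in and says nothing about StarRocks performance; the `customers`, `orders`, and `orders_wide` tables and the `rebuild_wide_table` helper are hypothetical names made up for the illustration.

```python
import sqlite3

# In-memory SQLite stands in for an analytical database; this is NOT StarRocks.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);
""")

# Option A: query the normalized tables directly with a join.
JOIN_QUERY = """
    SELECT c.region, SUM(o.amount)
    FROM orders AS o JOIN customers AS c ON o.customer_id = c.customer_id
    GROUP BY c.region ORDER BY c.region
"""

# Option B: query a pre-built wide table that already contains every column.
WIDE_QUERY = """
    SELECT region, SUM(amount) FROM orders_wide
    GROUP BY region ORDER BY region
"""


def rebuild_wide_table(connection):
    """Pipeline step: flatten both tables into one de-normalized wide table."""
    connection.executescript("""
        DROP TABLE IF EXISTS orders_wide;
        CREATE TABLE orders_wide AS
        SELECT o.order_id, o.amount, c.customer_id, c.region
        FROM orders AS o JOIN customers AS c ON o.customer_id = c.customer_id;
    """)


rebuild_wide_table(conn)
join_result = conn.execute(JOIN_QUERY).fetchall()
wide_result = conn.execute(WIDE_QUERY).fetchall()

# A dimension change: customer 2 moves to another region.
conn.execute("UPDATE customers SET region = 'EMEA' WHERE customer_id = 2")
join_after = conn.execute(JOIN_QUERY).fetchall()   # sees the change at once
stale_wide = conn.execute(WIDE_QUERY).fetchall()   # stale until rebuilt

print(join_result, wide_result)
print(join_after, stale_wide)
```

Both options return the same totals right after the wide table is built, but once a source table changes, the wide table stays stale until the pipeline step runs again. That rebuild step, multiplied across many wide tables, is the maintenance cost and delay the interview refers to.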

Swapnil Bhartiya: Can you also talk about how important is Open Source for StarRocks? And if you can also mention how you’re involved with the project, your contributions there?

Li Kang: Yeah. Open source is definitely very, very important, critical to our success. We truly believe the new technology should be open sourced. And the reason is, one, it’s easier for people to evaluate, to test out. But I think more important is being open. Have your code available on GitHub so people can examine your code, can even contribute to your code. This is fundamental for a high quality software product. Everybody can look at your source code, can criticize it, can ask a question, and if they have great ideas, they can even contribute back to the code.

So we have a community of developers from different companies who are helping us develop new features, fix bugs, and improve the products. We truly work in the community sense. And we are also contributing back to other projects. We use other open source projects as tooling for software development and even as some fundamental building blocks. So I think having a very healthy open source community, and even an ecosystem of projects that support each other and grow together, is critical for the success of not just StarRocks, but also the overall analytics software ecosystem.

Swapnil Bhartiya: Right. It’s like a positive sum or win-win game. Everybody wins in this game of open source. So there’s nothing better than that.

Li Kang: Yeah. Very well said. Yeah, absolutely.

Swapnil Bhartiya: I think I have everything. Is there anything else that you think we should have talked about it or- I mean, we know about the company. We know about the project. We know about your users. We know about the use cases. We know why people use it. We also talk about your Open Source angle. So we talked about the challenges there. So anything else, or you think we can wrap it up?

Li Kang: Let’s just end on that note of Open Source. We welcome all the developers. Visit us on GitHub. Join our Slack channel. Make contributions and ask questions. We’ll be running more community events soon, and we’ll send out invites, and we welcome all the developers. Help us grow this project together.

Swapnil Bhartiya: Li, thank you so much for taking time out today to talk not only about the project and the company, but also about the larger problem that you’re solving with real time. As you said, this is not a- It’s kind of a solved problem, but the challenges are evolving, and so is the need for new solutions. So thanks for sharing those. And as usual, I would love to have you back on the show. And as you said, you’ll be involved with more community events and you’ll be organizing things. So I also hope to see each other more often and would love to collaborate with you folks, too. Thank you.

Li Kang: Thank you for having me. Have a good day as well.