
CelerData Enables Data Engineers To Build New Analytics Projects Faster | Sida Shen

Guest: Sida Shen (LinkedIn)
Company: CelerData (Twitter)
Show: Let’s Talk

CelerData Cloud is an analytics service based on StarRocks, an open-source OLAP database and query engine. With accelerated query performance and pipeline-free data analytics, users can develop new analytics projects and go into production faster.

In this episode of TFiR: Let’s Talk, CelerData Product Marketing Manager Sida Shen shares his insights on the data analytics space and how their product is helping companies increase query performance and lower operating costs.

Current pain points:

  • Most of the engines available today still rely on outdated technology, or they are optimized for ETL (extract, transform, and load) workloads. They are not ideal for data warehouse-like low-latency queries. This forces users to replicate their data into a proprietary data warehouse purely for fast query processing. While this approach addresses the performance issue, it introduces the unnecessary cost of maintaining separate systems and copying the data.
  • Multi-table joins are expensive, and optimizing them is a big challenge. This is especially true in real-time analytics, because most real-time OLAP databases struggle to perform joins at scale. They force users to implement a denormalization pipeline, which is essentially pre-computation: joining multiple tables into one big flat table beforehand so that the database doesn’t have to handle the join at query time.
  • Doing pre-computation 1) is extremely inefficient in terms of storage and compute, 2) adds complexity due to the need for specific technologies, e.g., Flink and other stateful stream processing tools, to meet the strict freshness demands of real-time analytics, and 3) makes the system rigid. Any business change upstream that causes a schema change on the original table requires a reconfiguration of the denormalization pipeline, as well as backfilling all of the related data.
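The trade-off described above can be sketched with a toy example. This is a minimal illustration using Python's built-in sqlite3 module, not StarRocks or CelerData; the table and column names are hypothetical. It contrasts joining at query time with pre-joining into a flat table (denormalization), and notes why the flat table is brittle under schema change.

```python
import sqlite3

# Hypothetical toy schema: two normalized tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 25.0), (12, 1, 10.0);
""")

# Option A: join at query time. One copy of the data, flexible to
# upstream schema changes, but the engine must execute the join.
rows = con.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)  # [('APAC', 25.0), ('EMEA', 109.0)]

# Option B: denormalization (pre-computation). The join is done once,
# up front, so queries never join -- at the cost of a second copy of
# the data that must be kept fresh.
con.execute("""
    CREATE TABLE orders_flat AS
    SELECT o.id, o.amount, c.region
    FROM orders o JOIN customers c ON o.customer_id = c.id
""")
# Any upstream schema change (e.g. adding a column to customers that
# queries need) now forces a rebuild and backfill of orders_flat.
```

An engine that can execute joins fast at scale lets Option A cover most workloads, which is the argument made above.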

On data warehouses and data lakes:

  • The industry has relied heavily on data pipelines, but they often introduce more costs, complexity, and governance issues.
  • Previously, many workloads had to be extracted from the data lake into a high-performance proprietary data warehouse, going from an open format to a closed one and duplicating the data.
  • Data lakes are cost-efficient and scalable, a place to dump all of your data. On their own, though, they are not ideal for performance-demanding scenarios.
  • Recently, open data lake table formats such as Apache Hudi and Apache Iceberg have emerged with data warehouse-like features, including indexing and transactional properties.
  • With newer technologies and modern query engines, you can get data warehouse-like performance all directly on the data lake. You don’t have to go through that expensive data ingestion/data copying process.

On the ease of implementing data analytics now:

  • Most analytics solutions are on the cloud now.
  • In the cloud, everything’s elastic. With the pay-as-you-go cost structure of the public clouds, you can actually get started with just a few hundred bucks.
  • Newer technologies can definitely simplify the data pipeline. There’s less stuff you have to build from POC to production.
  • For data lake analytics, data lakes can do a lot more paired with modern query engines, such as StarRocks.
  • You can actually run your very demanding workloads on the data lake without actually moving your data into another data warehouse.

On CelerData Cloud and its pipeline-free data analytics:

  • CelerData Cloud is built to handle low-latency, complex OLAP queries at scale.
  • The main purpose is to deliver extreme query performance to let users have data warehouse performance on the data lake.
  • Users can run high concurrency, low latency OLAP workloads, e.g., customer-facing analytics, directly on the data lake without moving the data.
  • Data is handled by StarRocks. No more data copy. No more data ingestion. There is only one copy of the data, which is great for data governance.
  • Denormalization is only done on demand for extreme cases, instead of by default for every single table. That decreases overhead costs and keeps analytics flexible to business changes.
  • With CelerData’s “pipeline-free” data analytics architecture, users don’t have to go through the painful process of developing unnecessary data pipelines.

On AI and generative AI:

  • For generative AI, the vector database has been the hot topic of the past year. People use a vector database as long-term memory for large language models. The CelerData user community has made a lot of contributions to make StarRocks work as a vector database.

On the importance of having an open-source community:

  • Having a community with thousands of active users, CelerData gets first-hand insights faster than everybody else.
  • CelerData can release something and there is user feedback 4 hours later.
  • All of the features they release are tested with their seed users before being GA’d. The community makes that happen.

On “good technology”:

  • It should not require a lot of labor.
  • It should be easy to use.
  • It should simplify the pipeline and your process.
  • If you’re doing analytics, what is your team’s labor going toward: are they busy ingesting data into a data warehouse, instead of querying the data on the data lake? Or are they unnecessarily building denormalization pipelines or other types of pipelines?
  • Is there any technology you can replace with newer ones so you don’t have to build those data pipelines that are expensive and labor intensive?

This summary was written by Camille Gregory.