Iterative recently announced DataChain, a new open-source tool that provides a framework for streamlining the management and processing of unstructured data, making it more usable for AI models. In this show, Dmitry Petrov, Co-Founder & CEO of Iterative, talks about the challenges of managing unstructured data in AI development and how DataChain helps bridge the gaps left by current tooling on the market.
Petrov goes on to discuss his predictions for AI and gives us a glimpse of the future developments he sees for DataChain. He says, “First, we need to solve the pain points of AI/ML engineers, and then we can get higher in the stack and solve problems of the businesses.”
How DataChain is helping developers extract value from unstructured data
- DataChain aids teams and developers in leveraging AI models and extracting value from unstructured data. Petrov elaborates on the main challenge engineers face: working with raw data such as PDFs and videos, which involves time-consuming processes to extract and prepare high-quality data for use in AI models.
- Petrov describes Iterative’s development of tools for machine learning and AI engineers, starting with Project DVC for data version control. While DVC was effective for managing unstructured data, the need for tools to curate high-quality datasets became evident, leading to the creation of DataChain.
- DataChain is a framework for data curation in AI, designed to handle large volumes of unstructured data. It integrates with LLMs to improve the efficiency of data processing and preparation for AI tasks.
- Petrov talks about the impact and adoption of generative AI (GenAI). He reflects on how AI has evolved, becoming more accessible and understandable to the general public over the past few years.
What are the challenges in working with unstructured data and the reasons behind open sourcing DataChain?
- Advanced teams focus intensely on preparing and curating data to achieve better model performance, and DataChain supports this by providing effective tools for data management.
- Petrov explains the challenges associated with managing unstructured data compared to structured data. He describes the limitations of current tools for handling raw data like images and PDFs and how DataChain addresses these issues.
- Petrov explains the decision to open-source DataChain and its implications for the industry. He believes that open-sourcing the tool reflects Iterative’s commitment to the open-source community and makes it accessible to a broader audience.
Predictions for AI and the anticipated developments for DataChain
- Petrov predicts that AI will follow a trajectory similar to earlier advancements in machine learning. He expects that after the initial excitement and breakthroughs, the focus will shift to engineering and practical applications.
- Petrov believes that this transition will lead to more refined and reliable AI solutions, benefiting society through improved tools and applications.
- Petrov gives us a glimpse into the future of DataChain, emphasizing that the current phase is just the beginning. He expects significant developments in managing unstructured data and enhancing AI infrastructure.
Guest: Dmitry Petrov (LinkedIn)
Company: Iterative (Twitter)
Show: Let’s Talk
This summary was written by Emily Nicholls.





