The Open Source Initiative (OSI) is driving a multi-stakeholder process to define “Open Source AI”. The goal is to establish a shared set of principles that foster permissionless, pragmatic, and simplified collaboration for AI practitioners, similar to what the Open Source Definition (OSD) has done.
In this episode of TFiR: Let’s Talk About AI recorded at the 2023 Open Source Summit in Bilbao, Spain, OSI Executive Director Stefano Maffulli shares his insights on the current state of affairs with regard to generative AI and how the OSI is helping define and steer the direction of open source AI.
On the Open Source Definition:
- The OSD came out 25 years ago. Over those years, software consumption, distribution, and execution have changed.
- The OSD was designed to work within copyright law, a relatively uniform body of law that looks similar worldwide.
- With AI systems, there are data dependencies. Data is covered by different laws (copyright, privacy, labor, contract, terms of service, etc.) and comprises many different pieces, which makes the whole scenario completely different.
On Open Source AI:
- One of the main drivers for this research on what constitutes open source AI is the current confusion in the market.
- The set of components that go into an AI system is completely different from what goes into an operating system. Maffulli states that the safeguards and values that have enabled fast evolution and innovation of computer science need to be reflected in AI.
- The work to understand Open Source AI is to provide more knowledge for the community to influence policies. It’s not the job of the open source movement to write licenses.
- The terms of service of any entity should not be allowed to say, “Come to my community/website, contribute your knowledge for free, and then I’m going to be exclusively capable of signing a deal to resell that information, that knowledge that you have created for free to a third party.”
- Policies have to be in place to defend collective creations, instead of making it proprietary and handing it to someone else.
- At the Birds of a Feather conversations, the OSI is tackling crucial questions, such as: What is the dependency on the original training dataset? Is the original dataset the equivalent of source code? Can the model itself be the preferred form for making modifications to an AI system?
- The complexity of the conversations is daunting, but they're making good progress on the principles. The basic principles that they want to see reflected in an open source definition for AI are self-sovereignty over data and code, as well as the ability to innovate without fear of retribution or of having to ask for permission at every single step.
- The implementation phase is going to be more challenging, because, as a community of open source developers, they haven't really paid much attention to data. There was a very clear separation between the code you wrote and the compiler you used, what licenses they had, the developer environment, and where copyright starts and ends. With data, it gets a little more confusing.
Current state of the market, including user and vendor ecosystem with regards to generative AI:
- There is a lot of confusion, hidden agendas, and fear, in general.
- There is a rush to deliver legislation and regulation around generative AI systems.
- It’s not very clear what can be done, because there are technical limitations today. For example, if a model has been trained and it contains private information, there is no easy way to remove the private information without retraining the whole system. Training these systems consumes megawatt-hours of energy, so it is very costly. Are there other technical solutions in place?
- There is a strange convergence between people (content creators, artists, comic designers, etc.) who are upset about the privatization of their content and people who have released software under the GNU General Public License (GPL).
- There is also the push to increase the reach of copyright to say, “You cannot use my blog post, pictures, and content without my permission,” which Maffulli says can lead to a very nasty side effect.
- The side effect that Maffulli sees is that only large corporations like Meta, Google, Microsoft, and Amazon, or large governments or bad actors, will be able to enter into commercial agreements with the larger aggregators of content, such as Reddit or Getty Images. They will be the only ones able to assemble large datasets, and therefore train large models, because they have the power to do so.
- Smaller groups like EleutherAI will not be able to do it, or they will be sued out of existence. And because they disclose all of their sources, they expose themselves to takedown requests, which are starting to appear.
- AI developers are hyperaware of the impact of the software or systems they’re developing and seem to be afraid of their capabilities. They preemptively try to write licensing documents and agreements that contain disclaimers, e.g., “I’m releasing this, I need to do it because I’m a scientist, I need to publish my knowledge and the research that I’m doing. At the same time, I’m aware that this thing could put me in dangerous positions, e.g., if it gets out of control and provokes a massive spam campaign or something like that.”
This summary was written by Camille Gregory.