Getty Images

Tip

The importance and limitations of open source AI models

Despite rising interest in more transparent, accessible AI, a scarcity of public training data and compute infrastructure presents significant hurdles for open source AI projects.

Chris Tozzi

By

Chris Tozzi

Published: 07 Feb 2024

Because AI models differ in some key respects from other types of software, applying open source concepts and practices to AI models isn't always a straightforward task. It's one thing to create open source code; it's another to actually train and deploy an AI model.

Some of the most popular AI engines, such as the GPT model series underlying OpenAI's ChatGPT, have recently come under fire for being proprietary technology that operates as a black box. Amid these debates and the costs associated with proprietary AI, there is growing interest in open source AI models, which promise greater transparency and flexibility.

What is an open source AI model?

Anyone can freely access, modify and share the source code for open source software. In contrast, the source code for closed-source software -- also known as proprietary software -- is private, and only the original creators can alter and distribute it.

Similarly, an open source AI model is one whose source code is publicly available, meaning that anyone can download, view and modify the raw code that powers the model's algorithms. This accessibility ensures a level of transparency and customizability that closed-source models lack.

Many of today's best-known generative AI tools, including ChatGPT and Midjourney, are closed source, but there is also a growing open source generative AI ecosystem. Popular open source large language models available today include Meta's Llama models and models from French startup Mistral AI.

Challenges in developing open source AI models

The source code that powers an AI model is similar in many ways to the source code for any other algorithm or application. AI models can be built in any programming language, using the same development processes as for standard applications. In this sense, creating an open source AI model is straightforward.

Where matters get tricky for open source AI developers, however, is the model training stage. Training AI models requires access to vast amounts of data, and few open source AI companies or projects release that data publicly.

For example, in response to a question on a Hugging Face forum about training data for the Mistral 7B LLM, Mistral AI stated, "Unfortunately we're unable to share details about the training and the datasets (extracted from the open Web) due to the highly competitive nature of the field." In other words, the company releases its models' code, but not the data used to train those models.

This lack of publicly available training data for open source AI models creates two main challenges:

Developers seeking to modify and retrain a model might struggle to do so. Although changing the model's source code is relatively straightforward, lack of access to the training data prevents developers from retraining their altered model on the original data set.
Lack of transparency about a model's training data can obscure the model's operations, even when the model's source code is viewable. Because training data plays a major role in the inferences a model generates, open sourcing a model's code without also sharing the training data only provides partial transparency into the model's inner workings.

These challenges are unique to open source AI projects, as other types of applications don't require training. The need for training, coupled with the difficulty of finding publicly available open source data, is a significant technical hurdle that sets open source AI development apart from other areas of software.

The role of fine-tuning

Many open source models can be fine-tuned, which can alter how they operate. This usually involves modifying parameters and weights without fully retraining the model. Although fine-tuning offers some opportunity for customization, it's different from modifying the core code of the model itself or having complete control over the training data.

Where to get training data for open source AI

The fact that open source AI companies and projects rarely release their data sets doesn't mean that developers seeking to build, or build upon, open source AI models are totally out of luck. Those looking to develop or enhance their own AI models can acquire their own training data and retrain the model as needed.

One approach is simply to use the vast volumes of content available on the web for training purposes. This could lead to copyright challenges, however, as evidenced by recent lawsuits from content owners such as The New York Times alleging that AI developers trained on their data without permission. Plus, scraping and preparing extensive web data for training purposes is tremendously challenging; there is a massive amount of data to deduplicate, for example.

Another strategy is to use open data sets, such as those listed in this GitHub reference, which are licensed for training purposes. The drawback is that this data might not be sufficient in volume or scope for training, depending on the model's intended purpose.

The compute issue

Even when open source AI developers can obtain training data for their models, the actual training process requires enormous amounts of compute. Those resources cost money -- something that most open source projects don't have in abundance, if at all.

Some open source AI companies, including Mistral, have managed to address this problem by raising funding, which they use in part to pay for model training. But this approach might not be feasible for smaller-scale open source projects or companies that want to use open source AI, but can't justify investing large sums of cash in model training.

Comparison of the differences between open source and closed source, or proprietary, AI.

The future of open source AI models

In light of these challenges, developers and organizations that want to use open source AI will likely need to settle for pretrained models from companies such as Meta and Mistral for the time being. The inability to retrain those models restricts their transparency and flexibility, presenting limitations not typically encountered with other types of open source software.

This could change if organizations with deep pockets invest in creating open source data sets tailored for AI training, while also providing the compute resources necessary to perform that training. But for now, the challenges associated with model training mean that access to open source AI models doesn't typically confer all the advantages associated with other types of open source software.

Chris Tozzi is a freelance writer, research adviser, and professor of IT and society who has previously worked as a journalist and Linux systems administrator.

Dig Deeper on AI technologies

Part of: Open source AI: Everything you need to know

Up Next

The importance and limitations of open source AI models

Despite rising interest in more transparent, accessible AI, a scarcity of public training data and compute infrastructure presents significant hurdles for open source AI projects.

A look at open source AI models

Open source AI models have advantages over generative AI services offered by major cloud providers. But enterprises have to weigh the benefits against the costs.

Compare proprietary vs. open source for enterprise AI

Unsure whether to choose proprietary or open source AI for your enterprise deployment? Compare the pros and cons of both software models, including how each can benefit businesses.

Does AI-generated code violate open source licenses?

Unresolved questions about how open source licenses apply to AI models trained on public code are forcing users, vendors and open source developers to navigate legal gray areas.

Search Business Analytics

Synthetic data vs. real data for predictive analytics
Synthetic data helps simulate rare events and meet privacy compliance, while real data preserves natural variability needed to ...
7 predictive analytics skills to improve simulation modeling
Predictive analytics skills such as statistical analysis, data preprocessing and model evaluation can help data professionals ...
Knime updates framework for agentic AI development
The open source analytics vendor is keeping up with competitors by providing features aimed at enabling users to create ...

Search CIO

9 common risk management failures and how to avoid them
As enterprises rework their business models and strategies to meet various new challenges, risks abound. Here are nine risk ...
Traditional vs. enterprise risk management: How do they differ?
Traditional risk management and enterprise risk management are similar in their aim to mitigate risks that can harm a company. ...
Domestic manufacturing policy emphasizes U.S. tech, products
Bringing manufacturing back to the U.S. might be a lofty goal for some products, but companies like Apple are making moves to ...

Search Data Management

Informatica adds MCP support, spate of AI-fueled features
With Model Context Protocol helping standardize how enterprises develop and deploy agents, support for the open standard is ...
What is data lineage? Techniques, best practices and tools
Organizations can bolster data governance efforts by tracking the lineage of data in their systems. Get advice on how to do so ...
Collibra's acquisition of Deasy targets unstructured data
With AI development on the rise, the vendor's latest purchase better enables customers to combine the complete array of relevant ...

Search ERP

6 benefits of using low-code ERP
Using low-code ERP can result in easier user training and more agility, among other benefits. Learn more about how the software ...
Ultimo adds digital labor to org chart, EAM system
The EAM vendor is building out a digital workforce at 'light speed' to become an AI-first business. It also wants to help ...
8 ways ERP software can improve customer service
By integrating sales, inventory and shipping data, ERP software helps companies avoid delays and stockouts. Learn more about how ...

Close