In a blog post published Monday, Lasso Security said the exposed API tokens gave its researchers access to 723 organizations' GitHub and Hugging Face repositories, which contained high-value data on large language models (LLMs) and generative AI projects. Hugging Face, a data science community and development platform, says it hosts more than 500,000 AI models and 250,000 data sets.
According to Lasso Security, the exposed API tokens left organizations' GenAI models and data sets open to a variety of threats, including supply chain attacks, poisoning of training data and theft of models. Bar Lanyado, security researcher at Lasso, wrote that 655 organizations' tokens had write permissions, which gave the researchers full access to the repositories.
Some of the repositories that were open to full access were for platforms and LLMs such as the open source Meta Llama 2, EleutherAI's Pythia and BigScience Workshop's Bloom.
"The gravity of the situation cannot be overstated. With control over an organization boasting millions of downloads, we now possess the capability to manipulate existing models, potentially turning them into malicious entities," Lanyado wrote in the blog post. "This implies a dire threat, as the injection of corrupted models could affect millions of users who rely on these foundational models for their applications."
In a statement to TechTarget Editorial, Hugging Face said all exposed API tokens have been revoked, but the company appeared to put the blame primarily on customers. "The tokens were exposed due to users posting their tokens in platforms such as the Hugging Face Hub, GitHub and others," the company said. "In general, we recommend users do not publish any tokens to any code hosting platform."
However, Lanyado wrote that Hugging Face bears responsibility as well, and recommended that it continually scan for exposed API tokens and either revoke them directly or notify users. "Organizations and developers should understand Hugging Face and other likewise platforms aren't taking active actions for securing their users exposed tokens," he wrote.
Lanyado credited several organizations with fast responses to Lasso Security's findings. "Many of the organizations (Meta, Google, Microsoft, VMware, and more) and users took very fast and responsible actions, they revoked the tokens and removed the public access token code on the same day of the report," he wrote in the blog post.
Hugging Face said it is working on measures that will better prevent other exposures in the future.
"All Hugging Face tokens detected by the security researcher have been invalidated and the team has taken and is continuing to take measures to prevent this issue from happening more in the future, for example, by giving companies more granularity in terms of permissions for their tokens with enterprise hub and detection of malicious behaviors," the company said in its statement. "We are also working with external platforms like GitHub to prevent valid tokens from getting published in public repositories."
Searching for API tokens
With the rapid rise of LLMs and GenAI models, Lanyado said Lasso Security wanted to take a closer look at the security of Hugging Face, which he said was "a critical platform for the developer community." The researchers decided to scan code repositories on both Hugging Face and GitHub for exposed API tokens using the platforms' search functionality.
Lanyado said the researchers ran into obstacles while searching code with regular expressions (regex); the initial search on GitHub returned only the first 100 results. The researchers then searched with a regex matching Hugging Face API tokens, covering both user tokens and org_api tokens, which returned thousands of results. However, they could read only 100 of them.
"To overcome this obstacle, we had to make our token prefix longer, so we have brute forced the first two letters of the token to receive fewer responses per request and therefore receive access to all of the available results," he wrote.
Exposed API tokens were even more difficult to scan for on Hugging Face, Lanyado said, as the platform did not allow searches by regex. Instead, the researchers searched for API tokens by substrings.
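The prefix-lengthening approach Lanyado described can be illustrated with a short Python sketch. It is not Lasso Security's actual tooling. It assumes user tokens begin with "hf_" and organization tokens with "api_org_", and that "letters" means ASCII upper- and lowercase characters; the function names are hypothetical. The sketch only builds the narrower search strings and does not query any platform.

```python
import string
from itertools import product

# Assumption: user tokens start with "hf_" and organization tokens with "api_org_".
TOKEN_PREFIXES = ("hf_", "api_org_")

# Assumption: "letters" means ASCII lowercase followed by uppercase.
ALPHABET = string.ascii_letters


def expanded_prefixes(base: str, extra_chars: int = 2, alphabet: str = ALPHABET) -> list[str]:
    """Return `base` followed by every combination of `extra_chars` characters."""
    return [base + "".join(combo) for combo in product(alphabet, repeat=extra_chars)]


def build_queries() -> list[str]:
    """Build one narrower search string per expanded prefix, for every token type."""
    queries: list[str] = []
    for base in TOKEN_PREFIXES:
        queries.extend(expanded_prefixes(base))
    return queries


if __name__ == "__main__":
    queries = build_queries()
    print(len(queries), queries[:3])
```

Each longer prefix matches fewer tokens, so every individual search is more likely to stay under a fixed result cap, at the cost of issuing many more searches.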
After scanning repositories on both platforms, the researchers used the "whoami" Hugging Face API call, which revealed not only whether each token was valid but also the user's name, email and organization, along with the token's permissions and privileges and other information.
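For developers who want to audit the scope of their own tokens, the kind of information a "whoami" call returns can be sketched in Python as follows. The endpoint URL and the response field names ("name", "email", "orgs", "auth", "accessToken", "role") are assumptions about the Hugging Face API rather than details from Lasso's report, and both function names are hypothetical.

```python
import json
import urllib.request

# Assumption: the whoami endpoint lives at this URL and accepts a Bearer token.
WHOAMI_URL = "https://huggingface.co/api/whoami-v2"


def fetch_whoami(token: str) -> dict:
    """Ask the platform who a token belongs to. Use only with tokens you own."""
    request = urllib.request.Request(
        WHOAMI_URL, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize_whoami(payload: dict) -> dict:
    """Pull out the identity and permission fields (field names are assumed)."""
    access = payload.get("auth", {}).get("accessToken", {})
    role = access.get("role")
    return {
        "name": payload.get("name"),
        "email": payload.get("email"),
        "orgs": [org.get("name") for org in payload.get("orgs", [])],
        "role": role,
        "can_write": role == "write",
    }
```

A token whose summary shows write access to an organization is the kind Lanyado flagged as most dangerous, since it could be used to alter that organization's repositories.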
The researchers found another issue related to Hugging Face's org_api tokens. The company had previously deprecated those tokens and also blocked their usage in its Python library by checking the token type in the login function. However, Lanyado said that by making "small changes" to the login function in the library, the read functionality for org_api tokens still worked.
Even though the tokens had been deprecated, the researchers found they could use exposed org_api tokens to download private models from repositories. As an example, Lanyado said the researchers gained the ability to read and download a private LLM from Microsoft.
In light of the exposures, Lanyado recommended organizations apply token classifications and avoid any hard-coded tokens while performing code reviews for GenAI projects and LLMs. "In a rapidly evolving digital landscape, there's a major significance of early detection in preventing potential harm in securing LLMs demands," he wrote.
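A code review check for hard-coded tokens could look something like the Python sketch below. It is a minimal illustration, not a tool from Lasso or Hugging Face. The token pattern (an "hf_" or "api_org_" prefix followed by at least 30 alphanumeric characters) and the HF_TOKEN environment variable name are assumptions, and the function names are hypothetical.

```python
import os
import re
from typing import Optional

# Assumption: tokens look like "hf_" or "api_org_" plus 30 or more alphanumerics.
TOKEN_PATTERN = re.compile(r"\b(?:hf_|api_org_)[A-Za-z0-9]{30,}\b")


def find_hardcoded_tokens(source: str) -> list[tuple[int, str]]:
    """Return (line number, redacted token) for each suspected token in `source`."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in TOKEN_PATTERN.finditer(line):
            # Keep only the first six characters so the report does not leak the secret.
            findings.append((lineno, match.group(0)[:6] + "..."))
    return findings


def token_from_env(var_name: str = "HF_TOKEN") -> Optional[str]:
    """Read the token from the environment instead of embedding it in code."""
    return os.environ.get(var_name)
```

Running a check like this before each commit, and reading tokens from the environment at run time, keeps secrets out of repositories where a search could find them.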
Rob Wright is a longtime technology reporter who lives in the Boston area.