Trojan Source bugs enable 'invisible' source code poisoning

A pair of flaws in nearly every popular programming language lets attackers hide malicious code in plain sight, where human reviewers cannot detect it before the code is compiled.

Shaun Nichols, TechTarget

Published: 02 Nov 2021

Two vulnerabilities present in nearly every high-level programming language could potentially enable a bad actor to slip malicious code into a project without being detected.

Formally known as CVE-2021-42574 and CVE-2021-42694, the two bugs are collectively referred to by the name Trojan Source. Researchers Nicholas Boucher and Ross Anderson of the University of Cambridge in the U.K. are credited with the discovery.

According to Boucher and Anderson's paper on Trojan Source, the vulnerabilities exist in the way the languages handle Unicode characters within source code. Specifically, the research team found that by manipulating the control characters Unicode uses to mix left-to-right languages (such as English or Russian) with right-to-left languages (such as Arabic and Hebrew), malicious instructions could be slipped in and concealed.

"This attack exploits subtleties in text-encoding standards such as Unicode to produce source code whose tokens are logically encoded in a different order from the one in which they are displayed," the researchers wrote, "leading to vulnerabilities that cannot be perceived directly by human code reviewers."

The key to the attacks, the researchers said, is the ability to alternate between right-to-left and left-to-right text in such a way that the displayed instruction is scrambled while the underlying code still executes after it is compiled.

"Embedding multiple layers of LRI and RLI within each other enables the near-arbitrary reordering of strings," Boucher and Anderson wrote. "This gives an adversary fine-grained control, so they can manipulate the display order of text into an anagram of its logically-encoded order."

In other words, it's possible to create code that appears to be one instruction when read by a human, but something completely different when executed by the machine.
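To make the gap between displayed and encoded text concrete, here is a minimal Python sketch. The sample line and the `reveal` helper are hypothetical and written for illustration; they are not taken from the researchers' proof-of-concept code. The sketch only assumes the standard Unicode bidirectional control code points (U+202A to U+202E and U+2066 to U+2069).

```python
# Unicode bidirectional control characters, keyed by code point,
# with the abbreviations the Unicode standard uses for them.
BIDI_CONTROLS = {
    "\u202A": "LRE",  # left-to-right embedding
    "\u202B": "RLE",  # right-to-left embedding
    "\u202C": "PDF",  # pop directional formatting
    "\u202D": "LRO",  # left-to-right override
    "\u202E": "RLO",  # right-to-left override
    "\u2066": "LRI",  # left-to-right isolate
    "\u2067": "RLI",  # right-to-left isolate
    "\u2068": "FSI",  # first strong isolate
    "\u2069": "PDI",  # pop directional isolate
}


def reveal(text: str) -> str:
    """Return text with each bidi control replaced by a visible <TAG>."""
    return "".join(
        f"<{BIDI_CONTROLS[ch]}>" if ch in BIDI_CONTROLS else ch
        for ch in text
    )


# Hypothetical source line: the string literal contains invisible
# controls, so many editors would display it differently from the
# order in which the characters are actually stored.
line = 'access_level = "user\u202E \u2066// check if admin\u2069 \u2066"'

# Prints the logical (stored) order with the hidden characters exposed.
print(reveal(line))
```

Printing the revealed form shows what a compiler actually receives: the text that looks like a trailing comment is stored inside the string literal.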

"We've verified that this attack works against C, C++, C#, JavaScript, Java, Rust, Go, and Python, and suspect that it will work against most other modern languages," Anderson wrote in a separate blog post.

The most obvious targets for these flaws would be open source software projects. By sneaking attack code into otherwise benign changes to source code, criminals could target projects on code-sharing sites such as GitHub and plant malicious components in legitimate software that could steal credentials, spy on users or carry out any number of other malicious activities.

There is also the potential for a supply chain attack. Should an attacker get access to developer machines at a commercial software provider, they could potentially sneak their attack instructions into the source code of commercial software and, in turn, get a foothold on the networks of that company's customers.

Threat actors have already used a similar supply chain approach in the 2020 attack on IT services provider SolarWinds.

While Boucher and Anderson were awarded a pair of CVE designations for their Trojan Source research, there is some controversy around the paper. Critics of the duo's research charge that many of the findings have already been covered in previous research and that the technique of hiding code has been known for years.

Despite the controversy, the vulnerabilities merit attention, as several software suppliers have developed updates to address the Trojan Source bugs. Boucher and Anderson said they believe the best long-term solution for the threat will be deployed in compilers. However, the duo urged organizations to adopt additional mitigations since some compiler fixes might not be available any time soon.

"About half of the compiler maintainers we contacted during the disclosure period are working on patches or have committed to do so," the researchers wrote. "As the others are dragging their feet, it is prudent to deploy other controls in the meantime where this is quick and cheap, or relevant and needful."
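One quick and cheap interim control could be a scan of source text for bidirectional control characters before code is merged. The sketch below is an assumption about what such a check might look like, not a tool described in the article; the function name `find_bidi_controls` is hypothetical, and a real deployment would need to decide how to treat files that legitimately contain right-to-left text.

```python
# Code points of the Unicode bidirectional control characters:
# embeddings and overrides (U+202A..U+202E), isolates (U+2066..U+2069).
BIDI_CODE_POINTS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, code point) for every bidi control found.

    Line and column numbers are 1-based. Lines are split on "\n" only,
    so the numbering matches what most editors show.
    """
    findings = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) in BIDI_CODE_POINTS:
                findings.append((line_no, col, f"U+{ord(ch):04X}"))
    return findings


# Hypothetical two-line snippet with a right-to-left override hidden
# at the end of a string literal on the second line.
sample = 'x = 1\nrole = "user\u202E"\n'

for line_no, col, code_point in find_bidi_controls(sample):
    print(f"line {line_no}, column {col}: {code_point}")
```

A check like this could run in a pre-commit hook or a CI job and fail the build on any finding, leaving it to a reviewer to decide whether the character is legitimate.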
