Transformer neural networks are shaking up AI
Transformers are revolutionizing the field of natural language processing with an approach known as attention. That's just the beginning for this new type of neural network.
A transformer is a new type of neural network architecture that has started to catch fire, owing to the improvements in efficiency and accuracy it brings to tasks like natural language processing. Complementary to other neural architectures like convolutional neural networks and recurrent neural networks, the transformer architecture brings new capability to machine learning.
"Transformers move us beyond AI applications that find patterns within existing data or learn from repetition, into AI that can learn from context and create new information," said Josh Sullivan, founding team and leader at Modzy, a ModelOps platform. "I think we're only seeing the tip of the iceberg in terms of what these things can do," he said.
In the short run, vendors are finding ways to weave some of the larger transformer models for natural language processing (NLP) into various commercial applications. Researchers are also exploring how these techniques can be applied across a wide variety of problems, including time series analysis, anomaly detection, label generation and optimization problems. Enterprises, meanwhile, face several challenges in commercializing these techniques, including meeting the computation requirements and addressing concerns about bias.
And in the long run? Like Sullivan, most experts agree that current efforts just scrape the surface of how transformer neural network architectures will be applied in the future.
Transformer debut
Interest in transformers first took off after Google researchers reported on a new technique that used the concept of "attention" in translating languages. We will dive into attention a bit more deeply later, but at a high level, attention refers to the mathematical description of how things (e.g., words) relate to, complement and modify each other. Google developers highlighted this new technique in their seminal 2017 paper, "Attention Is All You Need," which showed how a transformer neural network translated between English and French more accurately and in only a quarter of the training time required by other neural nets.
Researchers have continued to develop NLP models like Google's BERT and OpenAI's GPT-3, which have significantly advanced the accuracy, performance and usability of natural language tasks like understanding text, performing sentiment analysis, answering questions, summarizing reports and generating new text.

The sheer amount of processing power that Google and OpenAI have thrown at these models is often highlighted as a major factor in their limited applicability to the enterprise. But now that these transformer models have been built, they could dramatically reduce the efforts of teams interested in refining these models for other applications, said Bharath Thota, partner in the analytics practice of Kearney, a global strategy and management consulting firm.
"With the pre-trained representations, training of heavily engineered task-specific architectures is not required," Thota said. Transfer learning techniques will make it easier to reuse the work by companies like Google and OpenAI.
Transformers vs. CNNs and RNNs
Transformers combine some of the benefits traditionally seen with convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the two most common neural network architectures used in deep learning. An analysis published in IEEE Access in 2017, the year the transformer debuted, showed that CNNs and RNNs were the predominant neural nets used by researchers, accounting for 10.2% and 29.4%, respectively, of papers published on pattern recognition, while the nascent transformer model stood at just 0.1%.
CNNs have been widely used for recognizing objects in pictures. They make it easier to process different pixels in parallel to tease apart lines, shapes and whole objects. But they struggle with ongoing streams of input like text.
Natural language processing tasks were commonly built using RNNs, which are good at evaluating ongoing streams of data such as strings of text. The technique works well when analyzing the relationship between words that are close together, but it loses accuracy in modeling the relationship between words at either end of a long sentence or paragraph.
To overcome these limitations, researchers glommed onto other neural network processing techniques, such as long short-term memory (LSTM), to increase the feedback between neurons. The new techniques improved algorithm performance but did little for models used to translate longer text.
So, researchers began exploring ways of connecting the neural network processors to represent the strength of connection between words. The mathematical modeling of how strongly words are connected was dubbed attention. At first, attention was added as an extra step to help organize a model for further processing by an RNN. The Google researchers discovered they could achieve better results by throwing out the RNNs altogether and relying on attention alone to model the relationships between words.

"Transformers use the attention mechanism, which takes the context of a single instance of data, and encodes its context …, capturing how any given word relates to other words that come before and after it," said Chris Nicholson, CEO of Pathmind, a company applying AI to industrial operations.
Transformers look at all the elements (such as all the words in a sequence) at one time, while also paying closer attention to the most important elements in the sequence, explained Veera Budhi, global head of cloud, data and analytics at Saggezza, a global technology solutions provider and consulting firm. "Previous approaches could do one or the other, but not both," he said.
This gives transformers two key advantages over other models. First, they can be more accurate because they can understand the relationship between sequential elements that are far from each other. Second, they are fast at processing a sequence since they pay more attention to its most important parts.
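To make the mechanism concrete, here is a minimal sketch of the scaled dot-product self-attention at the heart of the transformer, written in plain NumPy. The toy embeddings, dimensions and random projection matrices are illustrative assumptions, not values from any real model.

```python
# Minimal sketch of scaled dot-product self-attention (toy values for illustration only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) word embeddings; Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys and values for every word
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # how strongly each word relates to every other
    weights = softmax(scores, axis=-1)            # attention weights: each row sums to 1
    return weights @ V, weights                   # context-aware representation of each word

# Toy example: a "sentence" of 4 words, each represented by an 8-dimensional embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
_, attention_weights = self_attention(X, Wq, Wk, Wv)
print(attention_weights.round(2))  # row i shows how much word i attends to each other word
```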
An evolution in language modeling
At this point, it's helpful to take a step back to consider how AI models language. Words need to be transformed into some numerical representation for processing. One approach might be to simply give every word a number based on its position in the dictionary. But the approach does not capture the nuances of how the words relate to each other.

"This doesn't work for a variety of reasons, primarily because the words are represented as a single entry without looking at their context," said Ramesh Hariharan, CTO and head of data science at LatentView Analytics, a data science consultancy.
Researchers discovered they could provide a more nuanced view by encoding each word as a vector that describes how it relates to other words and concepts. In a simple case, a word like "king" might be expressed as a two-dimensional vector representing a male who is a head of state. In actual practice, researchers have found ways to describe words with hundreds of dimensions representing their closeness to the meanings and uses of other words. For example, the word "line" could mean the shortest distance between two points, a scheduled sequence of train stops or the boundary between two objects.
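A toy sketch of the idea, with hand-written two-dimensional vectors standing in for the hundreds of learned dimensions a real embedding model would use:

```python
# Toy word vectors for illustration only; real embeddings are learned from data,
# not written by hand, and have hundreds of dimensions.
import numpy as np

vectors = {
    # the two dimensions here loosely stand for ("royalty", "maleness")
    "king":  np.array([0.95, 0.90]),
    "queen": np.array([0.95, 0.05]),
    "car":   np.array([0.05, 0.40]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["queen"]))  # high: closely related concepts
print(cosine_similarity(vectors["king"], vectors["car"]))    # lower: unrelated concepts
```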
Before the Google breakthrough, words were encoded into these vectors and then processed by RNNs. With transformers, the process of modeling the relationships between words became the main event. It allowed researchers to take on the task of disambiguating words that have multiple meanings -- known as polysemes.
For example, humans have no trouble distinguishing what "it" means in the following:
The animal did not cross the road because it was too tired.
The animal did not cross the road because it was too wide.
Before transformers, RNN models struggled to determine whether "it" referred to the animal or the road. Attention made it easier to create a model that strengthened the relationships between certain words in the sentence: "tired," for example, is more likely linked to an animal, while "wide" is more likely linked to a road.
"Attention is a mechanism that was invented to overcome the computational bottleneck associated with context," Hariharan said. These complex neural attention networks not only remember the nearby surrounding words, but also somehow connect words that are much further away in the document, encoding the complex relationships between the words. The upshot is that attention provides an upgrade to traditional standard word vectors used to represent words in complex hierarchical contexts.
Transformer applications span industries
Transformers are enabling a variety of new AI applications and expanding the performance of many existing ones.

"Not only have these networks demonstrated they can adapt to new domains and problems, they also can process more data in a shorter amount of time and with better accuracy," Modzy's Sullivan said.
For example, a team of Google DeepMind researchers recently published preliminary results in the journal Nature on the latest iteration of AlphaFold, a new transformer technique for modeling how amino acid sequences fold into the 3D shapes of proteins. Nature declared, "It will change everything," venturing that AlphaFold will vastly accelerate efforts to understand how cells are configured and discover new drugs.
Nicholson said transformers can work with virtually any kind of sequential data -- genes, proteins in a molecule, code, playlists and online behaviors such as browsing, likes or purchases. Transformers can be used to predict what will happen next or to analyze what is happening in a specific sequence. They have the potential to extract meaning from gene sequences, target ads based on online behavior or generate working code.

In addition, transformers can readily be applied to anomaly detection, where applications range from fraud detection to system health monitoring and process manufacturing operations, said Monte Zweben, CEO of Splice Machine. Transformers make it easier to tell when an anomaly arises -- without building a classification model that requires labeled data -- because they can detect when the representation of a new event differs from the representations of events seen before. They can also flag an event that falls outside normal boundaries in ways that are not apparent in the original representation of that event.
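A minimal sketch of that idea, assuming a generic pre-trained BERT model, made-up event descriptions and an arbitrary distance threshold; a real system would learn both the "normal" region and the threshold from its own event data.

```python
# Hedged sketch: flag events whose transformer embedding sits far from the "normal" cluster.
# The model choice, the sample events and the threshold of 5.0 are illustrative assumptions.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled transformer representations, one vector per text."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state              # (batch, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).numpy()

normal_events = ["payment of $25 at grocery store", "payment of $40 at gas station"]
new_event = "wire transfer of $9,900 to an unknown account at 3 a.m."

center = embed(normal_events).mean(axis=0)                      # center of "normal" behavior
distance = float(np.linalg.norm(embed([new_event])[0] - center))
print("anomaly" if distance > 5.0 else "normal", round(distance, 2))
```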

Saggezza's Budhi expects to see many enterprises fine-tuning transformer models for NLP applications that include:
- healthcare: analyzing medical records;
- law: analyzing legal documents;
- virtual assistants (e.g., Siri): making them better at understanding our intentions;
- marketing: finding out how a user feels about a product using text or speech (see the sketch after this list); and
- medical: analyzing drug interactions.
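As a sketch of the marketing item above, a few lines with Hugging Face's pipeline API can score how customers feel about a product. The default sentiment model and the sample reviews are assumptions; an enterprise would typically fine-tune a model on its own review data.

```python
# Hedged sketch of sentiment analysis with a pre-trained transformer.
# pipeline() downloads a default English sentiment model; the reviews are made up.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
reviews = [
    "The new release is fantastic -- setup took five minutes.",
    "Support never responded and the app keeps crashing.",
]
for review, result in zip(reviews, sentiment(reviews)):
    print(result["label"], round(result["score"], 3), "-", review)
```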
Challenges
The number one challenge in using transformers is the sheer size of the largest models and the processing power required to build them. Although the models are being offered as commercial services, enterprises will still face challenges in customizing the hyperparameters of these models for business problems to produce appropriate results, Thota said.
Pre-trained transformer models
Large-scale pre-trained transformer models are provided to developers via three fundamental building blocks, illustrated in the sketch that follows this list:
- Tokenizers, which convert raw text to encodings. This is the process of converting the raw text into a form that can be processed by algorithms.
- Transformers, which transform the encodings into contextual embeddings. This is the process of converting the encoded format into vectors representing how the encoding relates to related concepts.
- Heads, which use contextual embeddings to make a task-specific prediction. These are the hooks programmers use to connect predictions about a word to likely next steps, such as the completion of a sentence.
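These building blocks map directly onto how open source libraries such as Hugging Face transformers expose pre-trained models. A minimal sketch, assuming a publicly available checkpoint that already has a sentiment classification head:

```python
# Hedged sketch: tokenizer -> transformer -> task-specific head.
# The checkpoint name is an example; any pre-trained model with a classification head works similarly.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)                   # 1. tokenizer: raw text -> encodings
model = AutoModelForSequenceClassification.from_pretrained(name)  # 2. transformer body + 3. classification head

inputs = tokenizer("Transformers are surprisingly easy to reuse.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                               # the head's score for each label
print(model.config.id2label[int(logits.argmax())])
```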
Also, many of these models come with various limitations for enterprise developers. For example, Google's BERT only accepts up to 512 tokens as input when classifying text. Therefore, for document classification where the text runs longer than 512 tokens, it must either be truncated before use or divided into two or more inputs to cover the full text.
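One common workaround, sketched below, is to tokenize the document and split it into overlapping windows that each fit under the limit, then classify each window and combine the predictions. The window size and overlap are illustrative choices.

```python
# Hedged sketch: splitting a long document into overlapping token windows for BERT.
# 510 tokens per window leaves room for the [CLS] and [SEP] special tokens; the overlap is arbitrary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_token_ids(text, window=510, stride=64):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = window - stride
    return [ids[start:start + window] for start in range(0, len(ids), step)]

chunks = chunk_token_ids("very long annual report text ... " * 500)
print(len(chunks), "windows; longest window:", max(len(c) for c in chunks), "tokens")
# Each window can be classified separately and the per-window predictions aggregated.
```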
Hariharan cautioned that teams also need to watch for bias in using these models. "The models are biased because all the training data are biased in some sense," he said. This can have serious implications as enterprises work on establishing trusted AI practices.
Teams also need to be alert to types of errors different from the ones they might have observed with other neural network architectures.

"These are still early days, and as we start using transformers for larger data sets and for forecasting multiple steps, there is a higher chances of errors," said Monika Kochhar CEO and co-founder of SmartGift, an online gifting platform.
Research has shown that many of these models end up wasting attention tracking connections that are never used. As AI scientists figure out how to work around these constraints, the models will become faster and simpler.

Scott Boettcher, vice president of data intelligence at NTT DATA Services, said his team has been grappling with validating the accuracy of new models that underpin their virtual agents. Although the NLP applications they build have improved, the models still cannot actually reason -- and subtle changes in context can trip them up in ways that are sometimes laughable but can have real consequences. The word "credit," for example, is used by customers interested in opening a credit card; it's also used when seeking a credit to an account.
"If our virtual agent gets on the wrong track because of a context shift, our customer may not detect they are suddenly being steered improperly," Boettcher said. The virtual agent's responses can seem very convincing and be mistaken for valid responses. "This presents a complex, new testing challenge for the training of our virtual agents."
A generative future
There is a lot of excitement about using transformers for new types of generative AI applications. OpenAI's GPT-3 transformer has showcased some promising ways of generating text on the fly.
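GPT-3 itself is reached through OpenAI's commercial API, but the same generate-text-on-the-fly idea can be sketched with the smaller, openly available GPT-2 model; the prompt and sampling settings here are illustrative.

```python
# Hedged sketch of transformer text generation using the openly available GPT-2
# as a small stand-in for GPT-3; prompt and sampling settings are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Transformers are changing enterprise software because"
result = generator(prompt, max_length=60, do_sample=True, num_return_sequences=1)
print(result[0]["generated_text"])
```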

But why stop at text? David Glazer, analyst and research lead at Info-Tech Research Group, a technology advisory service, expects transformers to open use cases in engineering.
Transformers could make it easier to extend techniques like generative adversarial networks, now used to create images of fake people, into other domains, like bridge building. Once an algorithmic interface can effectively understand what kind of bridge an architect is looking to design based on their requirements, it can suggest different bridge designs to fit that need. The adversarial part would not be about testing whether the generated output looks like a real person, but whether a new bridge design would support the desired load at the right cost.
"This concept applies to many different design disciplines, and transformers have the ability to dramatically accelerate the movement toward generative AI," he said.