What do NLP benchmarks like GLUE and SQuAD mean for developers?

The growing scale and scope of NLP data sets have dramatically improved AI models for a range of language understanding tasks, and the same data sets now serve as benchmarks for comparing models.

The recent rise in the scale and scope of NLP data sets has catalyzed significant improvements in AI models for various language understanding and generation tasks. These data sets are also being used to benchmark the performance of different models against a standard set.

These models can help inform the development of practical enterprise applications, such as better chatbots, better summarization tools and improved digital assistants. But enterprises need to proceed cautiously. A model that has made it to the top of the leaderboard isn't necessarily practical for a given use case.

Examining benchmarks provides a frame of reference for system performance, linguistic ability and parameters like speed.

The evolution of benchmarks

"Slowly but steadily, the evolution of benchmarks such as GLUE, SQuAD and RACE have helped text and sentence interpretation to become more human," said Satyakam Mohanty, vice president at global IT services company L&T Infotech.

These benchmarks cover lexical entailment, decoupling of common sense from knowledge, constituency identification and coreference resolution, all of which are critical in language understanding.

"Previously, it was difficult to model and predict on a sample set of sentences, and validation remains challenging as well," Mohanty said.

This is where these newer frameworks help, providing pre-trained models for labeled data sets created by experts and computational linguists. For example, someone working on lexical entailment with GLUE can readily build a module on top of the pre-trained model.

Similarly, SQuAD helps with answering questions about unstructured paragraphs or sentences. The answers will not be 100% correct, but a model can achieve 60% to 70% accuracy. Moreover, anyone who wants to apply this to their own data set can use the pre-trained model with some refinement rather than building one from scratch.
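
As a rough illustration of what "accuracy" means here, SQuAD-style question answering is usually scored with two metrics: exact match and token-overlap F1 between the predicted answer and a reference answer. The sketch below is a simplified version of that idea, not the official evaluation script; the function names and the normalization steps are assumptions made for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as agreement; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A longer-than-needed answer fails exact match but earns partial F1 credit.
    print(exact_match("in Paris, France", "Paris"))
    print(token_f1("in Paris, France", "Paris"))
```

Averaging either metric over a held-out set of question/answer pairs from your own documents gives a number that can be compared against the leaderboard figure, which is often a more useful check than the leaderboard figure alone.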

Different language phenomena

It's also important to keep in mind that the benchmarks measure language phenomena scientifically, in a controlled environment, and results in the real world can vary, said Hadayat Seddiqi, director of machine learning at InCloudCounsel.

Today's benchmarks are easily beaten by the best models, only for researchers to realize that the benchmarks left gaps for very well-optimized models to exploit. Developers are more concerned with building something useful than understanding a phenomenon around language and perfecting it.

"This is bad for science, but not necessarily bad for business applications," Seddiqi said.

Developers need to know the extent to which certain language phenomena need to be deeply modeled, and when they can be modeled in a surface-level way. The best models today show that they can answer questions at human-level performance (in the cases of GLUE and SQuAD), but they don't generalize outside of those data sets.

"This says something about the power and utility of these models, but it also says just as much about the data set you're training it on," Seddiqi said.

Better language models

In some ways, this recent progress mirrors the explosion of new deep learning techniques for image recognition after the publication of the large ImageNet data set. The data sets behind these new language models grew from a few thousand labeled examples to hundreds of thousands. Data sets like GLUE expanded the scope of testing by providing multiple types of challenges, which has made it easier to create models that do well across many different types of tasks.

"I believe that the stunning progress we see across various NLP tasks is empowered through the evolution of benchmarks," said Reggie Twigg, director of product marketing at ABBYY, an OCR tools provider.

Those looking to apply NLP want an algorithm to comprehend texts for various purposes -- from identification and classification to extracting entities for use in business processes. All the benchmarks -- GLUE, SQuAD and RACE -- address this setup. In practical terms, this usually means extracting meaningful information out of unlabeled and semi-structured data in documents and related messages, such as emails.

One of the dangers of the benchmarks, however, is that they may encourage developers to focus on incremental improvements on leaderboards, which disincentivizes riskier game-changing advancements, said Daniel Kobran, COO and co-founder of Paperspace, an AI development platform.

Just like horsepower

While improvements on the benchmarks do reflect real progress, it's important to keep in mind that these are developed by and for NLP researchers to provide some objective measure of performance, so they can decide which results and approaches are worth pursuing and publishing.

"You are not the intended audience, nor are the practitioners on your team who are actually putting the models into production," said Nate Nichols, distinguished principal at Narrative Science, a natural language generation tools provider.

With NLP benchmarks, other considerations should all be weighed at least as heavily as benchmark performance. These include ease of deployment and maintenance, the ability to incorporate into existing workflows, the required machine load, in-house experience and expertise, and cost.
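
One way to make that weighing explicit is a simple weighted scorecard. The sketch below is only an illustration: the weights, the two candidate names and every rating are made-up placeholders, not measurements of any real model. Ratings run from 0 to 1, with higher meaning better for the team (so a high "cost" rating means cheaper).

```python
# Hypothetical weights for one team. Each operational factor is weighted
# more heavily than benchmark performance, per the advice above.
WEIGHTS = {
    "benchmark_performance": 0.10,
    "ease_of_deployment_and_maintenance": 0.18,
    "fit_with_existing_workflows": 0.18,
    "machine_load": 0.18,
    "in_house_expertise": 0.18,
    "cost": 0.18,
}

# Made-up ratings for two hypothetical candidates (0 = poor, 1 = excellent).
CANDIDATES = {
    "leaderboard_topper": {
        "benchmark_performance": 0.95,
        "ease_of_deployment_and_maintenance": 0.40,
        "fit_with_existing_workflows": 0.50,
        "machine_load": 0.20,
        "in_house_expertise": 0.40,
        "cost": 0.30,
    },
    "compact_model": {
        "benchmark_performance": 0.85,
        "ease_of_deployment_and_maintenance": 0.80,
        "fit_with_existing_workflows": 0.70,
        "machine_load": 0.90,
        "in_house_expertise": 0.70,
        "cost": 0.80,
    },
}


def weighted_score(ratings: dict, weights: dict) -> float:
    """Sum of each rating multiplied by the weight of its criterion."""
    return sum(weights[name] * ratings[name] for name in weights)


def rank_candidates(candidates: dict, weights: dict) -> list:
    """Return (name, score) pairs sorted from best to worst."""
    scored = [(name, weighted_score(r, weights)) for name, r in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_candidates(CANDIDATES, WEIGHTS):
        print(f"{name}: {score:.3f}")
```

With these placeholder numbers the model with the lower benchmark score ranks first, which is the point of the exercise: the ranking depends on the weights a team chooses, not on the leaderboard alone.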

You can think of NLP benchmarks as the horsepower of a car, Nichols said. If you're buying a car, knowing the horsepower can give a sense of what the car can do, and whether you could win any races. But most people think of horsepower as just one input into their evaluation of the car.

Other factors to consider are size, speed, processing power and energy consumption. For example, Google's BERT inspired a variety of recent NLP models optimized for various characteristics, such as Huawei's TinyBERT, which is one-seventh the size and nine times faster.

"NLP projects don't typically fail because the underlying model or approach didn't perform well enough on some benchmark," said Nichols.

"They fail because there isn't enough training data, or there wasn't enough expertise to deploy and maintain the model, or the app or product surrounding the NLP component shifted direction."

Popular benchmarks

The most popular NLP data sets and benchmarks typically provide raw data for training, along with similar held-out data used for performance testing and for qualifying for a position on a leaderboard. Some of the most popular data sets include:

SQuAD (2016)

The Stanford Question Answering Dataset (SQuAD) v1.1 is a collection of 100,000 crowdsourced question/answer pairs drawn from Wikipedia.

SQuAD2.0, introduced in 2018, builds on this with 50,000 unanswerable questions designed to look like answerable ones. To perform well, the NLP model must determine when the correct answer is not available.
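
A minimal sketch of how a system might decide when to abstain, assuming the model produces a score for each candidate answer span plus a separate "no answer" score. The `SpanCandidate` class, the `choose_answer` function and the default threshold are hypothetical names and values chosen for illustration; real systems tune the threshold on development data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpanCandidate:
    text: str     # candidate answer text extracted from the passage
    score: float  # model confidence for this span (higher is better)


def choose_answer(
    candidates: List[SpanCandidate],
    no_answer_score: float,
    threshold: float = 0.0,
) -> Optional[str]:
    """Return the best span, or None when abstaining looks like the safer call.

    The model abstains when the no-answer score beats the best span score
    by more than `threshold`.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.score)
    if no_answer_score - best.score > threshold:
        return None
    return best.text


if __name__ == "__main__":
    spans = [SpanCandidate("1889", 4.2), SpanCandidate("1887", 1.1)]
    print(choose_answer(spans, no_answer_score=0.5))  # confident span wins
    print(choose_answer(spans, no_answer_score=6.0))  # model abstains
```

Raising the threshold makes the system answer more often at the risk of answering unanswerable questions; lowering it makes the system more cautious.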

RACE (2017)

The ReAding Comprehension from Examinations (RACE) data set includes nearly 28,000 reading passages and nearly 100,000 questions.

SWAG (2018)

The Situations With Adversarial Generations data set contains 113,000 sentence-pair completion examples that evaluate grounded commonsense inference.

GLUE (2018)

The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine different language understanding tasks.

SuperGLUE was introduced in 2019 as a set of more difficult tasks and a software toolkit.
