Natural language processing benchmarks such as the General Language Understanding Evaluation, or GLUE, and the Stanford Question Answering Dataset, or SQuAD, provide a useful backdrop for improving NLP models, but success on these benchmarks does not translate directly to enterprise applications. What are good metrics for improving the performance of your natural language processing models on real-world applications?
There are tradeoffs between benchmarks general enough to compare performance across different frameworks or algorithms and metrics purpose-built to measure a particular model or use case.
Enterprise NLP applications require customizing metrics for a given use case, measuring the success of different models and repeating with variations.
Focus on salience
The best way to customize NLP metrics for an individual app is to focus on the most salient aspects of a machine learning model for a particular use case. Eschew benchmark metrics for success in favor of specific use cases, like NLP for contracts.
"The best metrics are those specific to the task at hand," said Daniel Kobran, COO and co-founder of Paperspace, an AI development platform.
Say you have two systems: One can correctly answer a variety of questions based on the text of an agreement but performs poorly on the financial aspects of the deal, while the other extracts the financial information and handles it in detail but pays no attention to anything else. The first system would be useless for business applications centered on finance, yet it would probably post higher benchmark scores than the second.
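The contrast above can be made concrete by scoring a model on a business-critical slice of the test set separately from the overall set. A minimal sketch, with invented predictions, labels and the "financial" slice flag chosen purely for illustration:

```python
def slice_accuracy(predictions, labels, mask=None):
    """Accuracy over all examples, or only those where mask is True."""
    if mask is None:
        mask = [True] * len(labels)
    selected = [(p, l) for p, l, m in zip(predictions, labels, mask) if m]
    if not selected:
        return 0.0
    return sum(p == l for p, l in selected) / len(selected)

# Invented toy data: 1 = correct answer available, last three questions
# concern the financial terms of the agreement.
labels       = [1, 0, 1, 1, 0, 1]
is_financial = [False, False, False, True, True, True]

system_a = [1, 0, 1, 0, 0, 0]  # strong on general questions, weak on finance
system_b = [0, 1, 0, 1, 0, 1]  # the reverse

print(slice_accuracy(system_a, labels))                # higher overall
print(slice_accuracy(system_a, labels, is_financial))  # weak on the slice
print(slice_accuracy(system_b, labels, is_financial))  # perfect on the slice
```

System A wins on the aggregate number while losing badly on the slice that the business actually cares about, which is exactly the gap a task-specific metric exposes.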
"Since business processes and applications are the consumer of NLP, they set a tougher set of requirements," said Reggie Twigg, director of product marketing at ABBYY. This is compounded by the fact that each organization has different demands under which NLP models need to perform and be measured.
Include user value
Real-world factors play a big part in determining the eventual success of your NLP project. Some of the common -- and expected -- metrics can include things like accuracy, effort, cost or training data required, but these are only part of the story. It's important to ensure your team does not confuse scoring high against the benchmark with actually providing value to the user.
"The danger of a single number to focus on is that developers and data scientists can fixate on driving that number as high as possible, at the risk of losing sight of what is actually most important for your users and customers," said Nate Nichols, distinguished principal at AI company Narrative Science.
An NLP model may already perform well enough for users who would prefer you spent your resources improving other parts of your app or service. Meanwhile, your team of NLP practitioners is regularly eking out incremental gains against a predetermined benchmark score that users might not even notice. Perhaps the most useful exercise is determining the desired UX and user value of the app and how NLP can boost that experience.
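One way to keep real-world factors like cost and labeling effort visible alongside accuracy is to fold them into a single comparable score. A minimal sketch; the weights, dollar figures and hours below are all invented for illustration:

```python
def project_score(accuracy, cost_usd, label_hours,
                  w_acc=1.0, w_cost=0.0001, w_effort=0.001):
    """Higher is better: reward accuracy, penalize cost and annotation effort.

    The weights encode what the business values and would be tuned per
    project -- they are placeholders here, not recommended values.
    """
    return w_acc * accuracy - w_cost * cost_usd - w_effort * label_hours

# A cheap model vs. a slightly more accurate but far more expensive one.
cheap   = project_score(accuracy=0.86, cost_usd=500,  label_hours=40)
premium = project_score(accuracy=0.90, cost_usd=9000, label_hours=300)

print(round(cheap, 3), round(premium, 3))  # the cheap model wins here
```

The point is not this particular formula but that the scoring function makes the tradeoff explicit: a four-point accuracy gain can lose once cost and effort carry any weight at all.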
Understand your problem domain
Understanding what aspects of language need modeling for your particular tasks requires deep knowledge of both language and your problem domain, said Hadayat Seddiqi, director of machine learning at InCloudCounsel, a legal tech company.
Seddiqi's team, which develops AI models to help understand legal documents, looks at the tasks for a project and compares them against open benchmarks to assess the potential of different models. It is also important, he said, to have a strong annotation and error analysis pipeline where people can spend time with the data and the model to understand them.
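A core step in such an error analysis pipeline is tallying a model's mistakes by annotated category so reviewers see where it fails most. A minimal sketch with invented contract-review records (the category names, labels and predictions are illustrative, not from any real system):

```python
from collections import Counter

# Hypothetical annotated outputs: gold label vs. model prediction,
# tagged with the contract clause category being extracted.
records = [
    {"category": "termination", "gold": "yes", "pred": "yes"},
    {"category": "payment",     "gold": "yes", "pred": "no"},
    {"category": "payment",     "gold": "no",  "pred": "yes"},
    {"category": "liability",   "gold": "no",  "pred": "no"},
]

def errors_by_category(records):
    """Count disagreements between gold labels and predictions per category."""
    return Counter(r["category"] for r in records if r["gold"] != r["pred"])

print(errors_by_category(records).most_common())  # [('payment', 2)]
```

Even a crude tally like this directs annotators' time toward the categories where the model and the data most need scrutiny.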
"Ultimately the best measurements come from a true understanding of the problem you're dealing with," Seddiqi said.
For example, his team found the SQuAD benchmark relevant because it also does question answering. But legal contracts are structured differently from the Wikipedia trivia that SQuAD is built on, so not all the learnings transfer, Seddiqi said.
Consider developer UX
Building models can be automated easily -- but analyzing behavior and model durability is a distinctly human endeavor that requires creativity and critical thinking with a focus on your own particular needs.
Iteratively assessing the strength of different machine learning models starts with exploring the data, planning and writing about hypotheses and measurements, which are confirmed during the analysis portion where errors and other unexpected model behaviors are scrutinized.
Seddiqi's team has been working with a concept they call "developer UX," which involves building the right interface for AI developers and data scientists to help them gain the knowledge they came for as quickly as possible. Sometimes this relies on building good software tools; other times it requires a mathematical tool that makes uncovering NLP metrics simpler.
Nichols expects this process to become more standardized and commoditized, much as traditional software deployment has over the last few years. In the meantime, fully managed services like Amazon SageMaker, Azure Machine Learning Studio and Paperspace Gradient are helping teams analyze, create and develop NLP based on user- or app-specific metrics for success.