Why you need to consider using small data to train AI models

A smaller data set makes sense for certain applications, such as intelligent document processing. It is less helpful in cases in which a large volume of data is needed to avoid mistakes.

AI models are only as good as their training data.

Some models benefit from large amounts of data. A good example is OpenAI's Dall-E 2, which was trained on huge volumes of data to generate images from text descriptions. Other models do not require a lot of data and would not benefit from more.

The idea that small data is just as essential to AI systems and technologies as big data is gaining ground. A 2021 Scientific American article by Georgetown University researchers described one approach to small data: first training a model on big data and then retraining the model on a smaller data set. This is known as fine-tuning.
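The two-step process described above can be illustrated with a toy example. The following is a minimal sketch, not the method from the Scientific American article or from Eigen Technologies: it assumes a one-variable linear model, synthetic data and plain gradient descent, and every name in it (train, make_data, the data set sizes and learning rates) is hypothetical. The model is first fit on a large, general data set and then retrained briefly on 20 examples from a related task.

```python
import random


def train(data, w=0.0, b=0.0, lr=0.05, epochs=20):
    """Fit y ~ w*x + b by full-batch gradient descent, starting from (w, b)."""
    n = len(data)
    for _ in range(epochs):
        grad_w = 2 / n * sum((w * x + b - y) * x for x, y in data)
        grad_b = 2 / n * sum((w * x + b - y) for x, y in data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def mse(data, w, b):
    """Mean squared error of the line (w, b) on a data set."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def make_data(n, slope, intercept, noise, rng):
    """Synthetic (x, y) pairs scattered around y = slope*x + intercept."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, slope * x + intercept + rng.gauss(0, noise)))
    return data


rng = random.Random(0)

# Assumed "big data": 2,000 examples of a general task, y = 2x + 1.
big = make_data(2000, 2.0, 1.0, 0.1, rng)
# Assumed "small data": 20 examples of a related task, y = 2x + 1.5.
small = make_data(20, 2.0, 1.5, 0.1, rng)
# Held-out examples of the small task, used only for evaluation.
held_out = make_data(500, 2.0, 1.5, 0.1, rng)

# Step 1: train on the big data set.
pre_w, pre_b = train(big, lr=0.1, epochs=300)

# Step 2: fine-tune, i.e. keep training from the pretrained weights
# on the small data set.
ft_w, ft_b = train(small, w=pre_w, b=pre_b)

# Baseline: the same small training budget, but starting from zero.
scratch_w, scratch_b = train(small)

print("pretrained only:", mse(held_out, pre_w, pre_b))
print("fine-tuned:     ", mse(held_out, ft_w, ft_b))
print("from scratch:   ", mse(held_out, scratch_w, scratch_b))
```

In this sketch, the fine-tuned model starts from weights that already capture most of the task, so a few passes over 20 examples are enough to adapt it. The from-scratch baseline gets the same small training budget but has to learn everything from those 20 examples.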

While there are areas in which big data is needed, such as autonomous vehicles, many other AI applications can function with a small amount of data, according to Lewis Z. Liu, co-founder and CEO of Eigen Technologies, a New York-based startup whose AI platform enables enterprises to extract data from documents.

In this Q&A, Liu discusses when small data is preferable to big data and how to make small data relevant.

What's the benefit of small data AI?

Lewis Z. Liu: If you're small, you have much more control. So you can be conscious about what kind of bias or nonbias [is present]. It's more about conscious bias versus unconscious bias.

When is small data preferable to big data for an AI model or system?

Liu: I would argue that in the case of intelligent document processing, you want to use small data AI.

On one hand, you have what I call high-bar, low-marginal-value documents. By low marginal value, I mean easy to automate -- things like passports, driver's licenses, W-2 tax forms. Those things are simple and really high volume -- most Americans have a W-2 form, right? Half of Americans have passports. Those are easy. Generally, you'll use the big data approach because you have high volume.

But if you look at most invoice processes, your finance department wants to process its invoices, but it may only have 1,000 invoices a year. If you are a Wall Street trader and you're trading some exotic derivative, there may only be 200 of those derivatives issued. Or you're an insurance broker that insures residential property, and your brokerage firm may only get 1,000 of these property documents a year.

There are many more use cases and many more document types that are really high value because you're a lawyer or you're a banker or you're an insurance broker looking at these documents, but the document volumes are low per use case. So you actually need small data AI to tackle all these use cases. Furthermore, generally, the people looking at these documents are highly paid. Therefore, you actually get what I like to call 'lower volume, higher value.'

What happens when small data AI is not enough?

Liu: The data and the documents are just one part of the broader story in the business operation. Sometimes that's all you need. For some cases, you need to combine the data you get from documents and from other sources. For example, you're buying a house -- you need to look at the title insurance, you need to look at the land grant deed, and you need to look at the homeowner policy. You need to collect data from all of these sources, but you also need to collect data from bank accounts and all those things which are not from documents.

What's the direction of big data versus small data in AI?

Liu: This is highly use-case-specific. Big data sets are the future. You need a lot of data to train a self-driving car. There's no way you can use small data for that. However, in a lot of enterprise applications like intelligent document processing or automated insurance underwriting -- where there are a lot of these use cases, but they're all very specific -- small data is the way to go.

If big data is the future, how can small data AI remain relevant?

Liu: Big data AI is useful for a lot of applications, not all applications.

A human being is versatile, and the whole reason why human beings are so smart is the fact that we are sort of small data machines. We can learn from one or two examples, and then we can do it. If I show you a dance move twice, you can probably do that dance. That flexibility is what makes a human being so versatile in the workplace.

Using small data AI, you have one or two or three training examples, and you can train the AI to do a certain task. It's that flexibility that makes human beings shine. The future of AI is that some AI systems have that versatility and can shine in that way.

Editor's note: This Q&A has been edited for clarity and conciseness.
