IBM’s Lambada AI generates training data for text classifiers

What’s a data scientist to do if they lack sufficient data to train a machine learning model? One potential avenue is synthetic data generation, which researchers at IBM Research advocate in a newly published preprint paper. They used a pretrained machine learning model to artificially synthesize new labeled data for text classification tasks. They claim that their method, which they refer to as language-model-based data augmentation (Lambada for short), improves classifiers’ performance on a variety of data sets and significantly improves upon state-of-the-art techniques for data augmentation.

“Depending upon the problem at hand, getting a good fit for a classifier model may require abundant labeled data. However, in many cases, and especially when developing AI systems for specific applications, labeled data is scarce and costly to obtain,” wrote the paper’s coauthors.

Generating synthetic training data tends to be more challenging in the text domain than the visual domain, the researchers note, because the transformations used in simpler methods usually distort the text, making it grammatically and semantically incorrect. That’s why most text data augmentation techniques — including those detailed in the paper — involve replacing a single word with a synonym, deleting a word, or changing the word order.
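
To make those simpler transformations concrete, here is a minimal Python sketch of the three operations. The tiny synonym table is a hand-written placeholder (a real augmenter would draw on a lexical resource such as WordNet), and none of the names or data come from the paper.

```python
import random

# Toy synonym table, purely for illustration; a real augmenter would use
# a lexical resource such as WordNet rather than a hand-written dict.
SYNONYMS = {
    "good": ["great", "fine"],
    "film": ["movie", "picture"],
}

def synonym_replace(tokens):
    """Replace one randomly chosen word with a known synonym."""
    candidates = [i for i, t in enumerate(tokens) if t in SYNONYMS]
    if candidates:
        i = random.choice(candidates)
        tokens[i] = random.choice(SYNONYMS[tokens[i]])
    return tokens

def random_delete(tokens, p=0.1):
    """Drop each word with probability p, always keeping at least one."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

def random_swap(tokens):
    """Exchange the positions of two randomly chosen words."""
    if len(tokens) >= 2:
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

sentence = "the film was good".split()
print(" ".join(synonym_replace(sentence[:])))  # e.g. "the movie was good"
print(" ".join(random_delete(sentence[:])))    # e.g. "the film good"
print(" ".join(random_swap(sentence[:])))      # e.g. "was film the good"
```

As the researchers note, even operations this conservative can break grammar or shift meaning, which is what motivates generating whole new sentences instead.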

Lambada leverages a generative model (OpenAI’s GPT-2) that’s pretrained on large bodies of text, enabling it to capture the structure of language such that it produces coherent sentences. The researchers fine-tuned the model on an existing, small data set, then used the fine-tuned model to synthesize new labeled sentences. Independently, they trained a classifier on the same small data set and had it filter the synthesized corpus, retaining only data that appeared to be “qualitative enough,” before retraining the classifier on both the existing and synthesized data.
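
As a rough illustration of that generate-filter-retrain loop, the sketch below mirrors the control flow in plain Python. The generate_labeled and classifier_confidence functions are trivial stand-ins invented for this example so the code runs on its own; in Lambada proper they would be the fine-tuned GPT-2 model and the classifier trained on the original data set.

```python
import random

# Stand-in generator and classifier so the sketch is self-contained.
# Only the generate-filter-retain control flow mirrors the paper.

def generate_labeled(label, n):
    """Stand-in generator: emit n sentences conditioned on a class label."""
    templates = {
        "flight": ["show me flights to {}", "book a flight to {}"],
        "fare": ["what is the fare to {}", "how much does it cost to fly to {}"],
    }
    cities = ["boston", "denver", "miami"]
    return [random.choice(templates[label]).format(random.choice(cities))
            for _ in range(n)]

def classifier_confidence(sentence, label):
    """Stand-in classifier: score how strongly a sentence fits a label."""
    keywords = {"flight": ("flight", "flights"), "fare": ("fare", "cost")}
    return 1.0 if any(k in sentence for k in keywords[label]) else 0.0

def lambada_synthesize(labels, n_per_class, threshold=0.5, max_rounds=10):
    """Over-generate per class, keeping only confidently labeled samples."""
    synthetic = []
    for label in labels:
        kept = []
        for _ in range(max_rounds):
            if len(kept) >= n_per_class:
                break
            for sentence in generate_labeled(label, n_per_class):
                if classifier_confidence(sentence, label) >= threshold:
                    kept.append((sentence, label))
        synthetic.extend(kept[:n_per_class])
    # The retained samples would then be pooled with the original labeled
    # data and the classifier retrained on the union.
    return synthetic

for sentence, label in lambada_synthesize(["flight", "fare"], n_per_class=3):
    print(label, "->", sentence)
```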

To validate their approach, the researchers tested three different classifiers — BERT, a support vector machine, and a long short-term memory network — on three data sets by running experiments in which they varied the training samples per class. The corpora in question contained queries on flight-related information, open-domain and fact-based questions in several categories, and data from telco customer support systems.
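
A hedged sketch of that evaluation setup, assuming a simple stratified-subsampling protocol (the paper’s exact procedure may differ), is below; the data set contents are placeholders.

```python
import random
from collections import defaultdict

def subsample_per_class(examples, n_per_class, seed=0):
    """examples: list of (text, label); keep at most n examples per label."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    subset = []
    for items in by_label.values():
        rng.shuffle(items)
        subset.extend(items[:n_per_class])
    return subset

# Placeholder training data; the paper uses real corpora (e.g. flight queries).
train = [("show flights to boston", "flight"), ("fare to miami", "fare"),
         ("flights from denver", "flight"), ("cost of a ticket", "fare")]
for n in (1, 2):
    subset = subsample_per_class(train, n)
    print(n, "per class ->", len(subset), "training examples")
```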

They report that Lambada statistically improved all three classifiers’ performance on small data sets, which they attribute in part to its control over the number of samples generated per class. That control allowed them to invest more of the generation effort in classes that are under-represented in the original data set, they said.
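
One plausible way to implement that per-class control, shown below, is to compute a generation budget that tops each class up to the size of the largest one. This exact policy is an illustrative assumption, not necessarily the paper’s.

```python
from collections import Counter

# Illustrative budget rule (an assumption, not the paper's exact policy):
# generate enough synthetic samples to top every class up to the size of
# the largest class in the original training set.
train_labels = ["flight"] * 40 + ["fare"] * 8 + ["aircraft"] * 2
counts = Counter(train_labels)
target = max(counts.values())

generation_budget = {label: target - n for label, n in counts.items()}
print(generation_budget)  # {'flight': 0, 'fare': 32, 'aircraft': 38}
```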

“Our augmentation framework does not require additional unlabeled data … Surprisingly, for most classifiers, LAMBADA achieves better accuracy compared to a simple weak labeling approach,” wrote the coauthors. “Clearly, the generated data set contributes more to improving the accuracy of the classifier than … samples taken from the original data set.”