Javad Hosseini
Synthetic data generation for domain generalization: The cases of Natural Language Inference and Proposition Segmentation
Many NLP tasks are approached either by supervised training of language models on existing, human-annotated, task-specific data or by directly prompting large language models (LLMs), potentially with few-shot examples. Although supervised models report high performance on existing benchmarks, their performance on out-of-distribution data is often considerably worse. Few-shot prompting of LLMs, on the other hand, can generalize better out of domain, but it is very costly to run at large scales.
In this talk, I will present an approach that uses large language models (LLMs) to generate synthetic data for domain generalization of existing models. Our approach has three main steps: 1) generating synthetic raw text in many domains, covering different text lengths; 2) using the synthetic raw text and a teacher LLM trained on existing human-annotated data to generate similar labeled data in those domains; 3) using the generated data to train scalable student models. We applied this recipe to two NLP tasks: Natural Language Inference, which has large human-annotated datasets, and Abstractive Proposition Segmentation, which has a relatively small set of annotated data. In both cases, we show that the resulting student models perform considerably better on out-of-domain datasets than models trained on the original human-annotated data alone.
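The three-step recipe above can be sketched in code. This is a minimal, hypothetical illustration: the domain list, the LLM calls, and the student training are all mocked placeholders, not the actual models or prompts used in the work.

```python
# Hypothetical sketch of the three-step synthetic-data recipe.
# All "model" calls below are mocks standing in for real LLMs.

import random

random.seed(0)

DOMAINS = ["news", "reviews", "dialogue", "legal"]  # assumed example domains

def generate_raw_text(domain, n):
    """Step 1: synthesize raw text in a given domain (mocked)."""
    return [f"[{domain}] synthetic passage {i}" for i in range(n)]

def teacher_annotate(premise):
    """Step 2: a teacher LLM, trained on human-annotated NLI data,
    turns synthetic raw text into a labeled example (mocked)."""
    label = random.choice(["entailment", "neutral", "contradiction"])
    return {"premise": premise,
            "hypothesis": f"hypothesis for {premise}",
            "label": label}

def train_student(examples):
    """Step 3: train a small, scalable student model on the generated
    data (mocked here as collecting label statistics)."""
    counts = {}
    for ex in examples:
        counts[ex["label"]] = counts.get(ex["label"], 0) + 1
    return counts

synthetic = [teacher_annotate(p)
             for d in DOMAINS
             for p in generate_raw_text(d, 5)]
student = train_student(synthetic)
print(len(synthetic), sorted(student))
```

In the actual approach, the student would be a compact supervised model trained on the teacher-labeled synthetic examples, which is what makes it cheap to run at scale compared with prompting the LLM directly.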