Javad Hosseini
Synthetic data generation for domain generalization: The cases of Natural Language Inference and Proposition Segmentation
Many NLP tasks are approached either by supervised training of language models on existing, human-annotated, task-specific data or by directly prompting large language models (LLMs), potentially with few-shot examples. Although supervised models report high performance on existing benchmarks, their performance on out-of-distribution data is often considerably worse. Few-shot prompting of LLMs, on the other hand, can generalize better out of domain, but it is very costly to run at large scales.
In this talk, I will present an approach that uses large language models (LLMs) to generate synthetic data for domain generalization of existing models. Our approach has three main steps: 1) generating synthetic raw text in many domains, covering different text lengths; 2) using the synthetic raw text and a teacher LLM trained on existing human-annotated data to generate similar labeled data in those domains; 3) using the generated data to train scalable student models. We applied this recipe to two NLP tasks: Natural Language Inference, which has large human-annotated datasets, and Abstractive Proposition Segmentation, which has a relatively small set of annotated data. In both cases, we show that the resulting student models perform considerably better on out-of-domain datasets than models trained on the original human-annotated data alone.
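The three-step recipe above can be sketched in code. This is a minimal, hypothetical illustration: the domain list, the LLM calls, and the student training are all mocked placeholders, not the actual models or prompts used in the work.

```python
# Hypothetical sketch of the three-step synthetic-data recipe.
# All "model" calls below are mocks standing in for real LLMs.

import random

random.seed(0)

DOMAINS = ["news", "reviews", "dialogue", "legal"]  # assumed example domains

def generate_raw_text(domain, n):
    """Step 1: synthesize raw text in a given domain (mocked)."""
    return [f"[{domain}] synthetic passage {i}" for i in range(n)]

def teacher_annotate(premise):
    """Step 2: a teacher LLM, trained on human-annotated NLI data,
    turns synthetic raw text into a labeled example (mocked)."""
    label = random.choice(["entailment", "neutral", "contradiction"])
    return {"premise": premise,
            "hypothesis": f"hypothesis for {premise}",
            "label": label}

def train_student(examples):
    """Step 3: train a small, scalable student model on the generated
    data (mocked here as collecting label statistics)."""
    counts = {}
    for ex in examples:
        counts[ex["label"]] = counts.get(ex["label"], 0) + 1
    return counts

synthetic = [teacher_annotate(p)
             for d in DOMAINS
             for p in generate_raw_text(d, 5)]
student = train_student(synthetic)
print(len(synthetic), sorted(student))
```

In the actual approach, the student would be a compact supervised model trained on the teacher-labeled synthetic examples, which is what makes it cheap to run at scale compared with prompting the LLM directly.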