Challenge Set Evaluation for User-Centric Question Answering


To appear at EACL 2021

QA systems are used in a variety of consumer-facing applications, but traditional QA evaluations do not yet reflect diverse user needs in realistic use cases. One such mismatch is that users often reach a QA system through an intermediate interface, such as speaking to a voice assistant, typing a question into a search engine, or translating a question into a language the QA system supports. Each of these interfaces can introduce errors before the question reaches the model.

In this work, we advocate that practitioners construct a range of evaluations, reflecting real-world usage scenarios and potential users for their systems. We present:

  • A detailed description of interface noise and associated ‘challenges of the channel’ for QA systems. We examine errors introduced by three kinds of interfaces: speech recognizers, keyboard interfaces, and translation engines.
  • An extensive analysis of interface noise and its impact on downstream question answering performance.
  • An initial exploration of mitigation strategies for interface errors. Our mitigation strategies reduce BERT errors on interface noise by 31% on average.

You can find all the details in our paper. Our error analysis highlights considerations to inform the design of future QA systems and evaluations. We emphasize the need for holistic evaluation of QA systems, showing that model selection based on benchmark dataset performance alone can be misleading. Below we describe the three kinds of interface noise:

Speech Interfaces

ASR (automatic speech recognition) noise simulates the errors introduced when users interact with QA systems through a speech interface. This type of noise has become more prominent with the advent of voice assistants, and practitioners need to handle such errors to accommodate users who rely on voice input.

Translation Interfaces

Translation noise replicates the effect of users passing their questions through a translation system. For many languages, reference information for a particular query might not be readily available online, so users may employ a machine translation engine to interact with a QA system built for another language.
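
As a rough illustration, one common way to approximate translation noise when only questions in the QA system's language are available is round-trip translation: translate the question into a pivot language and back. The sketch below assumes this setup and is not necessarily the procedure used in the paper. The names round_trip, Translator, and toy_translate are hypothetical, and toy_translate is a tiny lookup table standing in for a real machine translation engine.

```python
from typing import Callable

# Assumed signature for a translation backend: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]


def round_trip(question: str, translate: Translator,
               source: str = "en", pivot: str = "de") -> str:
    """Translate a question into a pivot language and back again."""
    pivoted = translate(question, source, pivot)
    return translate(pivoted, pivot, source)


# Toy stand-in for an MT engine (hypothetical data, for illustration only).
# Both "movie" and "film" map to the same pivot word, so the round trip is lossy.
TOY_LEXICON = {
    ("en", "de"): {"movie": "Film", "film": "Film"},
    ("de", "en"): {"Film": "film"},
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Word-by-word lookup; unknown words are passed through unchanged."""
    table = TOY_LEXICON.get((src, tgt), {})
    return " ".join(table.get(token, token) for token in text.split())


noisy_question = round_trip("who directed the movie", toy_translate)
```

In practice, toy_translate would be replaced by a call to whichever translation engine the target users are expected to rely on, since different engines introduce different errors.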

Keyboard Interfaces

Keyboard noise refers to errors that originate in the process of typing. Specifically, we focus on misspellings caused by the keyboard layout (for example, hitting an adjacent key) rather than by a user's misconception of the correct spelling. These typos can present a challenge when queries are entered into a search engine, and we find that a simple spellchecker might not be enough to combat them.
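
To make the idea of layout-driven typos concrete, here is a minimal sketch of synthetic keyboard noise. It assumes a US QWERTY layout, approximates key adjacency with a simple grid (ignoring the physical stagger between rows), and only models substitutions. The function names and the error_rate parameter are hypothetical choices for this example, not the noise model from the paper.

```python
import random

# Assumption: US QWERTY layout, letters only, rows treated as an aligned grid.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows):
    """Map each key to the keys next to it on the same row or the rows above/below."""
    neighbors = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            adjacent = set()
            for dr in (-1, 0, 1):
                rr = r + dr
                if not 0 <= rr < len(rows):
                    continue
                for dc in (-1, 0, 1):
                    cc = c + dc
                    if (dr, dc) != (0, 0) and 0 <= cc < len(rows[rr]):
                        adjacent.add(rows[rr][cc])
            neighbors[key] = sorted(adjacent)
    return neighbors


NEIGHBORS = build_neighbors(QWERTY_ROWS)


def add_keyboard_noise(text, error_rate=0.1, seed=None):
    """Replace each letter with a neighboring key with probability error_rate."""
    rng = random.Random(seed)
    noisy = []
    for ch in text:
        lower = ch.lower()
        if lower in NEIGHBORS and rng.random() < error_rate:
            typo = rng.choice(NEIGHBORS[lower])
            # Preserve the original capitalization of the character.
            noisy.append(typo.upper() if ch.isupper() else typo)
        else:
            noisy.append(ch)
    return "".join(noisy)


noisy_query = add_keyboard_noise("who wrote the declaration of independence",
                                 error_rate=0.1, seed=13)
```

Fixing the seed makes a noisy challenge set reproducible across runs. A fuller simulation would also need insertions, deletions, and transpositions, which this sketch leaves out.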