QA systems are used in a variety of consumer-facing applications, but traditional QA evaluations do not yet reflect diverse user needs in realistic use cases. One such mismatch is that users interact with QA systems through interfaces: speaking to voice assistants, typing questions into a search engine, or even translating questions into languages supported by the QA system.
In this work, we advocate that practitioners construct a range of evaluations reflecting real-world usage scenarios and the potential users of their systems. We present three types of interface-induced noise:
ASR noise simulates the errors introduced when users interact with QA systems through a speech interface. This type of noise has become more prominent with the advent of voice assistants, and practitioners need to handle such errors to accommodate users who rely on voice input.
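As a rough illustration, the sketch below mimics sound-alike confusions with a hand-written homophone table; a fuller simulation would round-trip text through TTS and ASR engines, and both the `HOMOPHONES` table and the `add_asr_noise` helper here are hypothetical stand-ins rather than our actual pipeline.

```python
import random

# Toy homophone table; a real simulation would round-trip text through
# TTS and ASR systems. Both the table and the helper are illustrative.
HOMOPHONES = {
    "two": ["to", "too"], "to": ["two", "too"], "too": ["to", "two"],
    "their": ["there"], "there": ["their"],
    "one": ["won"], "won": ["one"],
    "right": ["write"], "write": ["right"],
}

def add_asr_noise(question: str, p: float = 0.5, seed: int = 0) -> str:
    """Swap words for homophones with probability p, mimicking the
    sound-alike confusions a speech recognizer can introduce."""
    rng = random.Random(seed)
    noisy = []
    for word in question.split():
        alternatives = HOMOPHONES.get(word.lower())
        if alternatives and rng.random() < p:
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(word)
    return " ".join(noisy)

print(add_asr_noise("who won the right to write the sequel"))
# e.g. "who one the right to right the sequel"
```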
Translation noise replicates the effect of users passing their questions through a translation system. For many languages, reference information for a particular query might not be readily available online, so users may employ a machine translation engine to interact with a QA system built for another language.
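One common way to simulate this is round-trip translation. The sketch below uses publicly available Helsinki-NLP MarianMT checkpoints from Hugging Face with German as the pivot language; the specific models and pivot are an illustrative choice, not necessarily the setup used here.

```python
from transformers import pipeline

# Round-trip an English question through German and back; machine
# translation artifacts accumulate in both directions.
en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def add_translation_noise(question: str) -> str:
    """Simulate a user routing their question through a translation
    engine: English -> German -> English."""
    german = en_de(question)[0]["translation_text"]
    return de_en(german)[0]["translation_text"]

print(add_translation_noise("how many moons does jupiter have"))
```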
Keyboard noise refers to errors that originate in the process of typing; specifically, we focus on misspellings caused by keyboard layout rather than by user misconceptions. These typos can present a challenge when queries are entered into a search engine, and we find that a simple spellchecker might not be enough to combat them.
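For illustration, the sketch below injects layout-driven typos by replacing characters with physically adjacent keys; the partial QWERTY adjacency map and the `add_keyboard_noise` helper are hypothetical examples rather than our exact procedure.

```python
import random

# Neighboring keys on a QWERTY layout (partial map, illustrative).
QWERTY_NEIGHBORS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "y": "tuh", "u": "yij", "i": "uok", "o": "ipl", "p": "ol",
    "a": "qsz", "s": "awdx", "d": "sefc", "f": "drgv", "g": "fthb",
    "h": "gyjn", "j": "hukm", "k": "jil", "l": "ko",
    "z": "asx", "x": "zsdc", "c": "xdfv", "v": "cfgb",
    "b": "vghn", "n": "bhjm", "m": "njk",
}

def add_keyboard_noise(question: str, p: float = 0.05, seed: int = 0) -> str:
    """Swap each character for a neighboring key with probability p,
    simulating layout-driven typos rather than misconceptions."""
    rng = random.Random(seed)
    noisy = []
    for ch in question:
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < p:
            noisy.append(rng.choice(neighbors))
        else:
            noisy.append(ch)
    return "".join(noisy)

print(add_keyboard_noise("where is the eiffel tower located", p=0.1))
```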