Replication Package of "Sorry, I don't Understand: Improving Voice User Interface Testing"
Voice-based virtual assistants are becoming increasingly popular. Such systems provide frameworks on which developers can build their own apps. End-users interact with these apps through a Voice User Interface (VUI), which allows them to perform actions through natural language commands. Testing such systems is far from trivial, mainly because the same command can be expressed through several semantically equivalent utterances, to all of which the VUI is expected to react correctly. To support developers in testing VUIs, Deep Learning (DL)-based tools have been integrated into development environments to generate paraphrases for selected seed utterances. This is the case, for example, of the Alexa Developer Console (ADC). Such tools, however, generate a limited number of paraphrases and do not cover several corner cases. In this paper, we introduce VUI-UPSET, a novel approach that aims at adapting chatbot-testing approaches to VUI testing, since both kinds of systems expose a similar natural-language-based interface to users. We conducted an empirical study to understand how VUI-UPSET compares to existing approaches in terms of (i) correctness of the generated paraphrases, and (ii) capability of revealing bugs. We manually analyzed 5,872 generated paraphrases, for a total of 13,310 evaluations. Our results show that the DL-based tool integrated in the ADC generates a significantly higher percentage of meaningful paraphrases than VUI-UPSET. However, VUI-UPSET generates more bug-revealing paraphrases, which allows developers to test their apps more thoroughly at the cost of discarding a larger number of irrelevant paraphrases.
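
To make the testing scenario concrete, below is a minimal illustrative sketch of how paraphrases of a seed utterance could be replayed against a VUI's intent resolution to flag inconsistencies. It is not part of VUI-UPSET or the ADC: the `resolve_intent` callable, the toy resolver, and the example paraphrases are hypothetical placeholders standing in for a real skill's natural language understanding and for the paraphrases produced by a generation tool.

```python
# Illustrative sketch (not the VUI-UPSET implementation): replay paraphrases of
# a seed utterance against an intent resolver and report mismatches, i.e.,
# candidate bug-revealing paraphrases to which the VUI reacts differently.

from typing import Callable, Iterable, List, Tuple


def check_paraphrases(
    seed: str,
    paraphrases: Iterable[str],
    resolve_intent: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (paraphrase, resolved_intent) pairs that diverge from the seed's intent."""
    expected = resolve_intent(seed)
    failures = []
    for utterance in paraphrases:
        actual = resolve_intent(utterance)
        if actual != expected:
            failures.append((utterance, actual))
    return failures


if __name__ == "__main__":
    # Toy resolver standing in for a deployed skill's NLU; a real test harness
    # would query the skill's interaction model instead.
    def toy_resolver(utterance: str) -> str:
        return "TurnOnLightIntent" if "turn on" in utterance.lower() else "FallbackIntent"

    seed = "turn on the kitchen light"
    paraphrases = [
        "switch on the light in the kitchen",   # equivalent wording the toy resolver misses
        "please turn on the kitchen light",
    ]
    for utterance, intent in check_paraphrases(seed, paraphrases, toy_resolver):
        print(f"Mismatch: {utterance!r} resolved to {intent}")
```

In this toy run, the first paraphrase resolves to a different intent than the seed, illustrating how a semantically equivalent utterance can expose a gap in the interaction model that a developer would then fix by extending the sample utterances.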