Aleksandar (Alex) Vakanski

 

Voice Interfaces

-- posted November 2016 --

Voice interfaces are software programs, supported by the necessary hardware, designed to enable verbal communication and interaction with computers, robots, and various mechatronic devices. The latest advances in deep artificial neural networks have substantially accelerated machines' proficiency in speech understanding and conversational interaction, by discovering patterns in large corpora of language data.

Voice interfaces can be broadly classified into three categories, based on their level of capability. The first level is speech recognition, that is, converting voice acquired as acoustic signals into a textual format for computer processing. The progress in speech recognition has been impressive, and nowadays this task can be performed very reliably: if one speaks clearly and without excessive background noise, it is very likely that the speech will be transcribed correctly.

The second level of capability is natural language processing, which means that a machine can interpret and understand human language. This is where things start to get tricky. Our language can be ambiguous and difficult for computers to interpret for a variety of reasons. For instance, the same word can be a noun, verb, or adjective depending on the context, and as a result, it may be difficult to derive the semantic relationships between the words in a sentence. Here is an example of a failed interpretation in a dialog with a chatbot: "Please call me a taxi. – OK, 'A Taxi', how can I help you?" Additionally, our speech often requires broad cultural and social knowledge to comprehend, and it may contain slang and local dialect words, all of which can present insurmountable obstacles for machines.

The third capability that conversational interfaces require is the ability to reply to a query or provide a requested service. If the request or question is focused on a specific task or topic, this is doable. For instance, assuming that a chatbot has access to the internet, it can easily answer questions about the local weather, traffic conditions, the operating hours of a store, or other miscellaneous facts, such as the score of last night's game. However, if the questions span different topics and require an opinion or insight, the chatbot can get confused and may provide not-so-intelligent responses. Related to this level of voice interaction is also the capability for text-to-speech synthesis. Earlier this year Google's DeepMind team proposed an approach for speech synthesis called WaveNet that can produce smooth, realistic-sounding speech, unlike older programs that produced choppy speech stitched together from prerecorded fragments.
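To make the three levels more concrete, here is a minimal Python sketch of such a pipeline. It assumes the open-source SpeechRecognition and NLTK packages and a trivial keyword-based answer step; it is only an illustration of the idea, not how any of the commercial assistants are actually built.

# A minimal sketch of a three-level voice interface pipeline:
# 1) speech recognition, 2) language interpretation, 3) response generation.
# Assumes: pip install SpeechRecognition nltk, plus the NLTK tagger data.

import nltk
import speech_recognition as sr

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)


def recognize_speech(wav_path):
    """Level 1: convert an audio recording into text."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)  # uses Google's free web API


def interpret(text):
    """Level 2: a very shallow 'understanding' step -- part-of-speech tags.
    The same word gets different tags in different contexts, e.g. 'book'
    is a verb in 'book a taxi' but a noun in 'read a book'."""
    return nltk.pos_tag(nltk.word_tokenize(text))


def respond(tagged_words):
    """Level 3: reply to a narrow, task-specific request (keyword lookup)."""
    words = {word.lower() for word, tag in tagged_words}
    if "weather" in words:
        return "It is sunny and 72 degrees."  # would come from a weather service
    if "taxi" in words:
        return "Calling a taxi for you now."
    return "Sorry, I did not understand that."


if __name__ == "__main__":
    # text = recognize_speech("request.wav")  # level 1, needs a .wav recording
    text = "Please call me a taxi"            # skip level 1 for this example
    tagged = interpret(text)
    print(tagged)
    print(respond(tagged))

A real system would replace each of these steps with far more capable learned models, but the division of labor into the same three stages carries over.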

The prevalent criterion for evaluating the intelligent capacity of a conversational bot is still the Turing test, proposed by the famous British scientist Alan Turing (known for laying the theoretical foundations of modern computing, as well as for his role in breaking the German ciphers during World War II). The Turing test involves a human evaluator holding conversations with both another person and a chatbot, where the participants do not see each other and the communication is conveyed through text messages. An intelligent machine passes the test if the interrogators cannot distinguish whether they conversed with a human or with a bot. This type of test is also known as the imitation game (which is also the title of the 2014 movie about Alan Turing). Up until now, no chatbot has passed the test (despite certain claims). If you have ever tried to chat with one of these bots, you will notice that at present it takes only a few questions on different topics to find out that you are not chatting with a human. Undoubtedly, as time passes it will become increasingly difficult to distinguish a program from a human, and at some point in the future we will witness a program passing the test.

One interesting example is a chatbot called Eugene, which personifies a 13-year-old boy that evades questions and constantly switches topics by asking irrelevant questions, as you would expect from a distracted teenager with a short attention span and a lack of general knowledge. A notoriously unsuccessful chatbot was Tay, developed by Microsoft, which was supposed to engage in casual online chats with people on Twitter and learn social and communication skills from those conversations. The intent was for the bot to gradually learn by imitating human responses on social media and to refine the learned skills over time. Unfortunately, the bot was not adequately prepared for attacks and abuse by users who taught it politically inappropriate and highly offensive views and ideas. Microsoft had to take the chatbot offline in less than 24 hours and publicly apologize, after Tay began tweeting hateful, racist, Nazi, and misogynistic statements and images, and directing all sorts of insulting remarks and profanities at other users.

Among the first widely available conversational apps was Apple's Siri. Today we also have Microsoft's Cortana and the Google Assistant for cell phones and computers. Amazon's Echo home speaker uses Alexa as a voice interface. Google has recently released a similar speaker, Google Home (sold for $129), which plays the role of a control center for smart home devices: using voice commands one can request a song, stream a movie to the TV, control the temperature, turn the lights on or off, order a pizza, and so on. Voice interfaces allow intelligent bots to provide many kinds of assistance based on knowledge about the world or about the user's personal preferences; they can bring new technologies closer to the less technologically savvy population, serve as virtual assistants and companions for the elderly, and shift the way we interact with computers and with processors embedded in various devices toward voice-based dialog and spoken communication.
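As a rough illustration of how a spoken command ends up controlling a device, here is a small, purely hypothetical keyword-based intent dispatcher in Python. Real assistants rely on learned language-understanding models and vendor-specific device APIs, none of which are shown here; the device actions below are stand-in print messages.

# A hypothetical dispatcher that maps a transcribed voice command
# to a smart-home action; the actions are placeholders, not real device calls.

def handle_command(text):
    command = text.lower()
    if "light" in command:
        action = "off" if "off" in command else "on"
        return "Turning the lights " + action + "."
    if "temperature" in command or "thermostat" in command:
        return "Setting the thermostat to 21 C."  # fixed placeholder setting
    if "play" in command:
        return "Playing music on the living-room speaker."
    if "pizza" in command:
        return "Placing a pizza order."
    return "Sorry, I can't do that yet."


if __name__ == "__main__":
    for spoken in ["Turn off the lights",
                   "Play some jazz",
                   "What is the meaning of life?"]:
        print(spoken, "->", handle_command(spoken))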

Here is a short video (5:01) about chatbots.

https://www.youtube.com/watch?v=MT4JWtm5n5M

Image source: http://www.jp.honda-ri.com/english/projects/intelligence/01.html/

 
