How Conversational AI systems understand language
Understanding speech is a nuanced problem for a computer program. The literal meaning of an utterance can differ from, or be shaped by, its tone, expression, and subtext.
Voice-activated systems like Alexa and Siri handle instructions well, but sustaining a dialogue seems to be a harder, more human problem to crack. Spoken Dialog Systems (SDS) parse speech to understand the user’s intentions. These systems analyze the semantic concepts involved in speech, that is, the ideas that give a phrase or sentence its general meaning.
Semantic concepts in human conversations may carry implicit, subversive, or double meanings. Humans are sarcastic and referential, and often share a mutual understanding of context and subtext. These are complexities that machines have to explore and infer.
Unsupervised learning, a subdiscipline of machine learning, is a statistical approach to extracting information, trends, and features from raw, unlabelled data. For example, natural language processing (NLP) programs and clustering techniques can summarize a corpus of students’ notes from a lecture to identify the main concepts that most of the notes share. The same principle is used to extract knowledge from raw speech data.
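As a rough illustration of the lecture-notes example, the sketch below finds the terms that most notes have in common. It uses plain document-frequency counting rather than a real clustering pipeline, and the notes, the stopword list, and the `shared_terms` helper are all made up for this example.

```python
from collections import Counter

# Hypothetical notes from three students on the same lecture.
notes = [
    "gradient descent minimizes the loss function",
    "the loss function is minimized with gradient descent steps",
    "gradient descent needs a learning rate",
]

# A tiny, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"the", "is", "a", "with", "of", "and"}

def shared_terms(documents, min_share=1.0):
    """Return the terms that appear in at least `min_share` of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()) - STOPWORDS)
    threshold = min_share * len(documents)
    return sorted(term for term, n in doc_freq.items() if n >= threshold)

print(shared_terms(notes))        # terms found in every note
print(shared_terms(notes, 0.6))   # terms found in most notes
```

No labels are supplied anywhere: the shared concepts emerge from the raw text alone, which is the property the speech systems below rely on.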
Systems like Siri have a Spoken Language Understanding (SLU) component that maps natural speech to its meaning. SDS use this meaning and the structure of the speech to understand users’ utterances. Most systems limit conversations to a few predefined topics, though there is a strong push to generalize and scale them.
The system proposed by Chen aims to understand intents using knowledge acquisition models and Matrix Factorization. These models identify both explicit and latent features. For instance, in the query “I would like a cheap restaurant”, the explicit features are ‘cheap’ and ‘restaurant’. Semantic concept induction maps these words to underlying concepts, or slots. So, ‘cheap’ is a measure of ‘expensiveness’, and a ‘restaurant’ is a ‘target’ location.
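To make the word-to-slot idea concrete, here is a toy sketch. The lookup table stands in for the induction step, which the real system learns from data; `SLOT_LEXICON` and `induce_slots` are hypothetical names, and only the two mappings come from the example above.

```python
# Hypothetical word-to-slot lexicon; a real system would learn this mapping.
SLOT_LEXICON = {
    "cheap": "expensiveness",
    "restaurant": "target",
}

def induce_slots(utterance):
    """Map each known word in the utterance to its slot."""
    words = utterance.lower().split()
    return {word: SLOT_LEXICON[word] for word in words if word in SLOT_LEXICON}

print(induce_slots("I would like a cheap restaurant"))
```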
The concept relational model correlates underlying concepts and assigns probabilities to the semantic influence of words that may not be present in the user’s query. Though ‘food’ isn’t explicitly mentioned in the utterance, the model learns that ‘food’ is associated with ‘restaurant’ with a probability of 0.85, as you can see in the bottom right of the above image. The system thus builds a matrix of the probabilities that concepts occur in sentences. Each row is an uttered sentence or phrase, and each column holds the probability that a concept shapes the meaning of that sentence.
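A minimal sketch of how such a matrix might be assembled is shown below. It is a simplification: columns are plain words rather than induced slots, and the relation table is hand-written. Only the 0.85 value for ‘restaurant’ and ‘food’ comes from the text; the second utterance, `COLUMNS`, `RELATED`, and `utterance_row` are assumptions for illustration.

```python
# Columns of the matrix: one per concept (simplified to plain words here).
COLUMNS = ["cheap", "restaurant", "food"]

# Hypothetical relation table: word -> {related concept: probability}.
RELATED = {"restaurant": {"food": 0.85}}

def utterance_row(utterance):
    """Build one matrix row: 1.0 for observed concepts, a relation
    probability for concepts implied by an observed word, else 0.0."""
    words = set(utterance.lower().split())
    row = {concept: 1.0 if concept in words else 0.0 for concept in COLUMNS}
    for word in words:
        for concept, prob in RELATED.get(word, {}).items():
            row[concept] = max(row[concept], prob)
    return [row[concept] for concept in COLUMNS]

utterances = ["I would like a cheap restaurant", "find food nearby"]
matrix = [utterance_row(u) for u in utterances]
for u, row in zip(utterances, matrix):
    print(row, "<-", u)
```

The first row gets a non-zero ‘food’ entry even though the word never appears in the utterance, which is the effect the relational model is after.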
A pre-trained semantic parser automatically extracts a graph of interdependent concepts from an utterance, as in the image above. A word-based lexical knowledge graph, conceptually exemplified in the image below, is used alongside it to capture the ontological structure of the content. Both graphs are generated as soon as the utterance is perceived.
The aforementioned probability matrix is combined with the matrices representing the two knowledge graphs, visualized above, to form the final matrix. Matrix factorization can then be used to extract latent features from this matrix. This linear algebraic technique is chosen because it copes well with noisy data, hidden information, and observational dependencies.
The goal of factorization is to use the existing knowledge, captured in the matrix and its factors, to infer whether a particular concept is present in an utterance. This lets the SDS understand context and hidden meanings. It also helps resolve ambiguities, since the system has an ontological, relational, and structural understanding of each word in a sentence.
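The sketch below shows the general idea of matrix factorization on a tiny utterance-by-concept matrix with one unknown entry. It uses plain stochastic gradient descent on squared error with L2 regularization, which is a common textbook formulation and not necessarily the training objective used in the paper. The matrix values, hyperparameters, and the `factorize` and `predict` helpers are all assumptions chosen for illustration.

```python
import random

def factorize(matrix, k=2, epochs=2000, lr=0.05, reg=0.01, seed=0):
    """Approximate `matrix` as U x V^T using only its known (non-None) entries."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Small positive random starting values for the latent factors.
    U = [[rng.uniform(0, 0.01) for _ in range(k)] for _ in range(n_rows)]
    V = [[rng.uniform(0, 0.01) for _ in range(k)] for _ in range(n_cols)]
    observed = [(i, j, value)
                for i, row in enumerate(matrix)
                for j, value in enumerate(row) if value is not None]
    for _ in range(epochs):
        for i, j, value in observed:
            err = value - sum(U[i][f] * V[j][f] for f in range(k))
            for f in range(k):
                u, v = U[i][f], V[j][f]
                # Gradient step on squared error with L2 regularization.
                U[i][f] += lr * (err * v - reg * u)
                V[j][f] += lr * (err * u - reg * v)
    return U, V

def predict(U, V, i, j):
    """Reconstructed score for concept j in utterance i."""
    return sum(u * v for u, v in zip(U[i], V[j]))

# Rows: utterances. Columns: cheap, restaurant, food. None marks an unknown entry.
M = [
    [1.0, 1.0, 0.85],
    [1.0, 1.0, None],   # is 'food' implied here as well?
    [0.0, 0.0, 0.0],
]

U, V = factorize(M)
print(round(predict(U, V, 1, 2), 2))  # inferred score for the unknown entry
```

Because the second utterance shares its known concepts with the first, the factors learned for it end up similar, and the unknown ‘food’ entry is filled in with a score close to the one observed in the first row.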
In addition to predicting semantic concepts, this system can also tell whether a query is domain-specific. Adding more domain knowledge can make this distinction more precise. Incorporating additional data, such as from Google’s knowledge graph, can help with generalization and scalability.
This article is a summary of this paper. I look forward to emotionally intelligent conversational AI systems.