Applying the Levels of Linguistics to Speech Recognition Systems
Speech recognition software has been available since the 1990s. Currently, speaker dependent systems that require some
training can capture continuous speech with a large vocabulary with an accuracy
of 98% (two words out of a hundred wrong) under optimal conditions.
Despite the apparent success of the technology, few people use it as an
alternative to the keyboard, even though most people can talk
considerably faster than they can type.
Speech recognition systems have been very successful in other areas such
as cell-phone dialing, where a system with a minimal amount of training can
recognize a small number of words very accurately, as spoken by most English
speakers and in the worst ambient-noise environments. Most of speech
recognition software’s recent success has come from directly matching
sounds with words. That approach has some theoretical
limitations that may explain why its use hasn’t caught on for general
dictation:
Computer speech recognition accuracy is very variable. The accuracy drops significantly depending on the particular speaker or the ambient noise, unlike human speech recognition, which is relatively insensitive to speaker or environmental differences.
The interpretation of many words and phrases is context sensitive, as with homophones, words where a single pronunciation can have two or more meanings (e.g., to/two/too; flower/flour).
Intonation and speech timbre can completely change the correct interpretation of a word or sentence. As an example, the meanings of the phrases “Stop!” and “Stop?” are easily differentiated by a human, but not so easily by a computer.
Microsoft’s Speech Application Programming Interface (SAPI) has arguably been the most successful at increasing its accuracy rate over time compared to its competitors. Though there are alternatives, SAPI dominates the market and will probably continue to do so; it has become the standard for speech recognition. In addition, Microsoft sets standards for how programmers interface with linguistic rules to increase speech recognition accuracy. Therefore, programmers and those writing the programmers’ specifications will need at least a passing understanding of linguistic levels to deal with Microsoft’s environment.
The goal of the paper is to explore the levels of linguistics and review their use in increasing the accuracy of speech recognition using SAPI, Microsoft’s speech recognition tool.
Phonetics, the first level of linguistics, is the study of the physical nature of the sounds pronounced in a language. The base unit of phonetics is the “phone”, which is defined as a discrete sound of human speech. Therefore, phonetics studies the conversion of sounds into phones, regardless of the language used.
Most desktop computers with the Windows operating system have a built-in speech recognition function called SAPI to perform this phonetic translation.[1] Technically, SAPI is not a software program, but rather a standard for other software applications to interface with the speech recognizer within the Windows operating system. SAPI converts spoken input into recognized text.
In order to accomplish speech recognition as accurately as possible, SAPI processes information from different levels of language. As SAPI uses phonetics to acoustically match sound waves with the phones of a language, SAPI operates as a “black box” interface, where neither the programmer nor the user has access to, or needs to care about, how SAPI translates sounds. SAPI converts sound waves into phones without any user or programmer intervention. However, there are some aspects of the “black box” that can be controlled by decisions or input from programmers or users. For example, specifying the language changes how the sounds are mapped to phones.
The “black box” of phonetics within SAPI has to deal with the problem that no two people say the same phone in exactly the same way. Consequently, if SAPI uses a generalized mapping of sounds to phones across all speakers, the accuracy of speech recognition is less than optimal. Therefore, SAPI allows for the creation of Recognition Profiles that map sounds to phones for a specific speaker. These profiles allow SAPI to provide a higher level of speech recognition accuracy for a particular person. In order to facilitate updating of such speaker profiles, Microsoft provides add-on software such as Microsoft's Recognizer 5.1.
To create a recognition profile, the speaker reads aloud one of several “training” passages provided by Microsoft as may be seen in Figure 1. Since SAPI knows the exact sequence of phones in the chosen passage, it can reconfigure its recognition of phones for that particular speaker. Though the Virtual Patient can use SAPI in a speaker-independent mode, the use of a recognition profile will increase the accuracy of speech recognition by as much as 20%[2].
Figure 1 – Training a Recognition Profile
In summary, SAPI accomplishes the phonetic level of speech recognition by transparently converting the sound waves of speech into phones. Any changes necessary at this level are handled either by specifying the language or by building a recognition profile for a particular speaker. However, a speech recognition system that operates at just the phonetic level of speech will not be very accurate in identifying which words are spoken because phones do not map directly to words. The next section discusses phonology, which focuses on phonemes as an intermediate unit between phones and words.
Phonology, the next level of linguistics, studies the conversion of phones into phonemes, which are the sounds perceived as being different by the speakers of a language. Some languages treat two phones as the same sound. For example, there are two different ways to say the sound “p” in English, one with a small puff of air at the end and the other without (i.e., aspirated “p” vs. unaspirated “p”). Both forms of “p” are phones of the English language because they are acoustically different. However, they map to the same phoneme because speakers of English do not perceive these acoustic differences. They hear both of these phones as the same, namely the phoneme “p”. Because speakers differentiate words as sequences of phonemes and not as sequences of phones, the word “stop” can be said using either form of “p”.
Phones and phonemes do not have to have a one-to-one correspondence in a language. As an example, several different phones can represent one phoneme, such as the different sounds of "g" as used in the words "get," "anger" and "gentle." Because of this lack of correspondence, most speech recognition engines either do not operate on this level or hide their use of this level from the programmer and the user. SAPI deals with the problem of mapping different phones to their respective phonemes in ways that are transparent to the programmer.
However, SAPI does allow the programmer to have more input with regard to the mapping of sequences of phonemes to words by specifying the sequence of phonemes using Microsoft’s SYM phoneme set, i.e., SYMbolic phonetic representation. SAPI provides four different phoneme tables for different languages (international, Chinese, Japanese and English); the American English table is shown in Table 1.
| SYM | Example | PhoneID |
|---|---|---|
| - | syllable boundary (hyphen) | 1 |
| ! | Sentence terminator (exclamation mark) | 2 |
| & | word boundary | 3 |
| , | Sentence terminator (comma) | 4 |
| . | Sentence terminator (period) | 5 |
| ? | Sentence terminator (question mark) | 6 |
| _ | Silence (underscore) | 7 |
| 1 | Primary stress | 8 |
| 2 | Secondary stress | 9 |
| aa | Father | 10 |
| ae | Cat | 11 |
| ah | Cut | 12 |
| ao | Dog | 13 |
| aw | Foul | 14 |
| ax | Ago | 15 |
| ay | Bite | 16 |
| b | Big | 17 |
| ch | Chin | 18 |
| d | Dig | 19 |
| dh | Then | 20 |
| eh | Pet | 21 |
| er | Fur | 22 |
| ey | Ate | 23 |
| f | Fork | 24 |
| g | Gut | 25 |
| h | Help | 26 |
| ih | Fill | 27 |
| iy | Feel | 28 |
| jh | Joy | 29 |
| k | Cut | 30 |
| l | Lid | 31 |
| m | Mat | 32 |
| n | No | 33 |
| ng | Sing | 34 |
| ow | Go | 35 |
| oy | Toy | 36 |
| p | Put | 37 |
| r | Red | 38 |
| s | Sit | 39 |
| sh | She | 40 |
| t | Talk | 41 |
| th | Thin | 42 |
| uh | Book | 43 |
| uw | Too | 44 |
| v | Vat | 45 |
| w | With | 46 |
| y | Yard | 47 |
| z | Zap | 48 |
| zh | Pleasure | 49 |
Table 1 – SAPI’s American English Phoneme Table
As an example, in Microsoft’s SYM phoneme set, the standard pronunciation of “hello” in English is represented as "h eh l ow". If the programmer wanted to increase speech recognition accuracy, a syllable boundary (heard as a slight pause) could be added, giving "h eh - l ow". If the programmer wanted to make sure the program could recognize the phoneme combination when the primary stress falls on the second syllable, the programmer would add "h eh - l ow 1". The symbols within each pronunciation are space delimited.
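The relationship between a SYM string and the PhoneID column of Table 1 can be illustrated with a short Python sketch. It is not SAPI code: the function name `sym_to_phone_ids` is hypothetical, and only the handful of Table 1 symbols needed for “hello” are included.

```python
# Subset of Table 1 (SYM symbol -> PhoneID); only the symbols used below.
SYM_TO_PHONE_ID = {
    "-": 1,    # syllable boundary
    "1": 8,    # primary stress
    "2": 9,    # secondary stress
    "eh": 21,
    "h": 26,
    "l": 31,
    "ow": 35,
}


def sym_to_phone_ids(pronunciation):
    """Convert a space-delimited SYM string into a list of PhoneIDs."""
    ids = []
    for symbol in pronunciation.split():
        if symbol not in SYM_TO_PHONE_ID:
            raise ValueError(f"unknown SYM symbol: {symbol!r}")
        ids.append(SYM_TO_PHONE_ID[symbol])
    return ids


print(sym_to_phone_ids("h eh l ow"))      # [26, 21, 31, 35]
print(sym_to_phone_ids("h eh - l ow 1"))  # [26, 21, 1, 31, 35, 8]
```

The second call shows that the syllable boundary and the stress marker are ordinary entries in the table, so they travel with the pronunciation like any other symbol.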
SAPI refers to this method of translating phonemes to text as Dictation. Dictation doesn’t use rules of speech; the determination of what’s said is based only on vocabulary. Dictation uses a large translation table called a Dictionary, which can contain either words applying to a wide range of topics or words used in a particular context, such as medical or legal. An example of a medical dictionary entry might be the mapping of the pronunciation “s k l eh r - ow 1 - s eh s” to the word “Sclerosis”.
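A dictionary of this kind is essentially a lookup table from phoneme sequences to words. The sketch below is an illustration in Python rather than SAPI's actual data structure; the names `MEDICAL_DICTIONARY` and `lookup_word` are hypothetical, and ignoring boundary and stress markers during lookup is a simplifying assumption made here so that variant transcriptions of one word compare equal.

```python
# Hypothetical pronunciation dictionary: SYM string -> written word.
MEDICAL_DICTIONARY = {
    "s k l eh r - ow 1 - s eh s": "Sclerosis",
    "h eh - l ow 1": "hello",
}

# Table 1 symbols that mark boundaries or stress rather than sounds.
NON_SOUND_SYMBOLS = {"-", "1", "2"}


def normalize(pronunciation):
    """Drop boundary and stress markers, keeping only the phoneme symbols."""
    return tuple(s for s in pronunciation.split() if s not in NON_SOUND_SYMBOLS)


# Index the dictionary by its marker-free phoneme sequence.
INDEX = {normalize(p): word for p, word in MEDICAL_DICTIONARY.items()}


def lookup_word(pronunciation):
    """Return the dictionary word for a phoneme sequence, or None if absent."""
    return INDEX.get(normalize(pronunciation))


print(lookup_word("s k l eh r ow s eh s"))  # Sclerosis
print(lookup_word("h eh l ow"))             # hello
print(lookup_word("k ae t"))                # None
```

Because the lookup depends only on vocabulary, a word that is missing from the table simply cannot be recognized, which is the limitation the later levels try to soften.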
To summarize, the phonological level handles the translation from phones, or individual sounds, to phonemes, the basic units of a language. This section also introduced some of the concepts necessary to understand the conversion of sequences of phonemes to words. The next section discusses morphology, the layer between phonology and syntax.
Morphology, the next level of linguistics, studies the conversion of phonemes into morphemes, which are the smallest units of a language that carry meaning. As an example, the word “student” consists of a single morpheme - a learner enrolled in an educational institution - while the word “students” consists of two morphemes, student and –s, with the second morpheme adding the meaning “more than one”.
Morphemes can be subcategorized into stems and affixes, with the “stem” morpheme supplying the main meaning and the “affix” morpheme either modifying the meaning slightly or specifying the meaning more precisely. In the English language, affixes are further divided into prefixes which precede the stem and suffixes which follow the stem.
Speech recognition engine accuracy benefits from applying rules relating to prefixes and suffixes. If word recognition were based just on lower levels of language, both singular and plural versions of a noun would need to be in the list of recognized sound-to-word translations. As an example, since the suffix “-s” usually means plural, the engine can recognize that the morpheme “s” at the end of a word usually refers to the plural of the word. Consequently, instead of multiple lists of all the variations of the same word, the engine can more efficiently use a list of root morphemes and the rules for applying affixes.
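The idea of storing roots plus affix rules, instead of every inflected form, can be sketched in a few lines of Python. The root list, the single suffix rule and the function name `analyze` are all illustrative assumptions; a real engine would need many more rules and would have to handle spelling changes such as “candy”/“candies”.

```python
# Hypothetical list of root morphemes known to the recognizer.
ROOTS = {"student", "cat", "dog", "bridge"}

# Each suffix rule pairs a suffix with the meaning it adds to the root.
SUFFIX_RULES = [("s", "plural")]


def analyze(word):
    """Return (root, [affix meanings]), or None if the word cannot be analyzed."""
    if word in ROOTS:
        return word, []
    for suffix, meaning in SUFFIX_RULES:
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in ROOTS:
            return root, [meaning]
    return None


print(analyze("student"))   # ('student', [])
print(analyze("students"))  # ('student', ['plural'])
print(analyze("glass"))     # None
```

With four roots and one rule the table already covers eight word forms, which is the efficiency gain the paragraph above describes.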
Applying morphological rules can significantly increase speech recognition accuracy. The next section discusses the syntax level, which deals with combining words into sentences.
Syntax, the next level of linguistics, studies the rules for combining words into sentences. Syntax is basically concerned with word order and looks at how words can be categorized into different parts of speech and what rules are necessary to combine these parts of speech into a recognized sentence structure. The syntax level is very useful in speech recognition systems because it gives a significant amount of information about the translation of a particular word by looking at its neighbors.
Identifying how a word is used in a sentence relies upon the classification of the word into one of the eight parts-of-speech: noun, verb, pronoun, preposition, adverb, conjunction, participle and determiner. Knowing a particular word’s part-of-speech helps the program to determine the remaining words in the sentence. As an example, a possessive pronoun (my, your, his, her, its) is likely to be followed by a noun or an adjective, as in the sentences “My ball is brown” or “My youngest daughter plays soccer”. A personal pronoun (I, you, he or me) usually precedes or follows a verb, such as “I love you”.
Some word classes are more distinctly used than others, such as prepositions (at, by, over, under, etc.) and conjunctions (and, but, or, etc.). There are few words in English belonging to these classes and how they are used doesn’t change much over time. As such, the rules regarding their use can be easily generalized between sentences. In contrast, there are several thousand nouns, with new nouns being added to the language almost daily. Nouns are complex because they may be used in several distinctly different senses. As an example, the word "set" has 430 definitions in the Oxford English Dictionary (an entry of roughly 60,000 words), each with a corresponding part-of-speech categorization. To add to the complexity, some words such as “bridge” can be used as either a noun or a verb and require a more complex process to determine how they are being used in a sentence.
Sometimes the pronunciation of a word will provide an insight as to how the word is being used in a sentence. As an example, the noun “content” is pronounced as “CON tent” whereas the adjective “content” is pronounced “con TENT”. Other examples include “ob JECT” (verb) and “OB ject” (noun), “DIS count” (noun) and “dis COUNT” (verb).
Once classified, the words in a sentence can be transformed into a hierarchical structure tree or tree diagram of the sentence. Such trees provide the following information about sentence structure:
the grouping of words into phrases and clauses
the syntactic categories of the word groupings
the hierarchical structure of the syntactic categories
As an example, the phrase structure tree for the sentence “The dog chased the cat into the garden” is shown in the following figure. Each word is categorized by part-of-speech (N for Noun, V for Verb, etc.). The words are then grouped into nodes, such as NP for Noun Phrase, VP for Verb Phrase or PP for Prepositional Phrase.
Figure 2 - Example of a Phrase Structure Tree
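A phrase structure tree of this kind can be represented in code as nested tuples. The Python sketch below is one plausible bracketing of the example sentence, assumed here for illustration (S → NP VP, VP → V NP PP, PP → P NP); it may differ in detail from the original figure, and the helper names `leaves` and `phrases` are hypothetical.

```python
# Phrases are (label, child, child, ...); leaves are (part_of_speech, word).
TREE = (
    "S",
    ("NP", ("Det", "The"), ("N", "dog")),
    ("VP",
        ("V", "chased"),
        ("NP", ("Det", "the"), ("N", "cat")),
        ("PP", ("P", "into"), ("NP", ("Det", "the"), ("N", "garden")))),
)


def is_leaf(node):
    """A leaf holds exactly one child, and that child is the word itself."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Return the words under a node in left-to-right order."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def phrases(node, label_wanted):
    """Return the text of every phrase-level node carrying the given label."""
    if is_leaf(node):
        return []
    found = [" ".join(leaves(node))] if node[0] == label_wanted else []
    for child in node[1:]:
        found.extend(phrases(child, label_wanted))
    return found


print(" ".join(leaves(TREE)))  # The dog chased the cat into the garden
print(phrases(TREE, "NP"))     # ['The dog', 'the cat', 'the garden']
print(phrases(TREE, "PP"))     # ['into the garden']
```

The three kinds of information listed above map directly onto this structure: the nesting gives the grouping, the first element of each tuple gives the syntactic category, and the depth gives the hierarchy.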
Each different hierarchical level has different associated rules to help determine how to categorize the components. One example of a higher level rule is the test for phrase movement, i.e. the noun phrase “the dog” could be moved to another part of the sentence and the sentence would remain grammatically correct. Another higher level rule example is to look for “units of meaning”, i.e., “the dog” corresponds to a unit of meaning but the phrase “chased the” doesn’t.
Syntax rules and sentence structuring can help a speech recognition system deal with many higher-level speech recognition problems, such as homophones, where one sequence of phonemes has two or more different meanings and is spelled as different words (“dear” and “deer”, or “to”, “two” and “too”). Any speech recognition system operating just by pairing sounds with words would be unable to differentiate between homophones. As an example, “I ate the two blue candies” and “Eye eight the too blew candies” are said the same way, but only the first is correct. The use of syntactic rules can help determine which word of each homophone pair to use. To determine whether to use “blew” or “blue” in a phrase such as “the blue candies”, a program might look at the preceding word “the”. “The” is usually followed by a noun or adjective, but not a verb. Since “blew” is a verb and “blue” is an adjective, the speech recognition program would choose “blue”.
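The “blew”/“blue” decision can be sketched as a filter on the part-of-speech of the preceding word. This Python fragment is a minimal illustration, not how any particular engine works; the lexicon, the rule table `ALLOWED_AFTER` and the function `choose` are assumptions introduced here, and only two rules from the text are encoded.

```python
# Hypothetical mini-lexicon: word -> part of speech.
LEXICON = {
    "the": "determiner",
    "i": "pronoun",
    "blue": "adjective",
    "blew": "verb",
    "ate": "verb",
    "eight": "number",
}

# Which parts of speech may follow a given part of speech.
ALLOWED_AFTER = {
    "determiner": {"noun", "adjective"},  # "the" is not followed by a verb
    "pronoun": {"verb"},                  # "I" usually precedes a verb
}


def choose(previous_word, candidates):
    """Keep only the homophone candidates that may follow the previous word."""
    allowed = ALLOWED_AFTER.get(LEXICON.get(previous_word.lower()))
    if allowed is None:
        return list(candidates)  # no rule applies, so the choice stays open
    return [word for word in candidates if LEXICON.get(word) in allowed]


print(choose("the", ["blew", "blue"]))   # ['blue']
print(choose("I", ["ate", "eight"]))     # ['ate']
print(choose("blue", ["blew", "blue"]))  # ['blew', 'blue']
```

When no rule applies, the function returns every candidate unchanged, which reflects the fact that syntax narrows the choice but does not always settle it.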
One major weakness of this technique is the requirement that each entry in a speech recognition system’s dictionary must also specify the part-of-speech. Words will occur that aren’t in the dictionary, especially in the case of proper names and acronyms. To handle this, the best speech recognition algorithms use clues to determine the highest probability match for the word’s part-of-speech. As an example, words ending in “s” are likely to be plural nouns, whereas words ending in “ed” are probably past participles. If there are no clues, the program will assume that the part-of-speech is probably a noun, with a verb as the next most likely candidate, since this is true of most words in the English language.
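The fallback heuristics in this paragraph amount to an ordered list of guesses. The Python sketch below encodes only the three clues mentioned in the text; the function name `guess_part_of_speech` is hypothetical, and a production system would weigh such clues probabilistically rather than apply them as fixed rules.

```python
def guess_part_of_speech(word):
    """Return candidate parts of speech for an unknown word, most likely first."""
    lower = word.lower()
    if lower.endswith("ed"):
        return ["past participle"]
    if lower.endswith("s"):
        return ["plural noun"]
    # No clue available: a noun is most likely, a verb is the next candidate.
    return ["noun", "verb"]


print(guess_part_of_speech("chased"))      # ['past participle']
print(guess_part_of_speech("kiosks"))      # ['plural noun']
print(guess_part_of_speech("Montgomery"))  # ['noun', 'verb']
```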
SAPI allows the programmer to write syntax rules by providing an alternative to Dictation called “Command & Control” or C&C. The C&C mode combines the list of valid words with structure rules and a list of sets to be used within the rules. The addition of rules allows the C&C mode to handle variable vocabularies and a wide variety of words more efficiently.
The rules as defined by SAPI can be very simplistic. As an example, a rule could be a list of words to recognize as verbs in a particular context, such as “walk, run, etc.” Another rule could limit the phrase structures to look for in a particular sentence.
In the C&C mode, rules can call other rules. A programmer can expand and build the vocabulary very easily by expanding the rule sets. In addition, the sets used by the rules are easily expanded while the program is running. A system can be created with a basic list such as “Chicago”, “Detroit” and the program can expand the list over time. Therefore, a basic system can adapt to meet the needs of different users. The C&C mode can offer a high degree of flexibility with very little development cost or complication. Limited systems that execute simple commands can be built very easily.
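The combination of rules that call other rules and sets that grow at run time can be sketched as follows. This is plain Python, not SAPI's grammar format or API; the rule names, the `SETS` table and the functions `match_symbol` and `recognizes` are hypothetical, and set entries are restricted to single words to keep the matcher short.

```python
# Rules map a name to alternatives; each alternative is a sequence of
# rule names, set names or literal words. Rules may refer to other rules.
RULES = {
    "command": [["verb", "to", "city"]],
    "verb": [["fly"], ["drive"]],
}

# Sets can be extended while the program is running.
SETS = {"city": {"chicago", "detroit"}}


def match_symbol(symbol, words, start=0):
    """Return every position at which `symbol` can end when matched from `start`."""
    if symbol in SETS:
        hit = start < len(words) and words[start] in SETS[symbol]
        return {start + 1} if hit else set()
    if symbol in RULES:
        ends = set()
        for alternative in RULES[symbol]:
            positions = {start}
            for part in alternative:
                positions = {e for p in positions for e in match_symbol(part, words, p)}
            ends |= positions
        return ends
    # Anything else is treated as a literal word.
    hit = start < len(words) and words[start] == symbol
    return {start + 1} if hit else set()


def recognizes(utterance):
    """True if the whole utterance can be derived from the 'command' rule."""
    words = utterance.lower().split()
    return len(words) in match_symbol("command", words)


print(recognizes("fly to Chicago"))  # True
print(recognizes("fly to Boston"))   # False
SETS["city"].add("boston")           # expand the list at run time
print(recognizes("fly to Boston"))   # True
```

Adding “boston” changes what the system accepts without touching any rule, which is the kind of low-cost adaptation the paragraph above describes.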
The rules for a very complex C&C mode are not easily created. Performing more complex tasks requires a larger set of more complex rules. As the number of rules increases, the chances of rules conflicting increase significantly and processing time gets longer. This decreases the practicality of using this mode for complex tasks such as resolving the ambiguity of the sentence “I hit the man with the gun” (which could mean either “I used the gun to hit the man” or “I hit the man who had a gun”).[4]
In summary, the syntax level looks at how words can be categorized and structured in sentences to increase the accuracy of speech recognition. This section also introduced several concepts valuable in speech recognition such as parts-of-speech, phrase structure trees and SAPI’s Command and Control mode. The next section discusses semantics.
Semantics is concerned with the meaning of words, expressions and sentences and is a different way of approaching the translation of speech to text. The base-unit of semantics is the “lexeme” which is a recognized symbol made up of one or more words that are shared across a particular culture or group’s use of language. At its most simplistic level, the study of semantics looks at the answers to such questions as “Is this lexeme true or false?” meaning does a concept match known facts, or “Is there enough information in the lexeme to determine its meaning?”
Lexemes can have connotative and/or denotative meanings. The denotative meaning is the meaning found in a dictionary and usually carries with it no emotional associations. The connotative meaning refers to the personal or emotional associations aroused by a word and is connected to the sociological use of the word in a particular culture.
Over time, as a culture changes, the connotative association of a word can change the denotative meaning and a new definition is created. One example of this is the word “gentleman”, which used to mean “someone who owned land”. In the Middle Ages, the sentence “He was a gentleman and a liar” made perfect sense: it meant that a particular male was both a landowner and untrustworthy. Over time, since most landowners came from a particular high-level class with a similar set of “good” behaviors, the word began to denote people from that class and eventually came to mean the behaviors associated with the class rather than the class itself. The sentence “He was a gentleman and a liar” now no longer makes sense, since the meaning of “gentleman” (i.e., someone with “good” behavior) conflicts with the known behavior of “liars”.
Semantics allows the listener to reinterpret what was said to match the context of the rest of the message. As an example, a person who heard the sentence “He had been shot in the belly and was fast drying …” may unconsciously change the last word to “dying”, since the word “dying” matches the context of the phrase “shot in the belly”.[5]
Semantics is a different approach to increasing speech recognition accuracy. The semantic level uses phrase structure rules, but classifies the constituents in terms of their function or meaning rather than their syntactic categories. As such, semantics shares a lot in common with the syntax level, including the use of functional classifications similar to parts-of-speech, grammar rules called semantic grammar and the need for a large semantic dictionary to accurately map a phrase to its meaning.
One minor but very practical example of the use of semantics in speech recognition is the “optional phrase”, usually an imperative preceded by a noun or pronoun of address. For example, the question "How can I help you?" could be preceded by "Sir," "Madam," or "Miss" which is an optional phrase.
root -> *title command
title -> sir | madam | miss
command -> how can I help you
The power of using optional phrases
comes from their ability to handle large variances in how data is presented to
the computer. The appendix contains a
detailed example of how 40 semantic grammar rules can handle over 2,664
different types of requests for information.
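To see how optional phrases multiply the number of accepted inputs, the three-rule grammar above can be expanded mechanically. The Python sketch below assumes that a leading asterisk marks an optional element, which is how the notation is read here; the `GRAMMAR` table and the `expand` function are illustrative and are not part of SAPI.

```python
from itertools import product

# The grammar from the text; "*" in front of a name marks it as optional.
GRAMMAR = {
    "root": [["*title", "command"]],
    "title": [["sir"], ["madam"], ["miss"]],
    "command": [["how can I help you"]],
}


def expand(symbol):
    """Return every phrase that the given symbol can produce."""
    optional = symbol.startswith("*")
    name = symbol.lstrip("*")
    if name not in GRAMMAR:
        produced = [name]  # terminal text
    else:
        produced = []
        for alternative in GRAMMAR[name]:
            parts = [expand(part) for part in alternative]
            for combo in product(*parts):
                produced.append(" ".join(p for p in combo if p))
    # An optional symbol may also produce nothing at all.
    return [""] + produced if optional else produced


for phrase in expand("root"):
    print(phrase)
# how can I help you
# sir how can I help you
# madam how can I help you
# miss how can I help you
```

Three rules yield four accepted phrases here; each additional optional element multiplies the total, which is how a few dozen rules can cover thousands of request forms.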
SAPI is currently undergoing major changes in this area of speech recognition. While this paper was being written, Microsoft released its Speech Application Software Development Kit v1.0 (SASDK), which incorporates the Semantic Markup Language, or SML. The SML returns its interpretation of what was said as a “semantic result”. As an example, if the words “the Airport at Montgomery, Alabama” were the input to a semantic grammar used in the context of specifying airports, SML can be programmed to return a value of “MGM”. In addition to the output phrase, the SASDK can also return the “utterance confidence”, a floating point value between 0.000 and 1.000 that communicates how well the SAPI core engine recognized each word in the phrase.
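An application consuming such a result typically needs both the semantic value and the confidence. The sketch below models that pairing in Python; it does not use the SASDK or SML, and the `SemanticResult` class, the `AIRPORTS` table, the `interpret` and `accept` functions and the 0.6 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SemanticResult:
    value: str         # the meaning returned by the grammar, e.g. an airport code
    confidence: float  # utterance confidence between 0.0 and 1.0


# Hypothetical semantic dictionary: recognized phrase -> semantic value.
AIRPORTS = {"the airport at montgomery alabama": "MGM"}


def interpret(recognized_text, confidence):
    """Map recognized text to a semantic result, or None if the phrase is unknown."""
    key = recognized_text.lower().replace(",", "")
    value = AIRPORTS.get(key)
    if value is None:
        return None
    return SemanticResult(value, confidence)


def accept(result, threshold=0.6):
    """Accept a result only if it exists and is confident enough (assumed cutoff)."""
    return result is not None and result.confidence >= threshold


result = interpret("the Airport at Montgomery, Alabama", 0.87)
print(result.value)    # MGM
print(accept(result))  # True
```

Separating interpretation from acceptance lets the application decide how to react to a low-confidence result, for instance by asking the speaker to repeat the request.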
To summarize, the integration of the meaning of lexemes with their associated connotations and implications can provide a speech recognition program with a much more powerful set of tools to determine not only what was said, but what was meant. This method of understanding how we communicate by looking at meaning is discussed further in the area of pragmatics, the highest level of linguistics.
Pragmatics is the study of the contribution of contextual factors to interpreting the meaning of speech. When a person says that a particular quote was taken “out of context”, he or she is pointing out a pragmatic difference. For people to speak effectively, they need to share a co-oriented background.
Pragmatics can be subdivided into three different areas: indirect meaning, episodes and conversations. Indirect meanings or inferences come from our culture and our relationships. As an example, most people in the North American culture, on hearing the phrase “the mother disciplined the child”, would be forgiven for making the inference that the child belonged to the mother. People bring in assumptions about culture, such as whose mother typically disciplines whose child. That same inference may not be common among people whose culture values communal childcare. The Sapir-Whorf hypothesis[6] holds that every culture's reality is at least to some degree created by the lexemes it chooses; among other things, its choice of lexemes shows what a culture thinks is important enough to differentiate (Inuit Eskimos have 30 different lexemes for snow, for example).
The meanings assigned to a lexeme in a particular culture may be non-literal or nonsensical. As an example, many Americans might answer the question “What is the town so nice they named it twice?” with “New York”, because New York is commonly associated with the song “New York, New York” in most Americans’ minds, rather than with a more literal answer that another person from our culture with a map might give (such as Walla Walla, Washington).
Lexemes can also have implications, where meaning is intended but not communicated directly. Lexeme and word implications can be extremely complex depending on the context of the setting in which they are used. Human cultures have developed inference as a very detailed and highly accurate form of communication, since a message can be phrased in such a way as to convey a meaning while avoiding a negative emotional response. Any theory that takes into account implications to determine meaning must deal with the wide variance in how people use implications.
As an example, the following are expressions a spouse might use to imply that they want to leave a party[7]:
What time is it?
This sure is boring.
Don’t you have to be up early tomorrow?
Are you going to have another drink?
Do you think the Millers would give you a ride home?
You look ready to party all night.
Did you get a chance to talk to everyone you wanted to?
Is it starting to rain?
It looks like it stopped raining.
The list could go on indefinitely with significant variation, and could vary by culture, geographic area and even by the distinct interpersonal relationship between two people (a mini-culture of two). All contain different factual information, but have a shared implied meaning of “I want to go now”. This meaning is translatable in the context of a communication between spouses at a party.
The difference between syntax, pragmatics and semantics can be subtle. Semantics is generally restricted to lexemes that remain constant within the same language, while pragmatic meanings vary from context to context. For instance, the concept of 'cat' doesn’t vary across different contexts (there’s no pragmatic difference), but what 'local wildlife' refers to varies from one context to another (there’s a pragmatic difference).
The differences can also be demonstrated by looking at examples of different “failed” sentences and the level at which each fails to convey a clear meaning, as in the following:
| Sentence Example[8] | Syntax | Semantic | Pragmatic |
|---|---|---|---|
| Good John to idea. | Bad structure | N/A | N/A |
| Odorless noisy perfumes throw tantrums. | OK | Meaningless | N/A |
| Christopher Columbus was the first man in space. | OK | False | N/A |
| “Is John a good employee?” “Well, he shows up.” | OK | OK | Indirect meaning |
| “Bill is one of my two best students” (the professor only has two students) | OK | OK | Indirect meaning |
Table 2: Examples of sentences that fail to contain a clear meaning in the areas of Syntax, Semantics and Pragmatics
SAPI does not have any built-in support for dealing with the problems inherent to understanding inference. It would be hard to picture what support, if any, Microsoft or any other manufacturer could provide. The application of the pragmatic level to any speech recognition system is still in its infancy. At best, speech recognition engines can deal with very specific pragmatic differences by means of the “lookup table” approach used to determine semantic differences, which may increase the understanding of speech in very specialized environments.
[9] The second area of pragmatics deals with episodes, the sequences of standardized utterances that we use in conversation. Episodes become part of the context for the individual utterance. As an example, Jack and Jill are two university students who have never met each other. In one of their classes they sit close to each other, start talking and recognize that they are attracted to each other. After they get to know each other, someone tells Jill that Jack is going to phone her that evening about a date. That night, Jill gets a phone call from Jack, who says “Hey, how're you doing this evening?” Jill has a preconceived notion of where this is leading: she knows the episode. Finally, Jack says “Can I ask you a personal question?” and Jill thinks “He’s going to ask me out on a date”. Jack says “Do you think you can ask your roommate if she’s interested in going out on a date with me?” Jill suddenly realizes that this is not an “asking out” episode; this is a “set me up” episode. She has to recalibrate the meaning of everything he said.
Most cultures have a lot of shared episodes. Another example is the difference between walking into a store where the salesperson is on commission versus a store where the salesperson is not on commission: the episodes are different, and what is said, and in what order, is quite different. These notions are preconceived because we share them. They are crucial because they allow us to communicate on autopilot. If we had to reason our way through every episode, even a simple “greeting”, with every person we meet, it would drain us. The absence of shared episodes is the basic cause of culture shock, the disorientation that people feel when they encounter cultures radically different from their own. A person may be able to speak the language in a straightforward way but have limited social competence because of a lack of knowledge of cultural episodes.
Conversations go beyond standardized episodes. Several episodes can be happening simultaneously, with some of the episodes not adhering to a preconceived script. People in these situations have to reason their way through the process to determine not only what they believe is happening, but also what they think the other person believes is happening. When people involved in a conversation don’t have a common understanding, the coordination of shared meaning becomes much more complex.
This level of interaction increases the complexity of “true” computer speech recognition tremendously. As an example, if a computer started a conversation with “May I talk with you?” and the response was “How much time do you need?”, the computer would first have to recognize that the response doesn’t directly answer its question, since the form of the response is not yes or no. One suggested method for a computer to deal with this situation is to recognize that the person asking the question should know the answer, that this is not a real question.
If the response is not a real question, then the computer would need to recognize that this is a situation where the answer to the person’s question is the answer to the computer’s question. The alternative is to note that this question may be off the subject of the original question. The computer must determine the question’s legitimacy. To do this, the computer must not only know that the person speaking knows the answer, but that the person speaking must know that the computer knows the answer as well, in short that this is common knowledge.
One of the suggestions for a computer to handle this situation is to reference a cultural database, where each item in the database has a general co-orientation. Using the cultural database, the computer can determine that a question is a rhetorical question and that the correct algorithm would be to apply the answer to the previous question. The cultural database would be huge. The obvious challenge is to approximate this with a high degree of success.
Technology is rapidly approaching this level, especially if the applied situation is very specific. The computer doesn’t need to know all cultural references, just the ones common to a certain corpus of conversations, such as the shared episodes that are common to counseling sessions between pharmacists and patients. The more restricted the context, the more realistic this approach becomes.
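The decision steps described above can be sketched for a very restricted context. This is only an illustration, not an established method: the tiny `CULTURAL_DB` table, the `DIRECT_ANSWERS` set, the function name `interpret_response`, and the category labels are all hypothetical stand-ins for the much larger cultural database the text describes.

```python
# Hypothetical stand-in for a cultural database restricted to one context.
# For each opening question it lists counter-questions that, by shared
# convention, are not "real" questions but replies to the opening question.
CULTURAL_DB = {
    "may i talk with you?": {
        "how much time do you need?",
    },
}

# Assumed set of responses that directly answer a yes/no question.
DIRECT_ANSWERS = {"yes", "no"}


def interpret_response(question, response):
    """Classify a response to a yes/no question (illustrative labels only)."""
    q = question.strip().lower()
    r = response.strip().lower()

    # Step 1: does the response have the form of a yes/no answer?
    if r.rstrip(".!") in DIRECT_ANSWERS:
        return "direct answer"

    # Step 2: a counter-question listed in the database is treated as
    # rhetorical, so its answer is applied to the previous question.
    if r.endswith("?"):
        if r in CULTURAL_DB.get(q, set()):
            return "rhetorical: apply to previous question"
        # Step 3: otherwise it may be off the subject of the original question.
        return "possibly off subject"

    return "unrecognized"
```

With a table this small the approach only works for the single episode it lists, which is the point made above: the narrower the corpus of conversations, the smaller the database has to be.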
Humans interact with each other verbally in a very rich, multilayered fashion. Linguistics, the study of these layers, can be used to broaden a computer’s versatility to understand human speech.
The levels of linguistics are defined quite differently, especially in modern linguistic theories, depending on the purpose of the person studying them. The purpose of most linguistic researchers is different from the purpose of most computer scientists who are studying natural language processing, because the linguists want to encapsulate all knowledge into a universal “theory of speech”. Even if realized, this theory may be computationally impossible to implement. However, if approached in a limited context, the layers of linguistics can significantly increase the accuracy of not only understanding the words a speaker is saying, but also what they mean.
This is the type of grammar that you might find in an information kiosk at either the Olympics or an international pavilion in a major airport. The purpose of such a kiosk would be to provide automated assistance to visitors on a variety of topics. Its semantic grammar is as follows:
kiosk -> *greeting1 *greeting2 sentence1
       | *greeting1 sentence2
greeting1 -> hello | excuse me | excuse me but
greeting2 -> can you tell me
           | I need to know
           | please tell me
sentence1 -> where destination1 is located
           | where is destination1
           | where am I
           | when will transportation *destination2 arrive
           | when transportation *destination2 will arrive
           | what time it is
           | the local time
           | the phone number of destination1
           | the cost of transportation *destination2
sentence2 -> I am lost
           | I need help
           | please help me
           | help
           | help me
           | help me please
destination1 -> a restaurant
              | the RestaurantType restaurant
              | *BusinessType BusinessName
RestaurantType -> best | nearest | cheapest | fastest
BusinessType -> a | the nearest
BusinessName -> filling station
              | public rest room
              | police station
transportation -> the *TransportType TransportName
TransportType -> next | first | last
TransportName -> bus | train
destination2 -> to metro central
              | to union station
              | to downtown
              | to national airport
This grammar allows you to generate 2664 sentences. Here are a few of them:
Here is the derivation of one of these sentences, based on the kiosk grammar:
when will the next bus to union station arrive
kiosk -> sentence1
sentence1 -> when will <transportation> <destination2> arrive
transportation -> the <TransportType?> <TransportName>
TransportType -> next
TransportName -> bus
destination2 -> to union station
In the kiosk example, only six different categories of information are requested:
The sort of information that might be returned from the parsing of a sentence using this grammar is as follows:
when will the next bus to union station arrive
request(schedule(type : arrival, time : X, transport(type : bus, qualifier : next), location(destination : union_station)))
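As a rough illustration of how such a structure could be filled in, the sketch below handles only the two "when ... arrive" rules of the kiosk grammar and returns the frame as nested Python dictionaries. The function name `parse_arrival_request`, the use of regular expressions, and the use of `None` for the unbound time X are assumptions made for this example, not part of the original grammar.

```python
import re

# Fragments mirroring the transportation and destination2 rules of the grammar.
TRANSPORT = r"the (?:(?P<qualifier>next|first|last) )?(?P<type>bus|train)"
DEST = (r"(?: to (?P<destination>metro central|union station"
        r"|downtown|national airport))?")

# One pattern per "when ... arrive" alternative of sentence1.
ARRIVAL_PATTERNS = [
    re.compile(rf"when will {TRANSPORT}{DEST} arrive$"),
    re.compile(rf"when {TRANSPORT}{DEST} will arrive$"),
]


def parse_arrival_request(sentence):
    """Return a nested frame for an arrival question, or None if no rule fits."""
    text = sentence.strip()
    for pattern in ARRIVAL_PATTERNS:
        match = pattern.search(text)  # search() lets optional greetings precede
        if match:
            destination = match.group("destination")
            return {
                "request": {
                    "schedule": {
                        "type": "arrival",
                        "time": None,  # the unknown X that the kiosk must supply
                        "transport": {
                            "type": match.group("type"),
                            "qualifier": match.group("qualifier"),
                        },
                        "location": {
                            "destination": destination.replace(" ", "_")
                            if destination else None,
                        },
                    }
                }
            }
    return None


frame = parse_arrival_request("when will the next bus to union station arrive")
```

Here `frame["request"]["schedule"]["transport"]` holds the bus and its qualifier, and the destination is stored as `union_station` to match the notation above.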
The use of the * operator causes the symbol to its immediate right to be defined as optional, as in *greeting1 above. If the symbol is a terminal symbol the operator causes it to be defined as an optional word. If the symbol is a non-terminal symbol the operator causes all of the clauses defined by that non-terminal symbol to be treated as optional. The * operator cannot be used on the left side of a production rule.
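The effect of the * operator, and the figure of 2664 sentences, can be checked with a short sketch that expands the kiosk grammar exhaustively. The dictionary encoding and the helper name `expand` are choices made for this illustration; a symbol written with a leading * is treated as optional by also allowing the empty string in its place.

```python
from itertools import product

# The kiosk grammar, one list of alternatives per non-terminal.
GRAMMAR = {
    "kiosk": ["*greeting1 *greeting2 sentence1", "*greeting1 sentence2"],
    "greeting1": ["hello", "excuse me", "excuse me but"],
    "greeting2": ["can you tell me", "I need to know", "please tell me"],
    "sentence1": [
        "where destination1 is located",
        "where is destination1",
        "where am I",
        "when will transportation *destination2 arrive",
        "when transportation *destination2 will arrive",
        "what time it is",
        "the local time",
        "the phone number of destination1",
        "the cost of transportation *destination2",
    ],
    "sentence2": ["I am lost", "I need help", "please help me",
                  "help", "help me", "help me please"],
    "destination1": ["a restaurant", "the RestaurantType restaurant",
                     "*BusinessType BusinessName"],
    "RestaurantType": ["best", "nearest", "cheapest", "fastest"],
    "BusinessType": ["a", "the nearest"],
    "BusinessName": ["filling station", "public rest room", "police station"],
    "transportation": ["the *TransportType TransportName"],
    "TransportType": ["next", "first", "last"],
    "TransportName": ["bus", "train"],
    "destination2": ["to metro central", "to union station",
                     "to downtown", "to national airport"],
}


def expand(symbol):
    """Return every word string that can be derived from a symbol."""
    optional = symbol.startswith("*")
    name = symbol.lstrip("*")
    if name not in GRAMMAR:
        results = [name]  # terminal word
    else:
        results = []
        for alternative in GRAMMAR[name]:
            parts = [expand(token) for token in alternative.split()]
            for combo in product(*parts):
                # Skip the empty strings left by omitted optional symbols.
                results.append(" ".join(word for word in combo if word))
    if optional:
        results = results + [""]  # the symbol may also be left out
    return results


sentences = expand("kiosk")
print(len(sentences))  # 2664
```

The total agrees with the count stated above, and the derived sentence "when will the next bus to union station arrive" is among the results.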
[1] To turn on this interface in Office XP or 2003, start Microsoft Word and click on Tools, Speech.
[2] “You start off with about 70% accuracy … after you have a good voice profile, you can expect … 90% accuracy.” http://www.helpdesksolutions.com/Publications/voice_rec.htm, accessed on 7/23/4.
[3] From http://www.infj.ulst.ac.uk/nlp/docs/Trees.htm, accessed on
[4] “It is, however, difficult to get inference of this sort to work for more than a few examples except in very small domains. In general, such high-powered interlingua-based techniques are not used in practice.” Jurafsky, D, Martin, J, “Speech and Language Processing”, p. 814.
[5] The example is taken from King, S “The Dead Zone”, 3/27/2000
[6] http://fmc.utm.edu/~lalexand/left_hand.htm, accessed 8/2/4
[7] Knapp, ML, Daly, JA,“Handbook of Interpersonal Communications”, 3rd Edition, SAGE Publications, 2002
[8] http://www.trinity.edu/cbrown/language/distinctions.html, referenced 8/2/4
[9] Conversation with Dr. Villaume on 7/14/2004
[10] From http://www.infj.ulst.ac.uk/nlp/docs/semantic_grammar.htm, accessed