An annotated bibliography of natural language and speech understanding systems

Abstract : This annotated bibliography summarizes some 80 papers dealing with various aspects of natural language and speech understanding systems. Most detail a working system which 'understands' 'natural' input. Stress is on those issues at or above the level of syntax. Also included are several overviews and criticisms, usually in the form of comparative studies. (Author)

Introduction This annotated bibliography surveys but a small portion of the literature available in two related artificial intelligence research efforts. It encompasses many (but not all) natural language and speech understanding systems, and a sampling of the related research issues and criticisms.
In comparing speech and natural language systems, contrast rather than similarity appears to be the proper theme. Though all such systems aspire to the common goal of uncontrived man-machine communication, there seem to be only a few system characteristics that are universally shared. Much of the lack of concordance is due to the fields' youthfulness. A great deal of the research dates only after the ARPA speech study group report [Newell et al., 1971] and Winograd's natural language thesis [Winograd, 1972]. Since these two landmark papers, much has been done, but little dogma has developed. As an example, all books in this bibliography, but one, are solely collections of separate papers or are monographs on single systems. The exception, a textbook on natural language [Charniak & Wilks], begins with the curious disclaimer that the authors "disagree with each other on quite fundamental issues".
Neither youth nor disagreement, however, has deterred the production of a prodigious quantity of research. It is impossible to survey it all. Only working systems which understand natural input are included here. More specifically, this selectional criterion has three parts, each of which eliminates many otherwise interesting and appropriate papers. The stress on working systems rules out purely theoretical work, or those reports that detail proposed approaches or designs.
Emphasis on understanding, as opposed to recognition, culls isolated word recognizers and purely syntactic parsers. A demand for natural input chiefly circumscribes various memory model efforts, which often presume a transducing "front end".
In addition, many other important and pertinent issues are considered to be merely peripheral to this paper. Among these are the volumes of work on the signal processing aspects of speech recognition, much related computational linguistics, the theories and programs for inferencing and theorem proving, and the essential concerns of knowledge acquisition and representation. What remains is a stress on those most communal aspects of complete, interactive understanding systems, namely those issues somewhere at or "above" the level of syntax. To motivate the abounding contrast found between speech and natural language systems, some criticism of the differing philosophies and methodologies of the two fields is also included. It is in these latter domains that the two disciplines seem to differ most sharply.
speaking programs. "Many of the principal researchers change their views on very fundamental questions between one paper and the next." "Criticism and comparisons are best drawn with a very broad brush and a light stroke." Winograd's system presented.
Grammar is a collection of small subprograms: procedures for imposing the desired syntactic structure. Heterarchical organization.
Discussion of Shrdlu: 1) The linguistic system is highly conservative; its syntax and semantics distinction is unnecessary.
2) The semantics, tied to blocks world, is inextensible; blocks world is deductive and closed. One view is that Shrdlu is not about natural language, but about organization of goals. For example, the Planner code for "pick up" is not a sense of "pick up", but a case of its use. 3) Woods and Winograd both agree their formalisms are equivalent: both are grammar based deductive systems, operating in a question-answering environment, in a highly limited domain of discourse. Winograd's programs' Planner "suggestions" are like Woods' arc choices.
However, Woods' position, that an assertion has no meaning if his system cannot establish its truth or falsity, is extreme.
Background issues in natural language: 1) Chomsky's insistence on competence models has isolated generative linguists from any effective test (i.e., performance). 2) "It may well turn out that the most appropriate form for plausible reasoning in order to understand is indeed non-deductive." 3) Procedural knowledge is not as perspicuous as declarative knowledge.
"Second generation" systems reviewed; characterized by the belief that understanding systems must be able to manipulate very complex linguistic objects. They are frame-like systems, which attempt to specify in advance how the world is structured.
1) Charniak: Understanding as pronoun resolution; based on partial (not necessarily true) information. Information in demons is highly specific (i.e., piggy banks, not containers). Charniak assumes "decoupling": that semantics and applications can be studied independently.
2) Colby: Parry is most used AI program. No syntax analysis; segmentation, then pattern match of segments, using 1700 rules. Patterns are tied directly to responses. Does Parry understand? "Many people on many occasions do seem to understand in the way that Parry does."
3) Simmons: Uses semantic nets of deep case relations, extended by paraphrase rules (e.g. "sell" and "buy" are considered forms of "transfer", etc.). Can be mapped into first order predicate calculus, for inductions.
4) Schank: Based on "dependency grammar" of Hays, has four conceptual categories (noun, verb, adjective, adverb), four cases, fourteen acts. Dictionary entries for verbs can be considered frames, seeking slot-filling items from context. Includes a theory of human mental acts: the representation of "John advised Mary" includes representations of Mary being pleased. Criticizes the stages of development of the system: ". . . consistent process of producing what was argued for in advance. ... At each stage the representation has been claimed, in firm tones, to be the correct one." Some problems: Word sense and prepositional ambiguity not addressed; primitives for only verbs and (possibly) nouns.
5) Wilks system: English-to-French translation task; is "reasonably robust"; based on preference. Templates are of the form: agent-action-object. Prepositions handled by templates of the form: dummy-action(preposition)-object. System never generates a deeper semantic representation than is necessary. Problems: 1) "Codings consisting entirely of primitives have a considerable amount of both vagueness and redundancy" (e.g. "hammer" and "mallet" indistinguishable). 2) Stability under large vocabulary questionable. Claims system is topologically similar to Schank; the heads of Wilks' formulae are like Schank's basic actions. However, Wilks' representation contains, by virtue of his word formulae, more information about what was anticipated.
Summary of the second generation systems. Two research styles apparent: finished product, and the developing system. Comparisons are hard to make due to a lack of precise theory in most systems. Compares them, however, in eight separate dimensions.

1) Levels of representation. Either language is represented by itself, or by primitives. Colby uses English directly and has enormous mapping problems. The ultimate defense of representation is perspicuity. Plausibility of Wilks' primitives defended by their similarities to the dictionary primitives of Webster.
2) Centrality. Specific or general knowledge: what leads to greater progress?
3) Phenomenological level. Pursuit of inference beyond "commonsense" is excessive (a comment aimed at Schank).
4) Decoupling. Can the parsing be considered separately from inference? (Charniak uses precoded structures, not natural language). Says no; parsing requires inference, as shown by the success of his and Schank's semantic-based analyses.
5) Availability of surface structure. Appears sometimes necessary to include it, to preserve word sense (e.g. "nail", "screw", "peg" otherwise indistinguishable).
6) Application: perspicuity of procedures best in Winograd, worst in Schank and Charniak. Strongest objection is with case assignment of prepositions, which is not a mere "implementation issue" to be assumed.
7) Forward inference. As much as possible (Schank), or as little as necessary (Wilks)? Control problems occur with the former approach.
8) Justification of systems. Usually done on the following grounds: a) by the power of the inference system b) by the provision and formalization of knowledge c) by actual performance d) by psychological plausibility. Each system defines a natural language. Question is: How much is it like English?
Conclusion: What is needed is: good memory models, a theory for (multi-sentence) text, and a more sophisticated theory of causation. Also, error recovery from false expectations (as compared with the closed world where all analyses are immediately verifiable). Also, the ability to combine highly specific knowledge with general knowledge. Basic thrusts of AI-based natural language: 1) Theories must be programmable. 2) Theories must deal with language in a communicative context. 3) Theories must formalize and organize knowledge. [Wilks, 1976] Y. Wilks, "Parsing English," Charniak & Wilks, pp. 89-100 & 155-184, 1976.
Two chapters of a textbook, based on the above paper, with addition of considerable detail on Wilks' system. Natural language systems divided into content-motivated (e.g. Shrdlu) and structurally-motivated (Student). The former attempts to deal with the three types of natural language ambiguity: word sense, structural (e.g. prepositional) and referential (anaphora). The latter justifies its mechanisms by the function they serve in the problem domain. Shrdlu discussed, with some amplification of mechanism.
Additions to above paper: Parry is easier to extend than most programs, but fragile in that only paranoids are permitted to act the way Parry does. Adds the following to comparison section: 1) Levels of representation. Schank only has primitives for verbs. 4) Decoupling. "Parsing is essential, so it cannot be decoupled; it defines the significance of the semantic structure." 7) Forward inference: Rieger limits inferences simply by numerical cutoff.
Adds 9) Modularity. Winograd's is a three way heterarchy, while Schank and Wilks integrate syntax and semantics. 10) Scale of representation. "Representations must be justified in terms of some concrete problem that they solve." Large scale frames have so far only been justified by the "plot line hypothesis": that is, stories are only understood vis-a-vis a basic story type (a stance open to debate). 11) Real world procedures. The implicit hypothesis of much work: It is better to concentrate on the representation of human activities we know how to perform; one cannot understand language about activities that one cannot perform. But what is it that the nonperformer does not understand? Outlines Wilks' system: it converts a paragraph of text into a "semantic block".
Processing steps: 1) Fragments sentences at key words, and words are expanded into their formulae.
2) Templates match formulae heads of fragments, followed by preferential expansion of all matched templates. The "most preferred" is chosen, where "preferred" means semantically most interconnected. 3) Inference, if necessary: paraplates rejoin the templates of a sentence. Usually, this means reattaching prepositional phrases, or resolving of some anaphora. Accomplished by semantic filters, one for each sense of the preposition, ordered according to preference.
Pronoun resolution is not a well structured problem; both general and specific solutions are necessary. 4) If pronouns are unresolved by paraplates, then inference.
Two types: action formulae (i.e., verbs) create new templates, by filling out all required grammar cases as if satisfied (e.g. if an action has a goal, assume it has been achieved).
Secondly, templates can generate "common sense" templates; the shortest chain of linked templates is preferred. Thus, the system uses preference at template level, paraplate level, and inference level. Problem of control of inference not addressed.
Criticizes Riesbeck: his system is based on expectations, and their satisfaction as soon as possible. Riesbeck claims that expectations are unordered. System is based on a "phenomenological fallacy" that assumes that since humans are never conscious of alternate parses, neither should machines be. Note that it is a surface-oriented parser: it is verbs seeking prepositions (not basic actions seeking cases). Riesbeck has no backup; cannot handle easily constructed counterexamples (e.g. "John gave Mary to the bridegroom.").
A synthesis of several case structure systems.
A formalization of case structure systems in first order logic, in order to formulate, analyze, and compare case structure systems. Includes the use of "case signals" (e.g. prepositions) and "case conditions" (e.g. semantic filters on noun "features"). Causation and purpose are considered "cases". Shows how Fillmore ("every language has a deep case structure" of at least six cases), Simmons (five cases), Schank (four "dependents"), Shows (informally) that semantic nets and predicate calculus are similar, and that semantic nets are to be preferred for computational reasons. However, only shows the equivalence for a subset of semantic nets admitted to be inadequate for natural language understanding. Semantic nets are better for handling "vague" terms like "some".
A summary of the longer frames paper stressing natural language issues.
A frame is a data structure for representing a stereotyped situation. The top nodes are fixed, but the lower levels ("slots") are weakly filled with default values; they can be replaced, but always subject to certain conditions on what can fill them. Different frames of a frame system share the same terminals. Recognition consists of the selection of frames (with respect to goals), and the filling in of slots with data. Claims, after a point, processing is serial with large symbols rather than parallel with much data. Generative grammars are to frame rules as transformational grammars are to frame system transformations. Any type of change can be modelled by before-after frame pairs. A frame also includes as part of its data the most serious anticipated problems associated with the stereotyped scenario they handle.
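The slot-and-default idea summarized above can be illustrated with a minimal Python sketch. It assumes a frame is simply a named set of slots, each carrying a default value and a condition on what may replace it. The class name `Frame`, the slot names, and the birthday-party values are hypothetical illustrations, not structures taken from the paper.

```python
class Frame:
    """A stereotyped situation: named slots with defaults and filler conditions."""

    def __init__(self, name, slots):
        # slots maps slot name -> (default value, condition on acceptable fillers)
        self.name = name
        self.conditions = {s: cond for s, (_, cond) in slots.items()}
        self.values = {s: default for s, (default, _) in slots.items()}

    def fill(self, slot, value):
        """Replace a default only if the value meets the slot's condition."""
        if self.conditions[slot](value):
            self.values[slot] = value
            return True
        return False


# Hypothetical example frame; the slots and defaults are invented for illustration.
party = Frame("birthday-party", {
    "gift": ("toy", lambda v: isinstance(v, str)),
    "guests": (5, lambda v: isinstance(v, int) and v > 0),
})
party.fill("gift", "kite")   # accepted: replaces the default
party.fill("guests", -2)     # rejected: the default of 5 is kept
```

The sketch covers only default filling; frame selection with respect to goals and the sharing of terminals between frames of a system are left out.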
Frames are connected in a "similarity network". In the network a difference arc connects two frames together; the arc is labelled with the one difference between the frames. Thus, such similarity nets tend to cluster into "villages" centered around frame "capitols" from which their distance is small. Therefore, a stereotype is a capitol; that is, a central representative frame. Suggests that instead of trying to reduce problem space searches, one should rather re-represent the space.

Pattern Matching Systems
Program is a keyword-based simulation of a Rogerian psychotherapist. Input sentences are transformed according to a rule associated with the keyword; handles single sentences only (rest omitted). Program is a simple driver, and a "script" of data (keywords, their rank, and their transformations). Pattern-matches on keywords of input; certain input phrases are carried over into output. Some transformations are mandatory (e.g. "I" -> "you"). Reassembly rules are used sequentially, then reused in course of conversation. Dynamically creates and stores extra transformations to be used when no keyword is present (e.g. "Earlier you said that . . .").
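A keyword-driven psychotherapist program of the kind described in this section (a small driver plus a "script" of ranked keywords and their transformations) can be sketched roughly as follows. The script entries, the word swaps, the fallback reply, and the class name `Therapist` are all hypothetical; the original script is not reproduced in this bibliography.

```python
import re

# Hypothetical script: keyword -> (rank, pattern, reassembly rules used in turn)
SCRIPT = {
    "mother": (2, r".*\bmy mother (.*)", ["Tell me more about your family.",
                                          "Why do you say your mother {0}?"]),
    "i am":   (1, r".*\bi am (.*)",      ["How long have you been {0}?"]),
}
# Mandatory word-level transformations applied to carried-over phrases.
SWAPS = {"i": "you", "my": "your", "me": "you", "am": "are"}


class Therapist:
    def __init__(self):
        self.next_rule = {k: 0 for k in SCRIPT}  # cycles through reassembly rules
        self.memory = []                          # phrases saved for later reuse

    def _swap(self, phrase):
        return " ".join(SWAPS.get(w, w) for w in phrase.split())

    def reply(self, sentence):
        text = sentence.lower().strip(".!? ")
        found = [k for k in SCRIPT if k in text]
        if not found:
            # No keyword present: fall back on a stored transformation.
            if self.memory:
                return "Earlier you said that " + self.memory.pop(0) + "."
            return "Please go on."
        key = max(found, key=lambda k: SCRIPT[k][0])   # highest-ranked keyword
        _, pattern, rules = SCRIPT[key]
        match = re.match(pattern, text)
        if match is None:
            return "Please go on."
        phrase = self._swap(match.group(1))
        self.memory.append(self._swap(text))
        rule = rules[self.next_rule[key] % len(rules)]
        self.next_rule[key] += 1
        return rule.format(phrase)
```

With this toy script, `Therapist().reply("I am unhappy.")` yields "How long have you been unhappy?", and a later input with no keyword draws on the stored phrase. Nothing in the sketch models the user, which is the limitation the annotation itself points out.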
Domain was chosen since a psychotherapist is free to assume pose of knowing almost nothing. Success depends on much favorable interpretation by user: "Shows how easy it is to create and maintain the illusion of understanding." Needs a user model; presently is merely a "translating processor". [Bobrow, 1968] D. G. Bobrow, "Natural Language Input for a Computer Problem-Solving System," Minsky, pp. 146-226, 1968.

STUDENT
A presentation of the algebra word problem solver.
Task is algebra story problems; written in LISP, with some added string processing functions. "Understanding" is taken to be exhibited by question-answering. Surveys several previous natural language programs. Claims to be the first implementation of "discourse analysis" (connected sentences).
Program uses "kernel sentences" and transformations on them. Assumes a naive user model: "What would I have meant if I had said that." Searches for instances of arithmetical operations; all the rest is considered "simple names" of variables.
Solutions depend on resolving anaphora via pattern match, and via global knowledge (mathematical relations on the property lists of key word atoms). Processing consists of tagging of words by function (e.g. operator, or variable), and breaking sentences into kernel sentences by a primitive pattern match on "sentence formats" (i.e., connectives such as ", and"). Operator precedence rules then restructure the equations. One problem: Transformations are strictly order-dependent. [Colby, 1973] K. M. Colby, "Simulations of Belief Systems," Schank & Colby, pp. 251-286, 1973.

PARRY
An overview of work on belief systems, featuring Parry and a summary of its validation using Turing indistinguishability tests.
Seeks belief systems which are "i-o equivalent", but can have different physical processes. Seeks "parallelism of behavior at some level". Human credibility does not follow strict mathematical axioms.
Outlines three predecessors of Parry. System 1: Neurotic belief. System altered the output of expressions of its beliefs, based on perceived internal conflicts.
Abandoned, as belief base was thin, and there was no way to measure its neurosis.
(Psychiatrists do not agree on "neurosis", but do agree on paranoia). System 2: Normal belief system. Domain is parent-child relations. Includes beliefs and "rules" (relations between beliefs and belief-classes). Data base sparsely related, though large; abandoned due to too much unconstrained search. System 3: Artificial belief systems.
Credibility is assigned to new statements as a function of source, direct evidence, foundation beliefs, and consistency. But bogs down in search through a space of several thousand beliefs.
Parry is a simulated individual with a fixed set of malevolent delusions. Contains a context-free semantic grammar of "perceived intentions" of interviewer, which can be malevolent, benevolent, or neutral. Also has "flare" concepts which activate the delusional complex.
Input is classified by the grammar, and 1) internal values of affect (fear, anger, mistrust) are modified, and 2) output is produced (counterattack if angry, withdraw otherwise). Beliefs are here procedurally encoded as internal and external responses.
Input is based on key words and rewrite rules: words are mapped into conceptual classes. Clauses and some other linguistic phenomena not handled. Hard part is input strategy: when to pursue current context. Heuristics are used; for example, if no new topic has been mentioned, look for an extension of previous concepts. Fear and anger are fluid, mistrust is not; simple mathematical formulae modify their values. Key word understanding simulates paranoids' ignoring of context when flare words occur. Also, paranoids are rigid, like the program. Uses canned responses of sentence length, some with variables that can be assigned to flare concepts.
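The loop described above (classify the input by key words, update the affect variables, then choose a canned response) might be sketched as below. Every specific here is an assumption made for illustration: the word lists, the decay factor of 0.5, the increments, the comparison of anger against fear, and the response strings. The annotation states only that fear and anger are fluid while mistrust is not, and that simple formulae modify their values.

```python
# Hypothetical keyword classes; the real program's vocabulary is not given here.
MALEVOLENT = {"crazy", "liar", "stupid"}
BENEVOLENT = {"help", "sorry", "understand"}
FLARE = {"mafia", "police", "bookie"}


class Paranoid:
    def __init__(self):
        self.fear = self.anger = self.mistrust = 0.0

    def classify(self, sentence):
        words = set(sentence.lower().replace("?", "").replace(".", "").split())
        if words & FLARE:
            return "flare"          # flare words override everything else
        if words & MALEVOLENT:
            return "malevolent"
        if words & BENEVOLENT:
            return "benevolent"
        return "neutral"

    def respond(self, sentence):
        kind = self.classify(sentence)
        # Fear and anger are fluid: they decay each turn. Mistrust only grows.
        self.fear *= 0.5
        self.anger *= 0.5
        if kind == "malevolent":
            self.anger += 1.0
            self.mistrust += 0.5
        elif kind == "flare":
            self.fear += 1.0
            self.mistrust += 0.5
        if kind in ("malevolent", "flare"):
            if self.anger >= self.fear:
                return "You have no right to say that."   # counterattack
            return "I would rather not talk about it."    # withdraw
        return "Go on."
```

Matching on isolated words and ignoring the rest of the sentence mirrors the point made above that key word understanding simulates a paranoid's disregard of context when flare words occur.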
Validation of model by means of Turing indistinguishability tests (reported below). Asserts the chief challenge is the widening of the scope of the model. [Colby et al., 1971] K. M. Colby, S. Weber, and F. D. Hilf, "Artificial Paranoia," Artificial Intelligence, Vol. 2, pp. 1-25, April, 1971.
A paper similar to the above, with some added detail on the semantics of the system.
Simple input is assumed; compound and complex sentences not handled well. Uses a keyword-based mapping of input into predications on an attribute of an object, or predications on a relation of the object to another object. "A combination of "you" or "your" with some form of the attribute, plus optionally another object or assisting concept will adequately convey the meaning." Data base is ordered so that object concepts occur before attribute concepts (distinguishes "your parents' residence" from "your residence"). Conceptual classes contain differing parts of speech ("work", "occupation") for ease in pattern matching. Uses a special scanner for specific grammar-based items: "I", "you", "me", metaverbs (e.g. "think"), positive or negative attitude tokens, passive forms, etc. Lists limits of approach: 1) Control flow is primarily syntactic; a heterarchy of syntax and semantics is more psychologically plausible. 2) Only a primitive use is made of context and of discourse rules. [Miller, 1975] R. L. Miller, "An Adaptive Natural Language System that Listens, Asks, and Learns," IJCAI4, pp. 406-413, 1975.

Miller
A learning natural language program based on the microworld of tick-tack-toe.
Plays tick-tack-toe; uses contextual evidence, and asks questions of user, to determine the meaning of new term. Similar to speech acoustic error: linguistic errors are corrected using "higher level knowledge". Has fixed semantic concepts, but learns new descriptions of them. Carefully lists the program's limitations.
Levels of processing: local syntactic, semantic clustering, cluster expansion and connection (finds unknown words), contextual inference (possible only since the class of semantic primitives is very small). Claims methods are domain-independent. Syntax used as an aid in semantic clustering. Utilizes surface "frames" for each concept, containing the verb and its necessary verb cases. Meanings of unknown words are deduced by best match to frames available. Keeps a process history to answer "why" questions; the history acts as a semantic filter on new terms, also, by limiting interpretations. Clauses that are known with certainty help resolve uncertain ones in the same sentence, by establishing a board position, for example. Sufficient restraints in the resolution of unknown words can be coded because of the complete knowledge of the domain. [Woods, 1970] W. A. Woods, "Transition Network Grammars for Natural Language Analysis," Comm. ACM, Vol. 13, No. 10, pp. 591-606, October, 1970. [Heidorn, 1975] G. Heidorn, "Augmented Phrase Structure Grammars," Schank & Nash-Webber, pp. 1-5, 1975.

Describes a parsing and generating scheme independent of, but much like, augmented transition networks.
Traditional phrase structure rules are augmented by conditions and structure building actions; the data structures allow consistent decoding and encoding of natural language.
Word "records" are sets of attribute-value pairs (like LISP atoms).
"Segment records" are used for segments of text, and are joined together via encoding rules.
Encoding rules are productions: left side matches segments, and right side prescribes new segment records. Rules match on equality of record attribute values.
Transformations consist of setting attributes in new records to either new values, or pointers to records. Thus, it can incorporate semantic relations via prestored database records.
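The record-and-production scheme outlined above can be made concrete with a short sketch. It assumes records are plain attribute-value dictionaries and that each rule pairs a left side (equality conditions on adjacent records) with a right side that builds a new segment record holding pointers to its constituents. The attribute names ("cat", "head", "agent"), the two rules, and the sample words are hypothetical and are not taken from Heidorn's system.

```python
# A rule: (conditions on each constituent, builder for the new segment record).
# Attribute names ("cat", "head", "agent") are assumptions for illustration.
RULES = [
    # NOUN -> NP
    ([{"cat": "noun"}],
     lambda parts: {"cat": "np", "head": parts[0]}),
    # NP VERB -> CLAUSE, with a pointer to the subject record as "agent"
    ([{"cat": "np"}, {"cat": "verb"}],
     lambda parts: {"cat": "clause", "head": parts[1], "agent": parts[0]}),
]


def matches(record, condition):
    """A record satisfies a condition if every listed attribute is equal."""
    return all(record.get(attr) == value for attr, value in condition.items())


def decode(records):
    """Left-to-right, bottom-up reduction until no rule applies."""
    records = list(records)
    changed = True
    while changed:
        changed = False
        for conditions, build in RULES:
            n = len(conditions)
            for i in range(len(records) - n + 1):
                window = records[i:i + n]
                if all(matches(r, c) for r, c in zip(window, conditions)):
                    records[i:i + n] = [build(window)]
                    changed = True
                    break
            if changed:
                break
    return records


words = [{"cat": "noun", "word": "vehicles"}, {"cat": "verb", "word": "arrive"}]
result = decode(words)
```

This greedy reduction commits to the first rule that applies and never backs up, so it is only a simplification of the decoding the annotation describes.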
Most analysis rules are semantic-based. Decoding is basically left-right, bottom-up.
Expectation (backup) is handled by "rule instance records" which can be extended: breadth-first search. Decoding is also handled by production systems. But, some care is necessary to handle the ordering of productions. Present system has 300 records, 800 rules. Task is to construct a GPSS simulation program from an English description of a simple queueing problem. Claims it is similar to augmented transition networks.
predicates), and conceptual (can be seen as a "deep structure").
The transformation from string to net is via an augmented transition network; however, the actions build up a semantic net, rather than phrase markers. English sentences are generated from the semantic nets as follows. Input is a semantic structure, together with a list of the desired constraints on modality (e.g. generate a question about the theme). After selecting a verb paradigm pattern based on the constraints, the pattern is input to an augmented transition network. Arcs generate output by computing actions based on the pattern and the input structure.
Answering questions: These "semantic" nets only abstract the syntactic relations from sentences. No attempt is made to abstract lexical equivalences (e.g. "lose", "defeat"). Thus, needs paraphrase rules in order to handle mapping of the cases from one verb to a synonym (e.g. "He was defeated", "He suffered defeat"). Some such rules need many conditions to allow the map; usually, these are words with many senses.
Paraphrase is accomplished using augmented transition networks: each paraphrase rule becomes a small augmented transition network. However, several other programs for pattern matching are also necessary (and are given in an appendix). Question-answering is done by matching the case tokens of the input with case tokens of stored assertions, or their paraphrases. A large database thus requires that each word list all the structures it appears in, as well as all the structures it can be paraphrased to.
Notes that paraphrase can be recursive, and combinatorially endless.
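Question-answering by case-token matching, with a paraphrase rule that remaps cases from one verb to another, might look like the sketch below. The case names, the stored assertion, and the single rule relating "defeat" to "lose" are hypothetical, and plain dictionaries stand in for the semantic nets and transition networks of the original. Because paraphrase can be recursive, the sketch deliberately applies at most one paraphrase step.

```python
# Stored assertion as a verb plus case tokens (case names are assumptions).
ASSERTIONS = [
    {"verb": "defeat", "agent": "Wellington", "object": "Napoleon"},
]

# Hypothetical paraphrase rule: (verb, new verb, mapping old case -> new case).
PARAPHRASES = [
    ("defeat", "lose", {"agent": "to", "object": "agent"}),
]


def paraphrase(assertion):
    """Yield the assertion and every one-step paraphrase of it."""
    yield assertion
    for verb, new_verb, case_map in PARAPHRASES:
        if assertion["verb"] == verb:
            new = {"verb": new_verb}
            for case, new_case in case_map.items():
                if case in assertion:
                    new[new_case] = assertion[case]
            yield new


def answer(question):
    """Match question case tokens; '?' marks the case being asked about."""
    unknown = [c for c, v in question.items() if v == "?"]
    known = {c: v for c, v in question.items() if v != "?"}
    for stored in ASSERTIONS:
        for form in paraphrase(stored):
            if all(form.get(c) == v for c, v in known.items()) and \
               all(c in form for c in unknown):
                return {c: form[c] for c in unknown}
    return None
```

For example, `answer({"verb": "lose", "agent": "?", "to": "Wellington"})` is resolved through the paraphrase of the stored "defeat" assertion rather than through any stored "lose" assertion.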

An exposition on the philosophies of semantic primitives, and the methods for judging their effectiveness.
Schank's and Wilks' are the only systems with semantic primitives. Schank's is mixed: has primitives, plus English words. Claims such surface words should be allowed only if defined in primitives, perhaps "reentrantly", as in a dictionary. Adopts the new view that all primitives are a micro-language, that is, a natural language in themselves, with all the natural language problems. Thus, no justification on basis of size or composition of vocabulary is meaningful (as it would not be with English itself).
Ultimate test of a primitive system is the performance. Compared his list of primitives with the SDC dictionary, which listed frequency of words used to define other words.
Agreed approximately, up to the 80 or so primitives he used. One intuitive test of a good primitive choice: does it allow for interesting semantic generalizations?

An exposition of the analysis portion of a semantics-based English-to-French machine translation system.
System is based neither on linguistics, nor on theorem proving. Mechanical translation of English to French is a major test of semantic understanding. Justifies lack of a deductive system by claiming that it is false that "principles of logic play an essential role in our description of the world." Uses, instead, "commonsense" inference rules, which also are input to the system as English sentences.
System consists of "templates" bound together by "paraplates" and inference rules.
All three data types are composed of about 60 "elements" (semantic primitives). Input words are replaced by "formulae", which are binary trees of semantic primitives.
System uses preference rather than semantic restriction. Templates are basic actor-action-object triples which locate the "usual conversational" kernel messages implicit in a sentence (e.g. "He is good" is a form of the template "Man Be Kind"), and disambiguate word senses. Templates are stored in a BNF. "Only defense of choice of primitives is that a system actually works." Analysis of kernel sentences proceeds as follows: Words are expanded into their stored formulae and formulae "heads" (prime primitives) are used to select a subset of templates.
Templates are expanded by substituting, for their three elements, the formulae of all words in the sentence which contain those elements as their heads.
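Choosing among word senses by preference rather than by strict restriction can be sketched as follows: every combination of senses whose heads fit an actor-action-object template is kept, and the reading that satisfies the most preferences wins. The dictionary, the primitives (MAN, ASK, THING), the template inventory, and the example words are hypothetical, and each formula is reduced here to a head plus a few preferences rather than a binary tree of primitives.

```python
from itertools import product

# Hypothetical dictionary: word -> list of senses. Each sense has a formula
# head (its principal primitive) and, for actions, preferred actor/object heads.
SENSES = {
    "policeman":   [{"gloss": "officer", "head": "MAN"}],
    "interrogate": [{"gloss": "question", "head": "ASK",
                     "prefers": {"actor": "MAN", "object": "MAN"}}],
    "crook":       [{"gloss": "criminal", "head": "MAN"},
                    {"gloss": "shepherd's staff", "head": "THING"}],
}
# Bare templates: permitted actor-action-object triples of heads.
TEMPLATES = {("MAN", "ASK", "MAN"), ("MAN", "ASK", "THING")}


def density(actor, action, obj):
    """Count satisfied preferences; unsatisfied ones are not fatal."""
    prefs = action.get("prefers", {})
    return (prefs.get("actor") == actor["head"]) + \
           (prefs.get("object") == obj["head"])


def best_reading(actor_word, action_word, object_word):
    """Return (score, glosses) for the template match with most preferences met."""
    candidates = []
    for a, v, o in product(SENSES[actor_word], SENSES[action_word],
                           SENSES[object_word]):
        if (a["head"], v["head"], o["head"]) in TEMPLATES:
            candidates.append((density(a, v, o), a, v, o))
    if not candidates:
        return None
    score, a, v, o = max(candidates, key=lambda c: c[0])
    return score, (a["gloss"], v["gloss"], o["gloss"])
```

Here the "shepherd's staff" reading of "crook" still matches a template, so it is not rejected outright; it simply loses to the reading that satisfies more preferences, which is the difference between preference and restriction.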
"Density" of preference satisfactions (i.e., number of matching elements) within templates indicates proper parse. System makes no syntax-semantics distinction. It first fragments a paragraph of text by keywords into kernel sentences, expands them, resolves anaphora by "tie" routines which apply "paraplates" (semantic filters) between kernel sentence templates.
Paraplates, which resolve prepositional modification, are ordered by semantic density of content: the most specific senses of the prepositions are tried first. Inference rules are tried only when paraplates fail to resolve anaphora. Inference is used to predict "missing" templates; shortest chain of missing connected templates is the best. Claims this methodology is superior to that of deductive programs, which work best on puzzles but not on natural input, the latter being based on preference semantics. [Wilks, 1973b] Y. Wilks, "The Stanford Machine Translation Project," Rustin, pp. 1973.

Outlines both the analysis and generation parts of a semantics-based English-to-French machine translation system; amplifies previous paper.
Opening remarks: Logical (predicate calculus) versus linguistic intermediate languages is not necessarily a conflict; the two representations reflect "two levels of human understanding". No strong syntax is necessary in the system. Uses semantic templates of form subject-verb-object, where some parts can be dummies (e.g. prepositions are considered "pseudoverbs", and have templates of the form dummy-preposition-object). Assumes a finite number of templates are adequate to represent "most" of "ordinary" English.
Fragmentation of input text is at punctuation, subjunction words, conjunction words, and prepositions. The final semantic representation consists of tied templates, rather than hierarchical structures. Does not claim universality of templates: "No inventory of templates can be proved to be correct." The French output dictionary is a list of pairs: a semantic form coupled with a French "stereotype", which contains implicit generation rules and actual French words.
The generation rules test case conditions and sometimes, as in the case of objects of verbs, search for other form-stereotype pairs. The most specific stereotype is always preferred.
The basic stereotype search is augmented by "concord" and "number" routines to handle the French inflections. Much procedural knowledge in stereotypes, however: "halt-points" in stereotypes prescan for special cases of word usage, to "handle linguistic idiosyncracy". In general, the more irregular the word, the more special information is in the stereotype, and less in any related modifying stereotypes.

An early version of conceptual dependency and its analysis of sentences.
Claims communication, not grammaticality, is key issue in natural language.
Expectation is a major element in understanding. Lists six types of expectation: sentential, conceptual, contextual, conversational, individual memory, cultural memory.
Outlines conceptual dependency theory, and its primitive conceptual acts. "Syntax is . . . a searching mechanism for already known semantic information." A primary problem is finding the verb; the system uses syntactic and conceptual heuristics.
Major problem of analysis is "extracting the presupposed information implicit in an utterance." Analysis uses one stereotyped, general implication chain of verbs to help fill empty conceptual slots in the conceptualization being built. R. C. Schank, "Identification of Conceptualizations Underlying Natural Language," Schank & Colby, pp. 187-247, 1973.
A detailed presentation of the fundamental theories and structures of conceptual dependency.
Seeks a representation of meaning in an unambiguous, language-free manner.
Syntax is not enough (e.g. "John's love of Mary was harmful." versus "John's can of beans was edible."). A natural language understanding system should never find more than one meaning at a time, as is the case with human linguistic expectation.
Sentences are mapped into conceptualizations consisting of nominals, acts, and modifiers. Acts are broken into primitives, to aid in paraphrasing. There exist basic conceptual rules for attaching various links and modifiers to the conceptual graph (tense, etc.). The conceptual level has its own syntax of permissible constructs and its own semantics of selectional filters.
The primitives of the theory include: relations of nominals (containment, location, possession), and conceptual cases (objective, recipient, directive, instrumental). The conceptual [...] time, and location. Notes that many verbs are descriptions of the relations of unknown actions (e.g. "prevent"), or the resulting states of such (e.g. "hurt"). Conceptual verbs (like "think") are handled by positing a "conscious processor" and "long term memory", to and from which conceptualizations are transferred. Physical actions have six basic primitive acts (e.g. move, ingest, etc.). Thus a total of 14 acts; inference rules are therefore not many in number.
Examples of (hypothetical) parses by machine.
Conceptual semantics eliminates troublesome parses; gives several examples of both syntactic and semantic ambiguity and similarity. Summary: The theory is based on the "moving about of ideas or physical objects."

[Schank, 1975b] R. C. Schank, "MARGIE, The Conceptual Approach to Language Processing, and Conceptual Dependency Theory," Schank, pp. 1-82, 1975.
The introduction to the three collected Margie theses; surveys conceptual dependency and its implementation.
The system inputs, paraphrases or infers, and outputs English sentences. Margie is a specific attempt to "model human psychological processes" through language-free meaning representations; language and thought are considered separable. Claims the best conceptual base form is the one which expresses the most information explicitly.
Analysis is based on expectation. "Semantic rules are preference rules that select the best syntactic combinations." Claims that the meaning representations that make inference easiest are probably the best.
Reviews conceptual dependency theory. Theory also contains several primitive physical and conceptual states (e.g. "joy"). Many examples of conceptual dependency graphs given; admits many sticky issues are unresolved. On inference: "The real meaning of a primitive act consists of the inferences that are likely to be true when the act is present." Each act generates its own set of inferences, both forward (i.e., consequences), and backward (antecedents, though this is generally harder). Inference is simplified considerably by the use of semantic primitives. A terse summary of the Margie system's three components.

2.5.2.1 MARGIE
System operates in either paraphrase or inference modes. Output module uses Simmons' program, with modifications. Reviews the conceptual dependency theory.
Analysis uses syntax only when all else fails; processing is highly specific to verb.

A summary of the thesis on the analysis portion of Margie.
Introduction: Role of syntax is small. No clear division kept between linguistic and non-linguistic knowledge.
Basic orientation: "The sentences understood are about human behavior." Analysis based on conceptual expectation. Admits of ad hoc approach: "The process of taking an example and expanding the vocabulary to handle it was the basic means of growth in the analyzer." Since the code is LISP, this usually had a procedural effect. Analyzer is a program monitor plus dictionary of about 60 verbs.
Overview: As a word is scanned, it adds requests to a request list. The request list is checked to see if any of the requests' conditions are satisfied. If so, then their associated programs are executed. Example: "John gave Mary a beating." Notes that each word can have several senses which must be distinguished. Analyzer has no backup; attempts to understand "while the sentence is being read." Claims it only has to worry about semantic ambiguity; semantics subsume syntactic ambiguity. Thus the analyzer only ever produces a single parse. Time is handled only in relative terms ("before", "after").
Overview of expectations and their associated programs ("actions"): They are much like augmented transition networks. Actions can modify almost everything; expectation (conditions) can be dependent on almost everything. Semantic features of a word are represented in conceptual dependency notation. Some syntax in the analyzer: three surface cases of subject, object, and recipient, determined merely by word order. No prepositions are ever considered, and noun pairs are not handled ("kitchen table").
Semantics of nouns are handled only superficially; stress is on verbs. Example: "give" has in its definition that the recipient is the first human noun phrase after the verb, and the object is the first physical object.
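The request mechanism described above can be illustrated with a small sketch. This is a minimal Python approximation under stated assumptions, not the LISP analyzer itself: the names `Request`, `FEATURES`, `LEXICON`, and `analyze`, the tiny feature table, and the slot names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical semantic features for a few nouns (illustrative only).
FEATURES: Dict[str, set] = {
    "John": {"human"},
    "Mary": {"human"},
    "book": {"physobj"},
}

@dataclass
class Request:
    """A condition on the scanned word plus an action run when it holds."""
    test: Callable[[str, dict], bool]
    action: Callable[[str, dict], None]

def fill(slot: str) -> Callable[[str, dict], None]:
    def action(word: str, concept: dict) -> None:
        concept[slot] = word
    return action

def wants(feature: str, slot: str) -> Callable[[str, dict], bool]:
    # Satisfied by the first word carrying the feature while the slot is empty.
    def test(word: str, concept: dict) -> bool:
        return slot not in concept and feature in FEATURES.get(word, set())
    return test

# Scanning a word adds its requests. "gave" expects the first human noun
# after the verb as recipient and the first physical object as object.
LEXICON: Dict[str, Callable[[], List[Request]]] = {
    "gave": lambda: [
        Request(wants("human", "recipient"), fill("recipient")),
        Request(wants("physobj", "object"), fill("object")),
    ],
}

def analyze(sentence: str) -> dict:
    """Single left-to-right pass with no backup, yielding one parse."""
    concept: dict = {}
    requests: List[Request] = []
    for word in sentence.split():
        # Check the pending requests against the newly scanned word.
        remaining = []
        for req in requests:
            if req.test(word, concept):
                req.action(word, concept)
            else:
                remaining.append(req)
        requests = remaining
        if word in LEXICON:
            concept["act"] = word
            requests.extend(LEXICON[word]())
        elif "act" not in concept and "human" in FEATURES.get(word, set()):
            # Surface subject is decided merely by word order.
            concept.setdefault("actor", word)
    return concept
```

Calling `analyze("John gave Mary the book")` fills the actor, act, recipient, and object slots in one pass; words with no features, such as "the", simply leave the pending requests untouched.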
Multi-sentence analysis: admits of its inadequate treatment.
Expectations are created between sentence pairs. The first sentence establishes preferences for the senses of a predefined class of verbs, in order to disambiguate them in the second (e.g. "John and Mary were racing. John beat Mary."). Notes that the only conjunction allowed is that between noun phrases.

[Riesbeck, 1975b] C. K. Riesbeck, "Computational Understanding," Schank & Nash-Webber, pp. 15-19, 1975.
A review and second look at the Margie analyzer, with future suggestions.
Claims comprehension is a memory process: basically simple mechanisms, with large data bases organized by key concepts. Sentence analysis is based on expectations.
Admits to "no good control over set of expectations." Therefore, is planning extensions to program. One is labelling expectations as to purpose, in order to delete them when no longer valid. ("Purpose" is the case slot to be filled). Also, will incorporate dependency information between expectations: what case slots are prerequisite to an expectation.

[Rieger, 1975] C. H. Rieger, "Conceptual Memory and Inference," 1975.

MARGIE: Inference
A summary of the thesis on the inference portion of Margie.
Introduction: All inferences are spontaneously generated. "This theory does not extend into the domain of deciding what is appropriate to say." Representation: Design criteria include language independence, and a psychological orientation. All concepts are stored in a fully-inverted data base for easy access. However, use of semantic "is-a" relations not well defined, mostly due to a lack of a taxonomy for nouns. Short term memory is simulated by "recency" tags. Beliefs and fact are distinguished by "truth" and "strength" tags. Inference chains are maintained together with "reason" and "offspring" lists. Real world knowledge is represented by patterns weighted by probability, which are matched against (e.g. "ingest person meat").
Inferences: Claims there is much subconscious, spontaneous (goal-less) inference to every stimulus; admits this psychology is "naive". The inferencing attempts to form "interesting" new relationships, in the manner of Quillian's expanding spheres.
Contrasts his form of inference with 1) inference at question time, as in Planner data bases, 2) demons of Charniak, and 3) theorem provers, which have no analogue of his fuzzy logic. Inferences confirm, contradict, or augment existing knowledge.
Mainstream inferences: 16 types, only six of which are detailed. Inference needed since language tends to be as economical as possible.
1) Specification inferences. The filling in of obligatory conceptual cases with specific objects, mostly by problem-specific heuristic programs ("inference molecules"). Returns, also, a "reasons" list, which allows interrogation of cause.
2) Causal and resultative inferences. Two types of inference: "cause" and "cancause", the latter being highly data-sensitive. Allows forward and backward causal chain expansion between two input conceptualizations, seeking a common intersection conceptualization.
3) Motivational inferences. Assumes "every real world action might have been volitional"; however, the motivation is inferred only if the actor could know about the results of his act. Special-purpose "normality molecules" rate the plausibility of generated motivations.
4) Functional inferences. When an object is wanted, its intended use is inferred. Unresolved are problems of knowing when to infer the more specific of functions (e.g. a newspaper used as a fly swatter).
5) Action prediction inferences. Inverse of motivational inference, via molecules attached to each central act. Calls specification molecules to flesh out the predicted actions. "Illustrates how very sensitive all inference molecules must be to features of the objects involved in their inferences."
6) Utterance intention inferences. "I can't X" is really a request for X, etc.
Open problem: how to handle the inferences that are derived from superfluous information (e.g. "Don't eat green gronks").
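The forward and backward causal chain expansion of item 2 can be pictured as two searches that meet at a shared conceptualization. The sketch below is an assumption-laden illustration, not the thesis program: the rule table `CAUSES`, the string encoding of conceptualizations, and the breadth-first order of expansion are all hypothetical choices made here.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical "cause" rules: each conceptualization maps to its consequences.
CAUSES: Dict[str, List[str]] = {
    "hit(John, Mary)": ["hurt(Mary)"],
    "hurt(Mary)": ["cry(Mary)", "angry(Mary)"],
    "sad(Mary)": ["cry(Mary)"],
}

def invert(rules: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Turn consequence rules into antecedent rules."""
    inverse: Dict[str, List[str]] = {}
    for cause, effects in rules.items():
        for effect in effects:
            inverse.setdefault(effect, []).append(cause)
    return inverse

def expand(start: str, rules: Dict[str, List[str]]) -> Dict[str, Optional[str]]:
    """Breadth-first expansion; records the predecessor of each node reached."""
    parent: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in rules.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent

def path_to(node: Optional[str], parent: Dict[str, Optional[str]]) -> List[str]:
    """Walk predecessor links from node back to the search's start."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def causal_chain(earlier: str, later: str) -> Optional[List[str]]:
    forward = expand(earlier, CAUSES)          # consequences of the first input
    backward = expand(later, invert(CAUSES))   # antecedents of the second input
    common = [n for n in forward if n in backward]
    if not common:
        return None                            # no intersection found
    meet = common[0]
    return path_to(meet, forward)[::-1] + path_to(meet, backward)[1:]
```

With these rules, `causal_chain("hit(John, Mary)", "cry(Mary)")` links the two inputs through `hurt(Mary)`, while two inputs with no shared conceptualization yield `None`.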
Inference-reference interaction: Problem is to disambiguate nouns. Process can involve arbitrarily many inferences, and the order of inferencing with respect to reference establishment varies. Solved by the creation of a temporary concept that is the intersection of the features of all possible referents. Inferencing now occurs, and the new inferenced information is checked against all candidates; the best match is the referent. Occasionally, normality molecules will aid inferencing in selecting the best candidate by making "most likely" inferences. This handles reference only locally, but claims the mechanism is general enough to work over several story lines.

[Goldman, 1975a] N. Goldman, "Conceptual Generation," Schank, pp. 289-371, 1975.

A summary of the thesis on the generator portion of Margie.
Introduction: Designed to be task and domain independent; used in Margie paraphrase and inference modes, and also to generate German output (machine translation). Overview: Word selection (mostly verb selection) is first step. Each verb has predicates ("defining characteristics") associated with it, which must be satisfied before the verb is chosen. Predicates may range over several conceptualizations in the input, or in the world model (e.g. "gave" versus "returned", "threaten" versus "promise"; depends on the "conceptual context"). Overriding philosophy: "A good generator will maximize the amount of structure encoded in the words it chooses." Second step is syntax representation. Words are tied into syntactic networks of a weaker form than Simmons'; they have no conceptual significance. These networks determine the grammatical transformations (infinitive form, etc.) and word order. Each verb has associated with it the appropriate skeletal syntactic net. Augmented finite state transition networks produce the output.

Fine structure of the generator: Verb selection is via discrimination nets. Discrimination trees are binary trees with predicates at each node, which further specify the path to be taken to the terminal node specifying the output.
The predicates check various "fields" of a conceptualization, by pattern matching and by inquiries into the world model. Some of these inquiries require deduction, which is not well handled. The discrimination trees are actually discrimination nets, as they have cycles to allow backup; they are hand crafted to prohibit looping. Fifteen nets, one for each major verb category. Admits of incompleteness of the nets, and of conceptual dependency itself.
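Verb selection by discrimination net can be sketched as a walk down a binary tree of predicates. The following is a simplified illustration under assumptions: the `Node` class, the two predicates, the field names of the conceptualization, and the `former_owners` world-model entry are hypothetical, and the sketch is a plain tree without the backup cycles the source mentions.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Node:
    """One binary decision: a predicate and the branch taken for each outcome."""
    predicate: Callable[[dict, dict], bool]
    if_true: Union["Node", str]    # a str is a terminal naming the chosen verb
    if_false: Union["Node", str]

def recipient_in_focus(concept: dict, world: dict) -> bool:
    # Pattern match on a field of the conceptualization.
    return concept.get("focus") == concept["to"]

def recipient_owned_it_before(concept: dict, world: dict) -> bool:
    # Inquiry into the (hypothetical) world model.
    return (concept["object"], concept["to"]) in world.get("former_owners", set())

# Hypothetical net for one verb category: transfer of possession.
TRANSFER_NET = Node(
    recipient_in_focus,
    "receive",
    Node(recipient_owned_it_before, "return", "give"),
)

def select_verb(net: Union[Node, str], concept: dict, world: dict) -> str:
    """Follow predicate outcomes until a terminal verb is reached."""
    node = net
    while isinstance(node, Node):
        node = node.if_true if node.predicate(concept, world) else node.if_false
    return node
```

The same conceptualization thus comes out as "give" or "return" depending only on what the world model says about earlier ownership, which mirrors the "gave" versus "returned" contrast in the overview.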
Once verb is found, there is a pointer to a "concexicon" entry which holds the basic syntactic framework, plus programs for filling it. Scales of relative amounts are used for adjective selection. There are seven scales: health, joy, anger, excitation, physical state, size, and certainty; admits they are ad hoc. Language-specific functions are necessary to add language-specific information to the syntactic nets: for tense, determiners, possession, form (e.g. progressive tenses), and mood and voice (which are not actually handled).
An augmented finite state transition network generates output, based on Simmons' programs. Uses three separate "constructors" for verb strings, noun phrase strings, and sentences.
Strictly a performance grammar, and admits to it being limited.

A review of some open problems in language generation.
Few have addressed the problem of "what constitutes a context requiring a natural language output." Most concern is with the representation in syntactic structures, in semantic nets, or in conceptual nets, or with the contextual effects on the utterance produced. Claims the assumption of single-sentence output is oversimplified. Reviews thesis work: representation is free of actual "words" and syntax, both of which must be reapplied. Notes that the conceptual nets have been designed to aid inference, not analysis or generation. Asserts that a model of the intended recipient's present state of understanding would aid generation greatly, but none exists yet.

[Schank, 1975a] R. C. Schank, "The Structure of Episodes in Memory," Bobrow & Collins, pp. 237-272, 1975.
Although paragraph understanding is not implemented yet, asserts that understanding "is, in large part, the assigning of new input conceptualizations to causal sequences and in the inference of remembered conceptualizations which will allow for complete causal chains."

[Abelson, 1975a] R. P. Abelson, "Concepts for Representing Mundane Reality in Plans," Bobrow & Collins, pp. 273-309, 1975.

Scripts
An outline of a system of primitives for expressing abstract state changes.
Concern is with belief systems; is conceptually close to Schank. Major parts of theory are: scripts (stereotyped action sequences), themes (related scripts), and dremes (attitudes toward themes). Cites the contrast between systems dealing with small worlds of complete knowledge, and systems dealing with big worlds of scattered knowledge.
Favors the domain of political ideologues as a compromise. Theory primitives are nine "delt-acts", that is, acts which affect a change: for example, delt-proximity, delt-quality.
Primitives are much higher level than Schank.
Plans are sequences of desired state changes. Some problems remain: time passage is not formalized, and goals are not formalized.

R. C. Schank, and R. P. Abelson, "Scripts, Plans, and Knowledge," IJCAI4, pp. 151-157, 1975.

Presents a theory for understanding stereotyped and/or purposeful human activity.
Claims eventual limit to natural language understanding is the ability to characterize world knowledge. Defines understanding as "the fitting of new information into a previously organized view of the world". A script is a stereotyped sequence of actions in a context. There are many of them; some interact. Actions are linked by "causal chaining". The most interest, however, comes from deviations from the script. Script headers define the circumstances which fire the script. "What if" parts of the script handle obstacles or error. Reviews the program Sam: it instantiates a script, and makes inferences to complete causal chains.
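Script application as summarized above can be sketched in a few lines. This is a rough illustration under assumptions, not the Sam program: the `RESTAURANT` script, its header words, the event names, and the functions `fires` and `instantiate` are hypothetical, and events are plain strings rather than conceptual dependency structures.

```python
from typing import List, Tuple

# Hypothetical restaurant script: header words that fire it, and its
# stereotyped event sequence in causal order.
RESTAURANT = {
    "header": {"restaurant", "waiter", "menu"},
    "events": ["enter", "sit", "order", "eat", "pay", "leave"],
}

def fires(script: dict, words: List[str]) -> bool:
    """The script fires when any header word appears in the input."""
    return bool(script["header"] & set(words))

def instantiate(script: dict, story: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (completed chain, inferred events, deviations from the script)."""
    events = script["events"]
    mentioned = [e for e in story if e in events]
    deviations = [e for e in story if e not in events]
    if not mentioned:
        return [], [], deviations
    positions = [events.index(e) for e in mentioned]
    # Complete the chain between the earliest and latest mentioned events.
    chain = events[min(positions): max(positions) + 1]
    inferred = [e for e in chain if e not in mentioned]
    return chain, inferred, deviations
```

For a story that mentions only entering, ordering, complaining, and paying, the sketch infers the unmentioned "sit" and "eat" steps and sets "complain" aside as a deviation, which is where the summary locates most of the interest.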
Plans are a sequence of actions to realize a goal; they are infrequently used scripts. Composed of five primitive "deltacts". Each plan has a "plan box" associated with it which lists actions that achieve the goal; this list enables inferences. Pam, a planned program, is to handle plans. Claims "good forgetting is the key to remembering." Proposes to remember only a (non-script) event list, a goal list, a plan list, and a "weird list" of script deviations. Plans to "normalize" scenarios by replacing event lists and plan lists with pointers to "prototypes".

[Klein, 1975] S. Klein, "Meta-compiling Text Grammars as a Model for Human Behavior," Schank & Nash-Webber, pp. 94-98, 1975.

Outlines a very ambitious theory of human understanding, learning, and language.
Text grammars generate stories, somewhat like a script. Major concern, however, is with behavior transmission across generations. Wants to simulate the understanding, incorporation, and transmission of grammatical knowledge through simulated consciousnesses.
Grammars are to be transmitted through example, inferred, and corrected through various interactions. Claims "it is the concepts of time and metacompiling that appear to be the fundamental aspects of human cognition." Example program creates many folk tales from a text grammar.
2.5.3.1 SAM
[Lehnert, 1975] W. Lehnert [...] answer "what happened when" and "why" questions. The latter can be script-based or not, though only the script-based ones are well handled, by using both the temporal organization and goal sub-structure organization of the script. The script thus directs inference by focusing on variability. Claims system shows the power of episodic organization of knowledge, although it also incorporates semantic knowledge in the conceptual dependency framework.

E. Charniak, "Inference and Knowledge," Charniak & Wilks, pp. 1-21 & 129-154, 1976.

2.6. Inferencing Systems
Two chapters of a textbook, exploring the "narrower question of how knowledge is used to make inferences"; includes much of the second and third papers below.
Analyzes several systems of inference according to five aspects: 1) semantic representation used, 2) mechanism of inference triggering, 3) organization of programs and data, 4) inference mechanisms themselves, 5) content of the knowledge represented. First order predicate calculus and Planner are examined with respect to the above five criteria. A primary question is: When are inferences made, at question time or read time? Claims there is general agreement that some must be done at read time.
Further question about read time inferences: How many should be made?
Distinguishes "problem occasioned inferences" (to resolve anaphora), from all else ("keeping up" with a story). Claims non-problem occasioned inferences must be made, too.
Reviews his own thesis work, McDermott's Tople system, and Rieger's portion of Margie. Criticizes Rieger for his use of single sentences, simple actions ("hit"), and an unrestrained amount of inferences. The five criteria are applied to Charniak's thesis: 1) non-primitive semantics, 2) read-time inference triggering, 3) Planner procedures, 4) inference by demons, 5) no claims about content. Also applied to Rieger: 1) conceptual dependency representation, 2) many read time inferences (16 types), 3) organized by inference and normality molecules (similar to Charniak's base routines and fact finders), 4) procedural inference, 5) no claims on content.
Frames, as applied to natural language, are reviewed. There are four basic types for language: syntactic, semantic, ones for stereotyped events, and ones for communication conventions. Claims Schank's scripts are frames.

[Abelson, 1975b] R. P. Abelson, "The Reasoner and the Inferencer Don't Talk Much to Each Other," Schank & Nash-Webber, pp. 183-187, 1975.

Some reflections on the philosophies of inference, and their problems.
Claims reasoning is formal, but inferencing is "commonsensical"; the two may be the same, though no one knows. A distinction is certainly true for humans; concrete information is used in favor of statistical, and the two types don't seem to combine readily. Gives interesting (human) examples, and asks if AI should simulate the dichotomy. Some methodological comments follow. A problem with AI is its diversity of problem contexts; claims there is a "tacit agreement that it is OK for everyone to define his own area." But by using his intention primitives, which represent state changes (nine "deltacts"), he can show similarities between the supermarket frames of Charniak ("fetching food") and the table top of Winograd ("fetching blocks"). However, claims that these primitives may be at too high a level, and not detailed enough, to actually use.

A summary of some of his thesis work on inference.
Major concern is the organization of common sense knowledge to answer questions about children's stories. Not strictly natural language understanding: sentences are hand-encoded.
System flow: When a given topic is explicitly mentioned, its associated "base routines" set up "demons" which lie in wait for related events to occur in the following text.
One problem: how to remove old demons (which may fire inappropriately, causing "misunderstanding", and which are inefficient). Inadequate solution is to remove them after N lines. System also includes "bookkeeping" routines to handle temporal relations, and "fact finders" to use standard inferencing (Planner) techniques. Some situations, especially those with both motives and results, are best handled by having demons call up other demons; each demon is a different abstraction of the situation.
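The demon mechanism, including the "remove after N lines" expedient, can be sketched as follows. This is a toy illustration under assumptions, not the thesis system: the `Demon` class, the `BASE_ROUTINES` table, and the function `read_story` are hypothetical, and substring matching on English text stands in for matching against hand-encoded assertions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Demon:
    pattern: str      # event the demon lies in wait for
    conclusion: str   # inference asserted when the demon fires
    born: int         # line number at which the demon was set up

# Hypothetical base routines: a topic and the demons it sets up.
BASE_ROUTINES: Dict[str, List[Tuple[str, str]]] = {
    "birthday party": [("present", "the present is for the birthday child")],
}

def read_story(lines: List[str], max_age: int) -> List[Tuple[int, str]]:
    """Read line by line, returning (line number, inference) pairs."""
    demons: List[Demon] = []
    inferences: List[Tuple[int, str]] = []
    for lineno, line in enumerate(lines):
        # Remove demons older than max_age lines (the admittedly crude fix).
        demons = [d for d in demons if lineno - d.born <= max_age]
        # Waiting demons fire on related events in the following text.
        for d in demons:
            if d.pattern in line:
                inferences.append((lineno, d.conclusion))
        # An explicitly mentioned topic runs its base routine, adding demons.
        for topic, specs in BASE_ROUTINES.items():
            if topic in line:
                demons.extend(Demon(p, c, lineno) for p, c in specs)
    return inferences
```

Varying `max_age` shows the trade-off the summary points to: a small value drops the demon before the present is mentioned, while a large value keeps stale demons around to fire inappropriately.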

Presents a complete (theoretical) reworking of his theory of inference.
"Understanding a line of a story is to see it as instantiating one or more frame statements" of a frame. Gives several case analyses of frame problems using the scenario of shopping. A key problem: Given a statement, which frame statement is instantiated? Which of the frames themselves are active depends on "key concepts" "triggering" a frame; frame is then searched for a frame statement matching the story statement. Claims approach is better than demons: frames are more general, and can be used in multiple ways. For example, if frame statements are considered states to be achieved, they can be used to problem solve. Some additional problems: How many frames should there be, and how much is shared between them? Thus, in his example, the frame for shopping is augmented with a frame for a carry-cart, and common frame statements are shared via reference pointers.
The question of "read time" versus "question time" inferencing is not as serious as the problem of which inferences should be made. His answer: those inferences which serve to link frames (i.e., those that serve two purposes: e.g. completing a subframe, and filling in a frame statement in the main frame). Formally abandons the demon approach. One major problem with it was that topics had to be explicitly mentioned (not inferred). Claims that frames can handle the passage of time better, as they have room for the inclusion of "progress pointers" tracking the achievement of frame (script-like) events.

A description of the Scholar program, and an investigation of various aspects of human semantic information and inference.
Major concern is the representation of information "in ways that are natural to people". Vehicle is Scholar program with "mixed initiative"; that is, is not merely a question answering program. Domain is computer aided instruction. Uses semantic nets, with hierarchical structure and "importance" ratings on the information content of nodes. Characterizes natural semantic information as 1) fuzzy (e.g. "large"), 2) incomplete, 3) contextual (handled in the system by checking the "importance ratings" of terms referenced by the speaker: nonexperts tend to stay at high-importance levels), 4) in an open world (the problem is when to say "I don't know" if knowledge cannot be exhaustive), 5) with vague truthfulness ("often true"), and 6) vague quantification ("some"). Uncertainty is handled with "uncertainty ratings". Natural inferences are 1) deductive (using hierarchy relations), 2) negative (inferred through contradictions), 3) functional (i.e. procedural: e.g. using latitude to predict climate), and 4) inductive (not well understood).

[Brown et al., 1975] J. S. Brown, and R. R. Burton, "Multiple Representations of Knowledge for Tutorial Reasoning," Bobrow & Collins, pp. 311-349, 1975.

SOPHIE
A description of the multiple knowledge sources, and problems, of the Sophie system.
Task domain is the computer-aided instruction of fault-finding in transistor circuits.
Uses many types of knowledge: simulation, heuristic "procedural specialists" for various circuit components, and semantic nets (for static information). Input is parsed according to a semantic grammar. Grammar is also used to handle anaphora (semantic classes are used as filters on possible referents), for deletions, and for ellipsis.
System is specifically designed for real time usage. An event history list also helps resolve ellipsis. System exploits the fact that inference in this domain can be achieved by (heuristic) simulation. Can even determine, by using resolution theorem proving, which requested circuit measurements would not add to the student's knowledge of the circuit problem. Claims: "Computational linguistics has yet to find its paradigm," since it was difficult to find a good framework in which to analyze some 200 actual dialogues. Calls for more empirical research in natural (not written) discourse.

Presents a list of eight design criteria with which understanding systems can be judged, and presents the Merlin system.
Task of Merlin is the understanding of AI, through the understanding of AI programs.
Definition of "understand": "S understands knowledge K if S uses K whenever appropriate." Notes that the presence of knowledge can be investigated directly in computer programs. "Appropriate" defined as "goal-serving".
Understanding is difficult to test, as it requires a diversity of tasks. "Understanding may be partial both in extent and in immediacy." However, one possible test of understanding may be the understanding of natural language, which implies much understanding at large. Another test would be to satisfy a taxonomy of functional specifications that any understander is required to have; however, no such taxonomy exists.
In lieu of such a taxonomy, a taxonomy of design issues is proposed. Dimensions are:
1) Representation: with associated problems of scope, grain size, and multiple representations.
2) Action: output, and evocation of executable procedures.
3) Assimilation: input, and structuring of environment to existing representations.
4) Accommodation: the building of new internal structures, rather than the instantiation of old ones.
5) Directionality: goal directedness and "keep-going" ability.
6) Efficiency: including the possible problems of interpreters, general methods, and highly formal systems.
7) Error handling: including the "frame problem".
8) Depth of understanding: Merlin itself uses beta structures to understand.
A beta-structure is recursively defined as: "α: [β α1 α2 . . .]". That is, "α can be viewed as a β if it is further specified according to α1, α2, etc." Beta structures form a hierarchical knowledge net; however, the system does not make any deliberate generic-individual distinction. Structures can be mapped to one another. That is, beta-structure X can be viewed as a mapped version of beta-structure Y. This mapping is more powerful than general matching, since it can invoke the knowledge net hierarchy, and reinterpret any constituent beta structure. Merlin's use in problem solving: a problem is solved by attempting to see the current situation as a goal, and performing the necessary mapping. This imposes problem-solving mappings on the current situation's constituents.

Poses and answers three common objections to natural language understanding research.
Basic methodological disagreement: "Is there a science of language?" Three arguments: 1) Concerning theory and practice: "More theory is needed." Answered by: success in a task is the best test of a theory, not the theory's intuitive appeal. 2) Concerning AI and science: "Approximate success won't do." Answered by: AI is engineering. Easily constructed counterexamples do not, as in physics, overthrow what has been formalized. Due to nature of language, there is no boundary to natural language understanding, so no complete theory is possible. 3) Concerning where to start: "First need a theory of reasoning." Answered by: if so, then no one can understand anything unless he understands all.

[Mann, 1975] W. C. Mann, "Improving Methodology in Natural Language Processing," Schank & Nash-Webber, pp. 140-143, 1975.

Suggests many ways in which natural language understanding can be made more of a scientific endeavor.
Claims: "The style of research is the least flexible of precedents." Thus, natural language faces two problems: rigor and complexity. Parodies current research: select a phenomenon, an input form, and an output form; code it; debug it on "examples of opportunity"; publish. "The activity is often treated as programming . . . rather than science." One problem is that the unit of production is the system, instead of the algorithm. Another is that the analyses usually center on only one of the processors in the intrinsically two-processor communication situation. Suggests the case analysis approach: data acquisition, phenomenon identification, case modeling, and model evaluation against the original data corpus.

W. A. Woods, "A Personal View of Natural Language
An essay on what things are still required for a good natural language understanding system.
A good natural language understander must adequately handle: anaphora and ambiguity, quantification, adjectival and relative clauses, adverbs, conjunction and negation, time and tense, and paraphrases. Stresses the need for "practical theoretical solutions". One unresolved problem: The knowledge formulation must be flexible enough to allow eventual "closure", naturally. How to measure success and progress is difficult: there is no taxonomy of linguistic phenomena, and "perspicacity" of a system or a method is difficult to quantify.
A review of various issues in speech understanding research.
Speech understanding as a research endeavor started about 1956. Its "dogmas": 1) The one performance criterion is understanding the message.
2) All sources of knowledge must be used.
3) The speech signal alone hasn't enough information.
Outlines the structure of the task: "At present, there is no universal representation of meaning." Knowledge sources' knowledge is similar to linguistic "competence". Some mechanisms for converting knowledge to action: partial knowledge representations, combinatorial spaces, generative to analytic representation conversions, time versus frequency representations, matching algorithms, control of focus, multiple knowledge sources. Systems can be specified by ARPA's 19 dimensions, and by the system structure (hardware) and knowledge sources required.
Performance evaluation important: recall that the goal is a speech front end, not a system in itself. Can evaluate systems using benchmarks, operations research models, analysis of algorithms, null models (e.g. Dragon, Tech: relatively straightforward), optimal models (few exist), ablation studies (requires decomposability), analysis of variation, causality analysis (i.e. traditional debugging). Cites two tensions in the field: 1) interdisciplinarity, 2) general versus knowledge-specific mechanisms. Eventual scientific payoff includes: 1) understanding of human speech understanding, 2) formalization of influences on speech signal, 3) AI's first multiple knowledge source system, 4) disproof of statements that machines recognize with difficulty, 5) reinstrumentation of speech research. One practical payoff: can speak to computers.

D. R. Reddy, "Speech Recognition by Machine: A Review," Proc. IEEE, Vol. 64, No. 4, pp. 501-531, April, 1976.
Reviews several systems and their components, pointing out future directions for each.
All current speech understanding systems are "restricted speech understanding systems"; the restriction is the necessary use of task-specific information.
Little common data, so comparisons between systems are difficult.
Connected speech recognition: difficult, since word junctures are not clear, and pronunciations vary with context; an "analyze and describe" paradigm is necessary, since the data is combinatorially large (no pattern recognition possible). In this class: Hearsay I, Dragon, Lincoln Labs' system, International Business Machines system. Knowledge is usually phonological rules, lexicon, and syntax. (The IBM system has independent representations of language, phonology, and acoustic components, versus Dragon's uniform representation.) language definition used to integrate knowledge sources; task is the management of a submarine data-base. BBN: developed using "incremental simulation"; augmented transition network is a basic component; task is a travel budget manager. Hearsay II: uses a "blackboard" model, and an hypothesize and test paradigm; task is news retrieval.
Vocabulary is the primary source of restriction; confusability of words is key factor.
Unstressed function words always a problem. Syntax: primarily a search reducer; restricts possible alternatives; measurable in terms of "branching factor". Most common is a network representation, including augmented transition networks; second is a Markov process model. Semantics, "rules and relationships associated with the meaning of symbols": another search space reducer. Semantic nets primary.
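The idea of syntax as a search reducer, measured by a branching factor, can be illustrated with a small word network. The following is a minimal sketch only: the grammar is a hypothetical toy, and the branching factor is taken here as the mean number of successor words per non-final state, which may differ from the measures used in the reviewed systems.

```python
# Hypothetical toy finite-state grammar: state -> {word: next_state}.
GRAMMAR = {
    "S0": {"bishop": "S1", "knight": "S1", "queen": "S1", "pawn": "S1"},
    "S1": {"to": "S2", "takes": "S2"},
    "S2": {"queen": "S3", "king": "S3"},
    "S3": {},  # final state: no successors
}


def average_branching_factor(grammar):
    """Mean number of word choices over states that have any successor."""
    counts = [len(arcs) for arcs in grammar.values() if arcs]
    return sum(counts) / len(counts)


print(average_branching_factor(GRAMMAR))
```

A smaller value means the grammar leaves fewer candidate words for the acoustic levels to tell apart at each point in the utterance.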
A review of knowledge representation issues, using as an example Hearsay I under the voice-chess task.
Voice-chess was chosen since its syntax, semantics, and vocabulary are limited and well-defined. Some problems encountered in speech: 1) high data rate and large amounts of data, 2) errorful input, 3) real time response required. Uses separate knowledge sources and the "blackboard". Semantics module can rely on the fact that all a priori knowledge (chess rules) and all situational knowledge (the board state) are well defined. Even contains a primitive speaker model, in that the Tech chess-playing program ranks possible utterances for utility in the game. Syntax uses a context-free grammar, with "backward" "antiproductions" to predict from a given word permissible left and right word juxtapositions. Lexical knowledge has 31 words; uses knowledge of which syllables are stressed to help acoustic match. Presents a case study of "bishop to queen knight three".
Contrasts psychological active (motor) theories to passive (pattern recognition) theories; Hearsay is a blend. Claims pure analysis by synthesis is an unlikely model, due to efficiency considerations. Tabulates a taxonomy of types of knowledge necessary: at each level of speech processing, there are task, discourse, speaker, and analysis-dependent aspects of knowledge.

Overviews of Specific Issues
3.1.1 Organization and Control
D. R. Reddy and L. D. Erman, "Tutorial on System Organization for Speech Understanding," Reddy, pp. 457-480, 1975.
A review of some of the more practical aspects of speech understanding research.
Knowledge representation: In speech, one can exploit the well-defined linguistic levels; the units of knowledge in a higher level encompass more of the utterance. (Prosodics, however, is not a level.) Error is ubiquitous; representations must be flexible.
Semantic nets, augmented transition networks, production systems, and procedural embeddings are possible.
Flow of control: hierarchy, heterarchy (sometimes based on incremental simulations), and blackboard have been used. Search is either by dynamic programming (conceptually, in parallel) or best-first search.
Research facilities required: real time input, quick tailoring of program parameters via "cliche" files, interactive debugging at a functional level, the handling of unplanned interrupts by user. Various types of performance analyses reviewed. Critical dimensions are accuracy, time, and space. Ablation experiments, "incremental improvement analysis" from studies of knowledge source interaction, and algorithmic analyses are possible. [Woods, 1975b] W. A. Woods, "Syntax, Semantics, and Speech," Reddy, pp. 345-400, 1975.

Syntax and Semantics
A review of some of the applications of computational linguistics to speech understanding systems.
Part I: Syntax. Reviews syntactic analysis schemes: phrase structure grammars (rewrite rules) and the Chomsky hierarchy of automata. Nondeterministic machines simulated using backtracking or parallelism: analysis is top-down, bottom-up, or mixed; predictive or not. However, in speech, phonological effects at the beginning or end of sentences have a bad effect on fixed-order parsers. "Chart parsers" use word lattices to record well-formed substrings and their hierarchic dependencies; output an exhaustive list of all possible parse components (e.g. "Time flies like an arrow.").
Earley's algorithm is a fast hybrid chart parser. However, a further problem in speech and natural language: languages are not context-free, and even the context-free part is complex.
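The notion of a chart that records every well-formed substring can be sketched with a simple bottom-up (CYK-style) table. This is not Earley's algorithm, and the lexicon and binary rules below are a hypothetical toy grammar chosen only so that "time flies like an arrow" comes out ambiguous.

```python
from collections import defaultdict

# Hypothetical toy lexicon and binary rules (not from the reviewed paper).
LEXICON = {
    "time": {"N", "NP", "V"},
    "flies": {"N", "NP", "V"},
    "like": {"V", "P"},
    "an": {"Det"},
    "arrow": {"N"},
}
RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),
    ("PP", "P", "NP"),
    ("VP", "V", "NP"),
    ("VP", "V", "PP"),
]


def build_chart(words):
    """chart[(i, j)][cat] = number of distinct parses of words[i:j] as cat."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for cat in LEXICON[word]:
            chart[(i, i + 1)][cat] = 1
    for length in range(2, n + 1):          # widen spans step by step
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):       # every split point of the span
                for parent, left, right in RULES:
                    count = chart[(i, k)].get(left, 0) * chart[(k, j)].get(right, 0)
                    if count:
                        chart[(i, j)][parent] += count
    return chart


chart = build_chart("time flies like an arrow".split())
print(dict(chart[(0, 5)]))
```

Under this toy grammar the full span holds two sentence readings ("time" as subject, or "time flies" as a compound noun subject), while smaller cells such as the one for "an arrow" keep the shared substrings that both readings reuse.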
Reviews use of transition network grammars: some arcs are labeled with phrase constituents ("push") allowing recursion and the merging of subparts of grammar.
Transformational grammars are inefficient, and only one (perhaps) running computer program exists. Augmented transition networks have registers, and conditions and actions on their arcs; they are equivalent in power to Turing machines. In speech, augmented transition networks can be followed forward or backward to predict words, especially unstressed "function" words (prepositions, etc.).
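The registers, conditions, and actions on arcs can be sketched with a very small network. This is a simplified illustration under stated assumptions: the lexicon, the state names, and the single "number" register are hypothetical, and push arcs (the recursive part of a full augmented transition network) are omitted.

```python
# Hypothetical lexicon: word -> (category, number feature).
LEXICON = {
    "the": ("det", None),
    "dog": ("noun", "sg"),
    "dogs": ("noun", "pl"),
    "barks": ("verb", "sg"),
    "bark": ("verb", "pl"),
}


def set_number(regs, feat):
    """Action: store the noun's number in a register."""
    regs["number"] = feat


def agrees(regs, feat):
    """Condition: the verb must match the stored number."""
    return regs.get("number") == feat


# state -> list of arcs: (category, condition, action, next_state)
NETWORK = {
    "S": [("det", None, None, "NP1")],
    "NP1": [("noun", None, set_number, "VP")],
    "VP": [("verb", agrees, None, "DONE")],
}


def accepts(words, state="S", regs=None):
    """Follow arcs forward, backtracking over alternatives."""
    regs = dict(regs or {})
    if not words:
        return state == "DONE"
    cat, feat = LEXICON[words[0]]
    for arc_cat, cond, action, nxt in NETWORK.get(state, []):
        if arc_cat != cat:
            continue
        if cond and not cond(regs, feat):
            continue
        new_regs = dict(regs)  # copy so failed paths leave no trace
        if action:
            action(new_regs, feat)
        if accepts(words[1:], nxt, new_regs):
            return True
    return False


print(accepts("the dog barks".split()), accepts("the dogs barks".split()))
```

The register is what lets an arc late in the network reject a word on the basis of something seen earlier, which a plain transition network cannot do.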
Part II: Semantics ("the relation of symbols to meaning"). Reviews procedural semantics, as used in Lunar and Winograd. Lunar has a predicate calculus-like notation, directly translatable into Lisp procedures. Allows intensional (theorem proving) and extensional (execution against data base) reasoning.
Semantic interpretation, via production rules, maps syntactic structures into procedures. Most such routines are verb-based.
Use of semantics in speech: Semantic selectional restrictions can be incorporated into the syntax to form "semantic grammars". But this fails to parse questions dealing with hypotheticality, or negation. Also fails for pronouns (no semantic classifications possible on the pronouns), and is inextensible. Would prefer something that also handles "default" word senses, and preferences. Cites use of semantics in speech for prediction as well as verification. Outlines the semantic nets of Quillian, where the meaning of X is considered the sum total of X's associated concepts. Such semantic associations can be used to predict; so can superset relations and inheritance of superset attributes. Distinctive for its list of the 19 parameter values that a successful speech understanding system should have after the five-year effort. Basic viewpoint: Errors that count are errors in task accomplishment. Four task domains suggested: 1) data base retrieval, 2) formatted data base entry ("voice key-punch"), 3) querying a computer system's status, 4) computer consultant (most ambitious of all). Each task is analyzed for possible control structures, and, at various speech levels (semantic, syntactic, lexical, etc.) for possible representations, knowledge and error sources, and problems. The 19 parameters are discussed in technical detail. [Baker, 1975] J. K. Baker, "Stochastic Modeling for Automatic Speech Understanding," Reddy, pp. 521-542, 1975. Markov process X. Uses Bayes' theorem to evaluate the probability that Y(i) came from X(n), given probabilities that X produces Y(i).

Reviews
Markovian assumption of memorylessness simplifies computation: assumes that only the previous state (and not the entire preceding sequence) generates a given state. Examples of uses of this type of computation: many "low level" speech tasks. Outlines the Dragon system, in which linguistic, lexical, phonological, acoustic-phonetic, and semantic information are incorporated. All of Dragon's knowledge sources are probabilistic models of Markov processes, organized in hierarchies; dynamic programming is used to search for the best match. Thus, it analyzes all possible pronunciations of all possible sentences: still, time for utterance is linear.
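The Bayes' theorem step mentioned in this entry can be sketched as follows: given prior probabilities for hidden states X and the probabilities that each state produces an observed value Y, compute the probability that the observation came from each state. The state names, observation labels, and numbers are hypothetical.

```python
def posterior(prior, likelihood, observation):
    """P(state | observation) via Bayes' theorem.

    prior[s]         = P(s)
    likelihood[s][y] = P(y | s)
    """
    joint = {s: prior[s] * likelihood[s][observation] for s in prior}
    total = sum(joint.values())  # P(observation)
    return {s: p / total for s, p in joint.items()}


# Hypothetical two-state example.
PRIOR = {"vowel": 0.5, "fricative": 0.5}
LIKELIHOOD = {
    "vowel": {"loud": 0.8, "quiet": 0.2},
    "fricative": {"loud": 0.25, "quiet": 0.75},
}

print(posterior(PRIOR, LIKELIHOOD, "loud"))
```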

An overview of the Dragon system.
Model is: probabilistic function of a Markov process, plus dynamic programming to search the space.
Recognition is linear in length of utterance; no combinatorial explosion. Stores a matrix for state-to-state transition probabilities. Signal match is via training, using Bayesian probabilities. Lexical knowledge is automatically compilable.
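Why recognition stays linear in utterance length can be seen in a minimal dynamic programming (Viterbi-style) sketch: each observation is processed once, and only the best path into each state is retained. The two-state model and its probabilities are hypothetical and far smaller than any real network.

```python
import math

# Hypothetical two-state model.
STATES = ("S1", "S2")
START = {"S1": 0.9, "S2": 0.1}
TRANS = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.1, "S2": 0.9}}
EMIT = {"S1": {"x": 0.8, "y": 0.2}, "S2": {"x": 0.3, "y": 0.7}}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[s] = (log probability of best path ending in s, that path)
    best = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:  # one pass: work grows linearly with length
        new_best = {}
        for s in states:
            score, path = max(
                ((best[p][0] + math.log(trans_p[p][s]), best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (score + math.log(emit_p[s][obs]), path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


print(viterbi(["x", "y", "y"], STATES, START, TRANS, EMIT))
```

The work per observation depends only on the size of the network, not on how many complete paths exist, which is how every path is implicitly considered without a combinatorial explosion.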
Uses a very flat (non-hierarchic) network. Syntax and semantics are mixed in "task grammar" (chess is example). Training data is used for transition probabilities and signal match. Uses purely declarative knowledge, and straightforward search. [Lowerre, 1976] B. T. Lowerre, "The Harpy Speech Recognition System," Ph.D. Thesis, Computer Science Dept., Carnegie-Mellon Univ., Pittsburgh, Pa., April, 1976.

Describes and criticizes Hearsay I and Dragon, as well as Harpy.
Harpy combines best features of Hearsay I and Dragon, though is most similar to latter.
Hearsay I uses procedural knowledge, best-first search, and segmentation.
Dragon uses Markov network with a priori probabilities, dynamic programming, and no segmentation. Harpy uses state transition network with data-dependent transition probabilities, heuristically modified dynamic programming, and segmentation. Also, simplifies the network by recognizing and coalescing common subnets, and includes word juncture phenomena in the network itself (Dragon had none).
Dragon system features include: probabilistic system of a Markov process; state probabilities of the network updated every 10 ms. Network contains all syntactic and phonetic knowledge, represented by inter- and intra-state transition probabilities.
Dynamic programming searches all paths, corresponding to all possible pronunciations of all possible sentences, to find best acoustic match. "Real action of the recognition process is due to the acoustic match probabilities".
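Where exhaustive dynamic programming keeps every state alive, a heuristically pruned variant keeps only the states that score near the current best at each step, which is the general kind of modification this entry attributes to Harpy. The sketch below is an assumption-laden illustration: the toy model, the `beam` parameter, and the use of transition probabilities (Harpy itself is described as having only zero-or-one arcs) are all hypothetical.

```python
import math

# Hypothetical two-state model.
STATES = ("S1", "S2")
START = {"S1": 0.9, "S2": 0.1}
TRANS = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.1, "S2": 0.9}}
EMIT = {"S1": {"x": 0.8, "y": 0.2}, "S2": {"x": 0.3, "y": 0.7}}


def beam_search(observations, states, start_p, trans_p, emit_p, beam=2.0):
    """Viterbi-style search that drops states scoring more than `beam`
    (natural-log units) below the best state at each step."""

    def prune(cands):
        top = max(score for score, _ in cands.values())
        return {s: v for s, v in cands.items() if v[0] >= top - beam}

    active = prune({
        s: (math.log(start_p[s] * emit_p[s][observations[0]]), [s])
        for s in states
    })
    for obs in observations[1:]:
        cands = {}
        for s in states:
            # Only surviving states may act as predecessors.
            score, path = max(
                ((sc + math.log(trans_p[p][s]), pa) for p, (sc, pa) in active.items()),
                key=lambda item: item[0],
            )
            cands[s] = (score + math.log(emit_p[s][obs]), path + [s])
        active = prune(cands)
    return max(active.values(), key=lambda item: item[0])[1]


print(beam_search(["x", "y", "y"], STATES, START, TRANS, EMIT))
```

Pruning trades the guarantee of finding the best path for speed; a threshold that is too tight can discard the correct path early.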
Harpy: no interstate probabilities, just arcs (i.e. probabilities are one or zero) and intrastate transitions are dynamically calculated by reference to a table of minimum and maximum phoneme durations (and a heuristic threshold). Uses segmentation: performance is critically dependent on there being no missing segments; extra ones are easily handled by the network, however. Segmentation is based on linear predictive coefficients and several heuristic thresholds. Searching is sped up by only examining the (heuristically defined) "best" states of the network at any one utterance segment. [Jelinek, 1976] F. Jelinek, "Continuous Speech Recognition by Statistical
Details the IBM series of speech recognition systems.
Systems are for speech recognition, not understanding. They model utterance production statistically, rather than through a semantic grammar. Phone-based standalone acoustic processor segments utterance; generates for each segment, through various estimates, the one best phone label and its start and end times. Speaker's phonetic performance is modeled on word base forms, and phonetic rules (e.g. coarticulations), plus rules that reflect the occasionally inaccurate idiosyncrasies of the acoustic processor. Each word can be represented as a finite state machine, with the base form pronunciations and the phonetic rules providing the states and arcs. A Language Model is used to provide a priori probabilities for all words (the "New Raleigh Language", generated from a finite state grammar and 250 words).
One system approach: expand the language definition with word states, generating one very large finite state machine, and use the "Viterbi algorithm" (dynamic programming) to find best sequence of phones. Problem: This also gives the pronunciation of the string, which is unnecessary. An alternative: best-first search through the grammar ("stack algorithm of sequential decoding"). Best-first beats dynamic programming, probably because of a bad model of the acoustic processor (i.e. incomplete rules modeling its behavior). [Bahl et al., 1976] L. R. Bahl, J. K. Baker, P. S. Cohen, N. R. Dixon, F. Jelinek, R. L. Mercer, and H. F. Silverman, "Preliminary Results of the Performance of a System for the Automatic Recognition of Continuous Speech," ICASSP, pp. 425-429, 1976.
Reports on the performance of the above systems.
The system is an acoustic processor plus decoders; analysis is split at the phoneme level. Back end uses either a dynamic programming model given speaker and front end statistics, or a "stack decoder" which uses best-first search through grammar. Performance reported; also results of ablation studies: phonological rules removed. Also, tried various forms of speaker training: for example, by training front end only, and not back end.
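Best-first search through a grammar, as in the "stack decoder", can be sketched with a priority queue of partial word sequences. This is a simplified illustration: the grammar, the cost table, and the assumption that each word fills exactly one slot are hypothetical, and real sequential decoders need a score that compares hypotheses of different lengths fairly.

```python
import heapq

# Hypothetical finite-state grammar: state -> [(word, next_state)].
GRAMMAR = {
    "START": [("pawn", "PIECE"), ("bishop", "PIECE")],
    "PIECE": [("to", "TO"), ("takes", "TO")],
    "TO": [("king", "END"), ("queen", "END")],
}

# Hypothetical acoustic costs (negative log likelihoods) per word slot.
COSTS = [
    {"pawn": 1.2, "bishop": 0.3},
    {"to": 0.4, "takes": 0.9},
    {"king": 1.5, "queen": 0.2},
]


def stack_decode(grammar, costs):
    """Always extend the cheapest partial hypothesis on the stack."""
    # Each entry: (accumulated cost, words so far, grammar state).
    stack = [(0.0, [], "START")]
    while stack:
        cost, words, state = heapq.heappop(stack)
        if len(words) == len(costs):
            if state == "END":
                return words, cost  # first complete hypothesis popped
            continue
        for word, nxt in grammar.get(state, []):
            step = costs[len(words)].get(word)
            if step is not None:
                heapq.heappush(stack, (cost + step, words + [word], nxt))
    return None, float("inf")


print(stack_decode(GRAMMAR, COSTS))
```

Because all costs in this sketch are non-negative, the first complete hypothesis popped is the cheapest one, and many poor partial hypotheses are never extended at all.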

Presents an early version of Hearsay I.
Model: small set of cooperating independent processes, plus hypothesize and test paradigm.
Parallel processes assumed necessary for real time response. Model is extensible and generalizable. Hearsay system modules include: speech input, speech output, task interface, and recognition subsystem (acoustics, syntax, semantics). Task is voice chess.
After a parametric level analysis and segmentation, the input is processed by 1) the acoustic recognizer (which has a hierarchy of increasingly accurate, but increasingly costly tests), and/or 2) the syntactic recognizer (based on a grammar describing legal chess moves; "antiproductions" predict words to right or left of acoustically probable "islands"), and/or 3) the semantic recognizer (based on the chess-playing program Tech, which ranks legal moves by utility). Synchronization sequence is: 1) poll all, 2) "best" module hypothesizes, 3) the rest test. Voice chess appears to have a dominant semantics component.
The system is planned to have a "knowledge acquisition system" to dynamically update knowledge sources when parsing fails. Model is somewhat like analysis by synthesis except that individual words, not full utterances, are checked against input.
Comments that highest level cognition is serial, but lowest (sensory) is parallel. [Reddy et al., 1973a] D. R. Reddy, L. D. Erman, R. D. Fennell, and R. B. Neely, "The Hearsay Speech Understanding System: An Example of the Recognition Process," IJCAI3, pp. 185-193, 1973. In speech, much knowledge is required. However, knowledge sources are errorful and incomplete, due to deficiencies in theory, implementation (e.g. heuristic search), or data. Knowledge sources cooperate via a universal data base called "blackboard". This problem solving model uses the hypothesize and test paradigm.
Each knowledge source is independent, and knows of no others. Knowledge sources are derived from a "natural" decomposition of all task knowledge. Each knowledge source is fired by the pattern-matching of its precondition with the blackboard, much like an asynchronous production system. It changes the blackboard according to its knowledge.
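The blackboard arrangement described in this entry, with independent knowledge sources fired by preconditions on shared data, can be sketched as below. This is a toy under stated assumptions: the two sources, their names, and the dictionary blackboard are hypothetical, and the loop is synchronous rather than asynchronous.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class KnowledgeSource:
    name: str
    precondition: Callable[[dict], bool]  # pattern matched on the blackboard
    action: Callable[[dict], None]        # change made to the blackboard


def run(blackboard, sources, max_cycles=10):
    """Fire every source whose precondition matches, until nothing fires."""
    log = []
    for _ in range(max_cycles):
        fired = False
        for ks in sources:
            if ks.precondition(blackboard):
                ks.action(blackboard)
                log.append(ks.name)
                fired = True
        if not fired:
            break
    return log


# Hypothetical sources: neither refers to the other, only to the blackboard.
lexical = KnowledgeSource(
    "lexical",
    lambda bb: "segments" in bb and "word" not in bb,
    lambda bb: bb.update(word="".join(bb["segments"])),
)
syntax = KnowledgeSource(
    "syntax",
    lambda bb: "word" in bb and "phrase" not in bb,
    lambda bb: bb.update(phrase=("piece", bb["word"])),
)

blackboard = {"segments": ["bi", "shop"]}
order = run(blackboard, [syntax, lexical])  # listed out of order on purpose
print(order, blackboard["phrase"])
```

The firing order is determined by the state of the blackboard, not by the order in which the sources are listed, which is what lets sources be added or removed without changing the others.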

Contains a critique of Hearsay I and presents Hearsay II as an answer to some of its problems.
Task is news retrieval; system is designed for a multiprocessor. Uses "multiple diverse sources of knowledge". Knowledge sources are analyzed along four dimensions: function (poll, hypothesize, test), structure (independent), cooperation (through global data base, the "blackboard"), and attention focusing.
Hearsay I had a global data base of partial sentence hypotheses composed of words, with word and sentence ratings. Hearsay I problems: 1) processing in word units only, 2) lockstep control, 3) hypotheses were not linked to each other, 4) policy is hardwired. Hearsay II answers: 1) three-dimensional data base, with nodes at each linguistic level, utterance time, and alternative parse. 2) Preconditions for firing a knowledge source: data is directed by "matching prototypes" and is event-driven, like a production system. 3) And-or graphs between hypotheses propagate scores. 4) There is an independent policy module. Hearsay II levels are: conceptual, phrasal, lexical, syllabic, surface-phonemic, phonetic, segmental, parametric.

Discussion of the philosophy and implementation of control in Hearsay II.
The goal is minimization of knowledge source invocations. However, explicit control would destroy the flexibility of blackboard model. Basic approach: Each knowledge source action is summarized into a production: stimulus frame -> response frame. All decisions are based on these summaries. Fundamental principles and mechanisms: 1) Best alternatives on blackboard are tried first.
2) More processing to knowledge source with more valid data.
3) More processing to knowledge source producing most significant changes. 4) Efficient knowledge sources favored. 5) Knowledge sources satisfying goals are preferred.
Variable called "state" at each time in utterance indicates the validity of hypotheses there; potential knowledge source contributions are measured against present "state". If no progress in an area of the utterance, then the knowledge source firing thresholds are lowered. Their output is also rated to be more credible than the uncertainty present in the area would normally warrant. Response to "state" can be breadth-first or depth-first.
"Optimal strategy is not known." If "state" does not change for "a while", less desirable actions are tried in locations other than areas of high "state": prevents "cognitive fixedness".
Other knowledge sources ("policy modules") can modify the desirability ratings of various actions (response frames) effecting top-down, left-right, hybrid, etc., searches. A recognition network is imposed as a filter on the blackboard for the detection of precondition satisfaction; it also records partial precondition information. Preconditions are governed by thresholds, which can vary over the utterance, allowing flexible attention focussing. All hypotheses are linked, and inherit "plausibility" ratings from their support.
A presentation of an early form of Speechlis, featuring a description of the system-building technique of "incremental simulation".
Need for higher level knowledge in speech: human spectrogram reading experiments indicate that a 25% error rate can be reduced to 4% when syntactic and semantic information are allowed to the interpreters. System is based on Lunar system discourse models. The knowledge gathered through incremental simulation includes the fact that "function" words are missed by acoustics, and must be proposed by syntax.
Speechlis has six components: acoustic-phonetic, phonological, lexical, syntax, semantics, and pragmatics. Control consists of selecting best "theories" (hypotheses), and the establishment and execution of demonic "monitors". General control flow: First, segment lattice fills word lattice with words consisting of three or more phonemes.
Each such word is given to semantics and becomes a theory. The priority-governed best-first search ensues.
Results so far: Use of "fuzzy" (with respect to time) word matches reduces theories.
Semantics can postulate semantic "clumps" (e.g. first person pronouns), also reducing theories. Pragmatics is in general too open-ended to use successfully, even though the speech signal has enough information to disambiguate otherwise confusing syntactic relations. Evaluations of the system are done with respect to human ("incremental") simulation. [Woods et al., 1976

Overviews a later version of the BBN system.
Travel budget manager is task. Data objects include 1) segment lattice (of phones, with probabilities, arranged chronologically), 2) theories (partial hypotheses of connected words), 3) monitors, notices, and events. Monitors are demons to watch for conditions in the word lattice; if conditions are met, notices are created and events (requests for further processing) are scheduled.
Based on a "pragmatic" grammar, which is topic-specific. A lexical retriever can predict the n best extensions to islands, and control is by island-driving. First, the segment lattice is scanned left-right and right-left, to minimize word boundary effects.
Then, the best words are found and put in the word lattice; each becomes a one word theory. The following is repeated until done: Syntax expands the "best" theory with words and/or word categories; lexical retrieval then replaces categories with words.
"Fuzzy word matches" collect several related uncertain matches into one, if they are close in time. Island-driving from acoustically certain words is better than a strict left-right scan, as unusual phonological events occur at beginning and end of utterances.

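The collection of related uncertain matches into a single "fuzzy word match" can be sketched as a grouping step over word hypotheses that are close in time. The closeness criterion is not given in the annotation, so the `tolerance` parameter, the frame units, and the choice to keep the best score per group are assumptions made for illustration.

```python
def merge_fuzzy(matches, tolerance=2):
    """Group matches of the same word whose start and end times each lie
    within `tolerance` frames of the group's first member.

    matches: iterable of (word, start_frame, end_frame, score).
    """
    groups = []
    for word, start, end, score in sorted(matches):
        for group in groups:
            if (group["word"] == word
                    and abs(group["start"] - start) <= tolerance
                    and abs(group["end"] - end) <= tolerance):
                group["members"].append((start, end, score))
                group["score"] = max(group["score"], score)
                break
        else:  # no existing group was close enough
            groups.append({
                "word": word, "start": start, "end": end,
                "score": score, "members": [(start, end, score)],
            })
    return groups


# Hypothetical word lattice entries.
MATCHES = [
    ("trip", 10, 25, 0.7),
    ("trip", 11, 26, 0.8),
    ("trip", 40, 55, 0.6),
    ("trim", 10, 25, 0.5),
]

for group in merge_fuzzy(MATCHES):
    print(group["word"], group["start"], group["end"], len(group["members"]))
```

Each group can then seed one theory instead of several near-duplicates, which is the reduction in theories that the annotation reports.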
Describes the design and performance of the control structures in the BBN system.
Linguistic levels in the system are acoustic-phonetic, phonological, lexical, syntactic, semantic, and pragmatic. Data objects include the acoustic segment lattice, and the word lattice. Other data objects are theories (hypotheses concerning the original utterance), word monitors (which eventually cause condition-specific processing), and proposals (direct requests from one module to another). Evaluation of theories depends on acoustic match, duration information, syntactic and semantic scores, but almost no pragmatics. Control is started by the initial word lattice fill, and followed by evaluating and extending theories in order of priority. Some problems: theory of "thrashing" (attention focussing) not good; incremental simulation suggested to investigate it. Also, the theory of scoring utterance theories is inadequate. [Bates, 1974] M. Bates, "The Use of Syntax in a Speech Understanding System," Erman, pp. 226-233, 1974; also Martin & Reddy, pp. 112-117, 1975.

Syntax
Outlines the syntax module in the BBN system, and describes some heuristics found necessary in its use.
Speech has "lexical ambiguity", that is, no clear word boundaries, no punctuation or capitalization clues. Also, small function words are unstressed, homonyms are confused ("see" versus "sea"), and word boundaries are lost ("tea meeting" versus "team eating").
Module uses an augmented transition network which hypothesizes basically top-down (but can also operate bottom-up). An initial bottom-up pass of the acoustic modules constructs a "word lattice" with words of three phonemes or more. By "island-driving", the augmented transition network creates "monitors" on the lattice to look for hypothesized words. A problem: combinatorial explosion, as hypothesization is breadth-first (all possible valid-neighboring locations in the augmented transition network are hypothesized). So, heuristics are used. One: scoring hypotheses and the use of threshold cutoff. Another: calling the semantics module for verification.
Describes the semantics module of the BBN system, borrowed from Lunar.
Shows need for semantics: humans attain 90% intelligibility only when no more than two words have been excised from an eight word utterance. Uses "lexical semantics".
Semantics most useful for "content" words (which are stressed). Word lattice is filled by the acoustic-phonetic, phonological, and lexical modules, initially with words with three or more phonemes only. Data structures include events, monitors, and theories (hypotheses).
Lunar semantic model: syntactic tree structure has restrictional templates; templates are referenced by their head noun or verb. Notes that semantic information is easier to retrieve in natural language systems. The semantic network contains multi-word nodes (allowing "horizontal" searches for related missing words), and relations between nodes (allowing "vertical" hypotheses). Relations in the network contain case frames, a type of semantic filter (e.g. the use of the word "ratio" requires that the two units be the same). Semantics hypothesizes new words, constructs theories, and evaluates using filters. Semantics also interacts with syntax, and translates the input sentence into the necessary procedures to execute the request. This latter illustrates one difference between recognition and understanding. B. Bruce, "Pragmatics in Speech Understanding," IJCAI4, pp. 461-467, 1975.

Outlines the task model and user model employed by the BBN system; has a rather natural language flavor.
Task is travel budget management; user and task models are employed. Intention of user (speaker): each speech act has presuppositions and desired outcomes.
Presuppositions can be used as a filter on possible parses. Such an "intent" has preconditions, a case structure for its verbs, a list of desired outcomes, and pointers to examples (i.e. it is a type of frame). Examples of intents: "confirm data item", "ask again". Basic suppositions of sincerity necessary for success of user model.
Intents forecast future intents; expectation links form a "mode of interaction" (somewhat like a script).
Modes have headers (preconditions) and a body of probabilistically linked intentions. Examples: "edit", "add" modes. Modes imply certain intents, which imply certain interpretations of speech. Thus, user and task model handle 1) expectations, 2) preference of parses, 3) actions to take (e.g. distinguishes between an "add data" intent and an implied "edit": "X is Y" adds Y to data base, unless data base has "X is Z".). [Walker, 1974] D. E. Walker, "The SRI Speech Understanding System," Erman, pp. 32-37, 1974.

An overview of an early version of the SRI system.
System is guided and controlled by parser. Task is repairing a leaky faucet. Parser is a best-first searcher; uses a case grammar for verbs. Grammar allows anaphora ("it"). A microworld model is incorporated in the semantic network. A discourse model allows abbreviated responses, in the context of a discourse ("What bolt?" "That one.").
Problems: Function words are unstressed, and words with liquids ("tool") are difficult. Acoustic, syntactic, semantic and discourse knowledge sources integrated by "language definition": procedural knowledge consisting of word-based phrase composition rules (system is phrase-based). Each possible linguistic phrase (e.g. "verb phrase", "noun") has several attributes (as in Lisp, with values; all knowledge sources can contribute them) and several "factors" (validity scores, from any knowledge source). Each phrase, when built, is immediately assigned attributes and scores. New phrases match a pattern of a part of language definition, which fires and evaluates.

Organization and Control
Language definition also incorporates a discourse module (e.g. one attribute of a phrase is "interpretation", which is the phrase's referent found by the discourse module.). Six levels of factors: "very good" to "out".
Executive consists of a parse net and an associated task queue. Priorities are partly a function of phrase "value": the maximum possible score, over all sentences possibly containing the phrase, given a heuristic search over existing "contexts" (other active phrases). Also partly dependent on attention focussing, which is designed to keep activity from stagnating in one place, and is biased towards complete interpretations.
Any partial results stored in parse net. [Paxton, 1976a] Claims this organization and these issues are applicable to natural language. Also claims that natural language is like speech in that 1) conjunction and comparatives create combinatorial explosion, and 2) ungrammaticality is like acoustic noise: some probabilistic method of choosing best interpretation is necessary. [Paxton, 1974] W. H. Paxton, "A Best-First Parser," Erman, pp. 218-225, 1974.

Syntax
A description, and some performance analysis, of the SRI syntactic component.
Parser has four stages: syntactic (selects a legal grammatical class), lexical (selects a word), verification, and interparse cooperation. In verification, the priorities for a given parse are set using all other levels of knowledge. For example, semantic case frame agreement, word alignments in time (penalizes for gaps or overlap), acoustic match. Admits that setting priorities is highly empirical. In interparse cooperation, common subphrases are identified; old parts are integrated, along with their priorities, into new theories. Usually these subparts are noun phrases.
Relative performance analysis: parser performance is compared to a lower bound which is established by restraining it to the correct parse path; actual performance in best-first mode is three times this limit. A change to depth-first takes ten times the lower limit. Also, there are studies with interparse cooperation toggled off and on. [Hendrix, 1975] G. G. Hendrix, "Expanding the Utility of Semantic Networks Through Partitioning," IJCAI4, pp. 115-121, 1975.

Semantics
A theoretical paper on semantic nets, which is applied, in part, to the SRI system.
Main problem with semantic nets is quantification and hypotheticality. Solution: Arcs and nodes are separated into "spaces"; each arc or node is in exactly one such space.
Each space has access only to itself and superset spaces: spaces thus can form lattices. Quantification (universal and its variants) is handled by quantifying individual elements within a semantic net subspace (the "form" of the propositions); quantified subspaces can be arbitrarily nested. This allows for the arbitrary mixing of universal and specific data. Partitioning also permits "want", "need", etc., to be distinguished from reality.
Real versus hypothetical worlds discriminated; even discriminates hypothetical worlds from each other. Used in SRI system to encode rules defining categories of objects (specifically, verb classes); this cuts down amount of information stored. Similar to use of "contexts" in some languages (say, Qlisp), but allows lattices, not just trees.
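The partitioning idea, in which each space sees only itself and its superset spaces, can be sketched as below. The `Space` class, the triples used as facts, and the example content are hypothetical; quantification and the internal structure of nodes and arcs are not modeled.

```python
class Space:
    """A partition of a semantic net with links to its superset spaces."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)  # several parents allow a lattice
        self.facts = set()

    def visible_facts(self):
        """Facts in this space plus everything in its superset spaces."""
        seen, facts, todo = set(), set(), [self]
        while todo:
            space = todo.pop()
            if id(space) in seen:
                continue
            seen.add(id(space))
            facts |= space.facts
            todo.extend(space.parents)
        return facts


# Hypothetical example: reality versus a wanted state of affairs.
real = Space("real")
real.facts.add(("faucet", "is", "leaky"))

wanted = Space("wanted", parents=[real])
wanted.facts.add(("faucet", "is", "repaired"))

print(sorted(real.visible_facts()))
print(sorted(wanted.visible_facts()))
```

The wanted space can refer to everything in reality, but reality is not contaminated by what is merely wanted, which is the separation the annotation describes.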

An analysis of speech pragmatics, including task and user models; has a natural language flavor.
Problems of discourse analysis: "How does speaker decide what to include? How does the expression of new and old information differ?" Outlines some design issues of the pragmatics component of the SRI system. There is much deictic ("pointing") information in the task environment, and much term definition. Hierarchy (actually lattice) of tasks implies a locality of reference within a subdialogue. Anaphora is resolved with respect to the task tree structure. Task hierarchy can be used in the anticipation of references. One unresolved problem: implicit closures of subdialogues (e.g. "I've got it" ends subtask). This information used to simulate the mapper (and hypothetically better versions of it) in later experiments. 2) Language branching factor determined empirically, with and without acoustic restraints: usually three false alarms are better rated than the hit. 3) Two simple systems tested: dynamic programming on acoustics only, and a context-free grammar only. Both fail. 4) All cases of four binary control parameters tested on 60 utterances: island-drive versus left-right parse, breadth- or best-first acoustic checking of a set of proposed words, context checking, and selective focusing. Tested for accuracy and time. 5) Interword gaps and overlaps allowable in acoustic processor altered and found critical, due to word juncture phenomena. 6) Test of an increased vocabulary and improved acoustics (simulated by reducing false alarms). Result: 7% improvement in false alarm rate allows 50% bigger vocabulary. Summary: Acoustics is the bottleneck. [Neuberg, 1975] E. P. Neuberg, "Philosophies of Speech Recognition,"
A criticism of speech understanding research methodology.
Claims that success is due to increased computer power, and that research biases are simply reflections of various systems' "friendliness". Affirms that quantitative evaluation of techniques is difficult, and that the "scoring" of a parse is not well defined. Concerning prosodics: There is agreement to use it, in theory; but few do.
A short review of the achievements of the five year ARPA project in speech understanding research.
Three key aspects of the five year endeavor: 1) Multiple types of knowledge were brought to bear (syntax, semantics, coarticulation, phonology). 2) Many technical and scientific advances. 3) Interdisciplinary group. "A great deal is known, from the study of acoustics, phonetics, and linguistics, about the encoding of speech. . . . The sources of difficulty in understanding connected speech by machine are in the main rather well understood." Reviews the five major research efforts, plus four minor ones, which resulted in four major systems. Harpy's success is due to the task-oriented grammar. Hearsay II had 91% semantic accuracy. Other two systems are less accurate, but use grammars that are less constrained. System-building techniques evolved. Linear predictive coefficients for the low end are now almost standard. Lists the 19 specifications of the original report, and gives Harpy's corresponding achievements of them. [Klatt 1977] D. H. Klatt, "Review of the ARPA Speech Understanding Project," J. Acoust. Soc. Am., Vol. 62, No. 6, pp. 1345-1366, December, 1977.
A review of the four completed systems, a summary of the scientific achievements of the project, and a forecast of possible future research.
Notes that the ARPA specifications did not require 1) tasks relevant to real-world problems, 2) "habitable" languages, 3) cost effectiveness.
Success came from simplifying the problem by using syntactic and semantic constraints; thus the project was less successful in contributing to speech science.
Harpy met or exceeded specifications.
The speech understanding problem: described by way of an example ("Did you hit it to Tom?") illustrating phonological difficulties, and a two-part paradigm of speech systems ("high end" and "low end"). The role of higher level knowledge is seen as that of constraint provision.
Speech understanding systems: four systems reviewed and discussed. Syntactic and semantic constraint can be measured by the average branching factor of the grammar. SDC: Low end is syllable based; high end is best-first left-right scan of words. High end is sensitive to low-end errors. Discussion: Unclear why system failed. Possibly due to syntax's dependence on (usually unstressed) function words. BBN: Low end produces a speech segment lattice, which can easily represent phonetic ambiguity. High end is island-driven, using an augmented transition network grammar with semantic constraints; search is thus best-first. System includes semantic procedures to produce an audio response. Discussion: Syntax is more general than other systems. Theoretical potentials unachieved, however; not enough optimization, perhaps.
Hearsay II: Organization is central blackboard with asynchronous knowledge sources (both low and high end). Word verification module is based on Harpy; system control is through island-driving. Discussion: Second best to Harpy, perhaps because of overall design. This included no absolute rejection of hypotheses, the optimization of components, and a grammar with the smallest branching factor.
Harpy: High end is an (acoustic) state network of all possible paths through a grammar, including word-juncture phenomena. Low end is based on linear predictive spectral match using Itakura metric. Search is heuristically-modified dynamic programming.
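The kind of search summarized above (dynamic programming through a state network, with heuristic pruning of weak paths) can be illustrated with a minimal sketch. This is not Harpy's implementation. The function `beam_search`, the toy `network`, the 0/1 `match_cost` (standing in for a spectral distance such as the Itakura metric), the one-state-per-frame simplification, and the `beam_width` threshold are all assumptions made for illustration.

```python
def beam_search(network, start, finals, frames, local_cost, beam_width):
    """Frame-by-frame search that keeps only near-best partial paths.

    network:    dict mapping a state to the list of its successor states
    frames:     sequence of observations, one per time step
    local_cost: function (state, frame) -> mismatch cost, lower is better
    beam_width: paths costing more than best + beam_width are dropped
    Returns (cost, path) for the cheapest surviving path ending in a
    final state, or None if no such path survives.
    """
    # active maps each surviving state to (accumulated cost, path so far)
    active = {start: (0.0, [start])}
    for frame in frames:
        candidates = {}
        for state, (cost, path) in active.items():
            for nxt in network.get(state, []):
                new_cost = cost + local_cost(nxt, frame)
                # Dynamic programming step: keep the cheapest way into nxt
                if nxt not in candidates or new_cost < candidates[nxt][0]:
                    candidates[nxt] = (new_cost, path + [nxt])
        if not candidates:
            return None
        best = min(cost for cost, _ in candidates.values())
        # Heuristic pruning: discard paths too far behind the current best
        active = {s: cp for s, cp in candidates.items()
                  if cp[0] <= best + beam_width}
    complete = [cp for s, cp in active.items() if s in finals]
    return min(complete, key=lambda cp: cp[0]) if complete else None


# Hypothetical toy network: each state is a word, one frame per word.
network = {
    "<s>": ["did", "tell"],
    "did": ["you"],
    "tell": ["me"],
    "you": ["hit"],
    "me": ["hit"],
    "hit": [],
}


def match_cost(state, frame):
    # Stand-in for an acoustic distance: 0 on a match, 1 otherwise.
    return 0.0 if state == frame else 1.0


result = beam_search(network, "<s>", {"hit"}, ["did", "you", "hit"],
                     match_cost, beam_width=1.0)
print(result)
```

A narrower beam explores fewer states per frame but risks pruning the path that would have turned out best; a sparse network limits how many candidates exist to begin with.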
Discussion: Success appears due to the network structure, the optimization of the network and the spectral templates, and strong syntactic restraints.
"Harpy is essentially a verification strategy." The sparse network (i.e. grammar) appears more critical than low-end accuracy (only 40%). Notes that CMU had a variable branching-factor grammar, which was a powerful performance analysis aid.
Discussion and conclusions: Notes scientific advances in twelve broad categories.
1) System organization: Harpy's "beam search" and the Hearsay II blackboard.
2) Grammar design: CMU's variable branching-factor grammar; the use of branching factor, rather than vocabulary size, as a measure of complexity.
3) Control strategies: left-to-right is best only when function words are handled well.
4) Semantics and content: semantic grammars predominate.
5) Syntax: augmented transition networks are probably best for complex grammars.
6) Word verification: "Formal rules of considerable predictive power have been developed."
7) Acoustic-phonetic processing: Harpy shows that phonetic segmentation and labeling is not necessary.
8) Use of statistics: usually, it is impossible to get a large enough sample set.
9) Acoustic analysis: linear predictive coefficients or filter banks are both satisfactory.
10) Talker-normalization: Harpy's is automatic.
11) Response generation: which emphasizes understanding over recognition.
12) Contributions to speech science: includes the observation that some of the structures of speech understanding systems may be good models for human sentence comprehension.
A proposed future system: a Harpy-like low end, with an augmented transition network high end. Performance, however, would depend critically on "missing pieces" of speech science (e.g. a diphone dictionary). Cites the relationship of such a system to psychological models of speech perception. The proposed system makes four novel conjectures, including the human use of precompiled networks; but also leaves several questions unanswered.
Future research: Low-end: Key is the transforming of the phonetic identification problem into a spectral identification problem, as with Harpy. High-end: What is needed is realistic semantic constraints, and better human engineering. Other hard problems include increasing the grammar branching factor, distinguishing words that are more acoustically similar, and accurate function word recognition.
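Branching factor recurs throughout this review as the measure of how strongly a grammar constrains recognition. The review's exact definition is not reproduced here; the sketch below assumes one simple reading, the mean number of word choices available at each non-final point in a finite-state word network. The function name and both toy grammars are hypothetical.

```python
def average_branching_factor(network):
    """Mean number of successors over states that have any successor.

    network: dict mapping a state to the list of words that may follow it.
    This is a static average over states, not weighted by how often each
    state is visited in real utterances (an assumption of this sketch).
    """
    fanouts = [len(successors) for successors in network.values() if successors]
    if not fanouts:
        return 0.0
    return sum(fanouts) / len(fanouts)


# A tightly constrained toy grammar: exactly one choice at every point.
tight = {"<s>": ["did"], "did": ["you"], "you": ["hit"], "hit": []}

# A looser toy grammar: several choices early in the utterance.
loose = {
    "<s>": ["did", "tell", "show"],
    "did": ["you", "he"],
    "tell": ["me"],
    "show": ["me"],
    "you": [],
    "he": [],
    "me": [],
}

print(average_branching_factor(tight))
print(average_branching_factor(loose))
```

Under this reading, two grammars over the same vocabulary can differ widely in difficulty, which is why the review treats branching factor, rather than vocabulary size, as the measure of complexity.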
Four appendices are included that detail the SDC, BBN, Hearsay II, and Harpy systems. 4.