
[Thagard, 1989] Thagard, P. (1989). Explanatory coherence. Behavioral and Brain Sciences, 12(3).
[Vayra and Fowler, 1992] Vayra, M. and Fowler, C. A. (1992). Declination of supralaryngeal gestures in spoken Italian. Phonetica, 49:48–60.

[Venugopala and Reggia, 1988] Venugopala, R. D. and Reggia, J. A. (1988). Parsimonious covering as a method for natural language interfaces to expert systems. Technical Report TR-2037, University of Maryland.
[Waibel, 1990] Waibel, A. (1990). Prosodic Knowledge Sources for Word Hypothesization in a Continuous Speech Recognition System, pages 534–537. Morgan Kaufmann Publishers.
[Waltz, 1975] Waltz, D. (1975). Understanding line-drawings of scenes with shadows, pages 19–91. McGraw-Hill.
[Westbury and Fujimura, 1989] Westbury, J. and Fujimura, O. (1989). An articulatory characterization of contrastive emphasis in correcting answers. Journal of the Acoustical Society of America, 85(1).
[Wolf and Woods, 1980] Wolf, J. and Woods, W. (1980). The HWIM Speech Understanding System, chapter 14. Prentice-Hall Signal Processing Series.
[Zadrozny, 1991] Zadrozny, W. (1991). Perception as abduction: some parallels with NLU. In Towards Domain-Independent Strategies for Abduction, pages 92–97.
[Zwicker, 1961] Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands. Journal of the Acoustical Society of America, 33.

[Reddy, 1976] Reddy, D. R. (1976). Speech recognition by machine: A review. Proceedings of the IEEE, 64(4).
[Reggia, 1983] Reggia, J. (1983). Diagnostic expert systems based on a set covering model. International Journal of Man-Machine Studies, 19:437–460.
[Reggia, 1985a] Reggia, J. (1985a). Abductive inference. In Karna, K. N., editor, Proceedings of The Expert Systems in Government Symposium, pages 484–489. IEEE Computer Society Press.
[Reggia, 1985b] Reggia, J. (1985b). Virtual lateral inhibition in an activation model of associative memory. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 244–248.
[Reiter, 1987] Reiter, R. (1987). A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57–95.
[Rock, 1983] Rock, I. (1983). The Logic of Perception. MIT Press.
[Sacerdoti, 1974] Sacerdoti, E. D. (1974). Planning in a hierarchy of abstraction spaces. Artificial Intelligence, 5:115–135.
[Samuel, 1963] Samuel, A. L. (1963). Some studies in machine learning using the game of checkers. Computers and Thought, pages 71–105.
[Sembugamoorthy and Chandrasekaran, 1986] Sembugamoorthy, V. and Chandrasekaran, B. (1986). Functional representation of devices and compilation of diagnostic problem-solving systems. Experience, Memory, and Reasoning, pages 47–73.
[Shortliffe, 1976] Shortliffe, E. H. (1976). Computer-based Medical Consultation: MYCIN. American Elsevier, New York.
[Smith et al., 1985] Smith, Jr., J., Svirbely, J. R., Evans, C. A., Strohm, P., Josephson, J. R., and Tanner, M. C. (1985). Red: A red-cell antibody identification expert module. Journal of Medical Systems, 9(3):121–138.
[Smith, 1985] Smith, Jr., J. W. (1985). RED: A Classificatory and Abductive Expert System. PhD thesis, Ohio State University.
[Smith et al., 1988] Smith, Jr., J. W. et al. (1988). PATHEX/LIVER: Integrating Generic Tasks for Morphological Diagnosis. Technical report, The Ohio State University. In preparation.
[Tanner et al., 1991] Tanner, M. C., Fox, R. K., Josephson, J. R., and Goel, A. K. (1991). On a strategy for abductive assembly. In Towards Domain-Independent Strategies for Abduction, pages 80–83.
[Thagard, 1988] Thagard, P. (1988). Computational Philosophy of Science. Bradford Books / MIT Press.

[Peng, 1986] Peng, Y. (1986). A Formalization of Parsimonious Covering and Probabilistic Reasoning in Abductive Diagnostic Inference. PhD thesis, University of Maryland at Baltimore County.
[Peng and Reggia, 1990] Peng, Y. and Reggia, J. A. (1990). Abductive Inference Models for Diagnostic Problem-Solving. Springer-Verlag.
[Pierrehumbert, 1979] Pierrehumbert, J. (1979). The perception of fundamental frequency declination. Journal of the Acoustical Society of America, 66.
[Pople, 1973] Pople, H. (1973). On the mechanization of abductive logic. In Proceedings of the Third International Joint Conference on Artificial Intelligence, pages 147–152.
[Pople, 1977a] Pople, H. (1977a). The formation of composite hypotheses in diagnostic problem solving: An exercise in synthetic reasoning. In Proceedings of IJCAI 5, pages 1030–1037.
[Pople, 1982a] Pople, H. (1982a). Heuristic Methods for Imposing Structure on Ill-structured Problems: The Structure of Medical Diagnosis, volume 51 of AAAS Selected Symposia Series, pages 119–190. Westview Press.
[Pople, 1977b] Pople, H. E. (1977b). The formation of composite hypotheses in diagnostic problem solving: An exercise in synthetic reasoning. In Proceedings of the Fifth International Joint Conference on Artificial Intelligence, pages 1030–1037.
[Pople, 1982b] Pople, H. E. (1982b). Heuristic methods for imposing structure on ill-structured problems. In Szolovits, P., editor, Artificial Intelligence in Medicine, pages 119–190. Westview Press.
[Punch, 1989] Punch, W. F. (1989). A Diagnosis System Using a Task Integrated Problem Solver Architecture (TIPS), Including Causal Reasoning. PhD thesis, The Ohio State University.
[Punch et al., 1986] Punch, W. F., Tanner, M. C., and Josephson, J. R. (1986). Design considerations for Peirce, a high-level language for hypothesis assembly. In Karna, K. N., Parsaye, K., and Silverman, B. G., editors, Proceedings of the Expert Systems in Government Symposium, pages 279–281. IEEE Computer Society Press.
[Rabiner, 1988] Rabiner, L. R. (1988). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, pages 267–296. Morgan Kaufmann Publishers.
[Randolph, 1989] Randolph, M. A. (1989). Syllable-based Constraints on Properties of English Sounds. PhD thesis, Massachusetts Institute of Technology.
[Rassner, 1987] Rassner, R. (1987). Data analysis manual. Technical report, University of Wisconsin Speech Motor Control Laboratories X-Ray Microbeam Facility.

[Lee, 1988] Lee, K.-F. (1988). Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System. CMU.
[Liberman and Mattingly, 1985] Liberman, A. M. and Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21:1–36.
[Lin, 1991] Lin, D. (1991). Obvious abduction. In Towards Domain-Independent Strategies for Abduction, pages 51–58.
[Lowerre and Reddy, 1980] Lowerre, B. and Reddy, R. (1980). The Harpy speech understanding system. In Trends in Speech Recognition.
[Marquis, 1991] Marquis, P. (1991). Towards data interpretation by deduction and abduction. In Towards Domain-Independent Strategies for Abduction, pages 59–63.
[Marr, 1981] Marr, D. (1981). Artificial Intelligence: A Personal View, pages 129–142. The MIT Press. Also appears in Artificial Intelligence 9(1):47–48, 1977.
[Marr, 1982a] Marr, D. (1982a). Vision. W. H. Freeman and Co., San Francisco.
[Marr, 1982b] Marr, D. (1982b). Vision. W. H. Freeman and Company.
[Nelson, 1979] Nelson, W. L. (1979). Automatic alignment of phonetic transcription with articulatory events from x-ray data of continuous speech utterances. Speech Communication Papers, pages 63–66.
[Newell, 1990] Newell, A. (1990). Unified Theories of Cognition. Harvard University Press, Cambridge, Mass.
[Ng and Mooney, 1989] Ng, H. T. and Mooney, R. J. (1989). Occam's razor isn't sharp enough: The importance of coherence in abductive explanation. In 2nd AAAI Workshop on Plan Recognition.
[Patten, 1988] Patten, T. (1988). Systemic Text Generation as Problem Solving. Cambridge University Press, New York.
[Patten et al., 1992] Patten, T., Geis, M. L., and Becker, B. D. (1992). Toward a theory of compilation for natural language generation. Computational Intelligence, 8(1).
[Patterson, 1987] Patterson, R. D. (1987). A pulse ribbon model of monaural phase perception. Journal of the Acoustical Society of America, pages 1560–1586.
[Patterson, 1988] Patterson, R. D. (1988). Timbre cues in monaural phase perception. In Basic Issues in Hearing.
[Pearl, 1987] Pearl, J. (1987). Distributed revision of composite beliefs. Artificial Intelligence, 33(2):173–215.
[Peirce, 1955] Peirce, C. S. (1955). Abduction and induction. In Buchler, J., editor, Philosophical Writings of Peirce, chapter 11, pages 150–156. Dover.

[Hobbs et al., 1988] Hobbs, J. R., Stickel, M., Martin, P., and Edwards, D. (1988). Interpretation as abduction. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics. Artificial Intelligence Center, SRI International.
[Josephson, 1987] Josephson, J. (1987). A framework for situation assessment: Using best-explanation reasoning to infer plans from behavior. In Proceedings of Expert Systems Workshop, pages 76–85.
[Josephson and Fox, 1991] Josephson, J. and Fox, R. (1991). Peirce-IGTT: A Domain-Independent Problem Solver for Abductive Assembly. Technical report, The Ohio State University.
[Josephson and Josephson, 1993] Josephson, J. and Josephson, S., editors (1993). Abductive Inference: Computation, Philosophy, Technology. Cambridge University Press. Forthcoming.
[Josephson, 1982] Josephson, J. R. (1982). Explanation and Induction. PhD thesis, The Ohio State University.
[Josephson, 1988] Josephson, J. R. (1988). Towards a generic architecture for layered interpretation tasks: implications for plan recognition. Technical report, The Ohio State University.
[Josephson, 1990] Josephson, J. R. (1990). On the 'logical form' of abduction. In AAAI Spring Symposium Series: Automated Abduction, pages 140–144.
[Josephson et al., 1984] Josephson, J. R., Chandrasekaran, B., and Smith, Jr., J. (1984). Assembling the best explanation. In Proceedings of the IEEE Workshop on Principles of Knowledge-Based Systems, pages 185–190, Denver, Colorado. IEEE Computer Society. A revised version by the same title is available as a technical report.
[Josephson et al., 1987] Josephson, J. R., Chandrasekaran, B., Smith, Jr., J., and Tanner, M. C. (1987). A mechanism for forming composite explanatory hypotheses. IEEE Transactions on Systems, Man and Cybernetics, Special Issue on Causal and Strategic Aspects of Diagnostic Reasoning, SMC-17(3):445–454.
[Josephson and Goel, 1991] Josephson, J. R. and Goel, A. K. (1991). Practical abduction. Technical report, The Ohio State University.
[Josephson et al., 1985] Josephson, J. R., Tanner, M. C., Smith, Jr., J., Svirbely, J., and Straum, P. (1985). Red: Integrating generic tasks to identify red-cell antibodies. In Karna, K. N., editor, Proceedings of The Expert Systems in Government Symposium, pages 524–531. IEEE Computer Society Press.
[Keuneke, 1988] Keuneke, A. M. (1988). Machine Understanding of Devices: Causal Explanations of Diagnostic Conclusions. PhD thesis, The Ohio State University.
[Klatt, 1990] Klatt, D. H. (1990). Review of the ARPA Speech Understanding Project, pages 554–575. Morgan Kaufmann Publishers.

[Fox et al., 1992b] Fox, R., Josephson, J., and Trusko, B. (1992b). Automated theory decision making: A case of evolution vs. creationism. Technical report, The Ohio State University.
[Friedland, 1979] Friedland, P. E. (1979). Knowledge-based Experiment Design in Molecular Genetics. PhD thesis, Stanford University.
[Fujimura, 1986] Fujimura, O. (1986). Relative invariance of articulatory movements: An iceberg model. In Perkell, J. S. and Klatt, D. H., editors, Invariance and Variability in Speech Processes. Lawrence Erlbaum Associates, Hillsdale, N.J.
[Fujimura, 1991] Fujimura, O. (1991). Methods and goals of speech production research. Language and Speech.
[Fujimura, 1992] Fujimura, O. (1992). Phonology and phonetics: a syllable-based model of articulatory organization. Journal of the Acoustical Society of Japan.
[Fujimura et al., 1991] Fujimura, O., Erickson, D., and Wilhelms, R. (1991). Prosodic effects on articulatory gestures: a model of temporal organization. In Proceedings of the International Congress of Phonetic Sciences.
[Fujimura and Lovins, 1978] Fujimura, O. and Lovins, J. B. (1978). Syllables as concatenative phonetic units. Syllables and Segments, pages 107–120.
[Fujimura and Wilhelms, 1991] Fujimura, O. and Wilhelms, R. (1991). Time functions for elemental articulatory events. Journal of the Acoustical Society of America, ASA Meeting.
[Glass and Zue, 1988] Glass, J. R. and Zue, V. W. (1988). Multi-level acoustic segmentation of continuous speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. IEEE.
[Greenewald, 1990] Greenewald, J. H. (1990). Automatic Annotation of Phonetic Events Using Articulatory and Acoustic Data. PhD thesis, The Ohio State University.
[Gregory, 1987] Gregory, R. L. (1987). Perception as hypotheses. In Gregory, R. L., editor, The Oxford Companion to the Mind, pages 608–611. Oxford University Press.
[Halliday, 1985] Halliday, M. (1985). Language as Social Semiotic. Edward Arnold, London.
[Harman, 1965] Harman, G. (1965). The inference to the best explanation. Philosophical Review, LXXIV:88–95.
[Harrold and Eve, 1987] Harrold, F. B. and Eve, R. A., editors (1987). Cult Archaeology and Creationism: Understanding Pseudoscientific Beliefs about the Past. University of Iowa Press.

[Chandrasekaran, 1988] Chandrasekaran, B. (1988). Generic tasks as building blocks for knowledge-based systems: The diagnosis and routine design examples. Technical report, The Ohio State University.
[Chandrasekaran, 1991] Chandrasekaran, B. (1991). Models versus rules, deep versus compiled, content versus form. IEEE Expert, April:75–79.
[Charniak, 1986] Charniak, E. (1986). A neat theory of marker passing. In Proceedings of AAAI-86, volume 1, pages 584–588. AAAI, Morgan Kaufmann.
[Charniak and McDermott, 1985] Charniak, E. and McDermott, D. (1985). Introduction to Artificial Intelligence. Addison-Wesley.
[Chow et al., 1987] Chow, Y. L., Dunham, M. O., Kimball, O. A., Krasner, M. A., Kubala, G. F., Makhoul, J., Roucos, S., and Schwartz, R. M. (1987). BYBLOS: The BBN continuous speech recognition system. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 89–92.
[Dasigi, 1988] Dasigi, V. R. (1988). Word Sense Disambiguation in Descriptive Text Interpretation: A Dual-Route Parsimonious Covering Model. PhD thesis, University of Maryland, College Park.
[Davis and Mermelstein, 1980] Davis, S. B. and Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28:357–366.
[de Kleer, 1986] de Kleer, J. (1986). Reasoning about multiple faults. AI Magazine, 7(3):132–139.
[Erman, 1980] Erman, L. (1980). The Hearsay-II speech understanding system: A tutorial. In Lea, W. A., editor, Trends in Speech Recognition. Prentice-Hall Signal Processing Series.
[Fischer, 1991] Fischer, O. (1991). Cognitively Plausible Heuristics to Tackle the Computational Complexity of Abductive Reasoning. PhD thesis, The Ohio State University.
[Fowler, 1986] Fowler, C. A. (1986). Current perspectives on language and speech production: A critical overview. In Daniloff, R., editor, Speech Science: Recent Advances. College-Hill Press.
[Fox and Josephson, 1991] Fox, R. and Josephson, J. (1991). CV Experiment and Results. Technical report, The Ohio State University.
[Fox et al., 1992a] Fox, R., Josephson, J., and Lenzo, K. (1992a). An abductive articulatory recognition system. Technical report, The Ohio State University.
[Fox et al., 1991] Fox, R., Josephson, J., and Thadani, S. (1991). A three-level abduction machine for word recognition from articulation. Technical report, The Ohio State University.

[Ahuja, 1985] Ahuja, S. (1985). An Artificial Intelligence Environment for the Analysis and Classification of Errors in Discrete Sequential Processes. PhD thesis, University of Maryland at Baltimore County.
[Allemang et al., 1987] Allemang, D., Tanner, M. C., Bylander, T., and Josephson, J. R. (1987). On the computational complexity of hypothesis assembly. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence, Milan.
[Allemang, 1990] Allemang, D. T. (1990). Understanding Programs as Devices. PhD thesis, The Ohio State University.
[Bahl et al., 1983] Bahl, L. R., Jelinek, F., and Mercer, R. (1983). A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):179–190.
[Berra, 1990] Berra, T. (1990). Evolution and the Myth of Creationism: A Basic Guide to the Facts in the Evolution Debate. Stanford University Press.
[Browman and Goldstein, 1986] Browman, C. P. and Goldstein, L. (1986). Towards an articulatory phonology. Phonology Yearbook 3.
[Browman and Goldstein, 1988] Browman, C. P. and Goldstein, L. (1988). Tiers in articulatory phonology with some implications for casual speech. Laboratory Phonology I: Between the Grammar and the Physics of Speech.
[Bylander, 1991] Bylander, T. (1991). A tractable partial solution to the monotonic abduction problem. In Towards Domain-Independent Strategies for Abduction.
[Calistri, 1989] Calistri, R. J. (1989). Plan recognition in the presence of user misconceptions. In 2nd AAAI Workshop on Plan Recognition.
[Carver and Lesser, 1991] Carver, N. and Lesser, V. (1991). Blackboard-based sensor interpretation using a symbolic model of the sources of uncertainty in abductive inferences. In Towards Domain-Independent Strategies for Abduction, pages 18–24.
[Chandrasekaran, 1986] Chandrasekaran, B. (1986). Generic tasks in knowledge-based reasoning: High-level building blocks for expert system design. IEEE Expert, pages 23–30. Fall 1986.
[Chandrasekaran, 1987] Chandrasekaran, B. (1987). Towards a functional architecture for intelligence based on generic information processing tasks. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence.

lead us to believe that large systems are possible and reasonable. Future work lies in further researching such problems and in implementing larger systems within the same framework as the layered abduction strategy presented here.

As shown in this document, layered abduction is a useful means of data interpretation. The articulatory recognition task can be thought of as just that: data interpretation. Given the Microbeam pellet data, one must infer the causes of those motions. Many other problems involve similar data interpretation tasks, such as red blood cell antibody identification from test reactions, character identification, plan recognition, and vehicle identification from sensor values [Josephson et al., 1984, Marquis, 1991, Lin, 1991, Carver and Lesser, 1991]. Further, there is a great need for systems which can take disparate sources of input, together with large but different bodies of knowledge, and form coherent explanations (or conclusions) for the input. Speech recognition is one such problem; visual understanding, story comprehension, large-scale diagnosis, and theory formation are others. We feel that with a layered abduction strategy, as discussed in chapter 3 and demonstrated in chapter 4, we can build problem solvers for any such data interpretation problem.

Abstractly, Peirce is a tool for constructing agents which generate explanations for the causes of their input. Our future research will revolve around implementing large systems using Peirce. We will continue to examine layered abduction and improve on Peirce. Eventually, we will (hopefully) construct a large-scale speech recognition system using layered abduction and articulation.

All of these tasks are extremely useful for an abductive reasoner. Functional Representation seems a natural way to encode much of the knowledge that a system might need. Figure 37 shows a functional model used as a module which interacts with an abducer.

Figure 37: Functional Model with an Abducer

Our work on legal reasoning is very limited. We would like to implement a different legal case, one which would make use of all of the evidence presented and would generate and score hypotheses related to the case. A legal case lasting several months cannot be easily captured in a small knowledge-based system. Our Peyer example is very primitive and covers only a portion of the knowledge needed for legal reasoning. More research into the legal process and into particular cases is necessary to implement a large legal reasoning system. The work we have put into these two systems is not substantial but enough to

accepted hypothesis was a weak best, we could slightly weaken the plausibilities of incompatible hypotheses.
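This graded treatment of incompatibles can be sketched as follows. The tier names and weakening factors below are illustrative assumptions for the sketch, not part of Peirce itself.

```python
# Sketch of graded incompatibility handling: instead of always ruling out
# hypotheses incompatible with an accepted one, weaken them in proportion
# to our confidence in the acceptance.  Factors are hypothetical.
WEAKEN = {
    "essential": 0.0,    # rule out entirely
    "clear-best": 0.25,  # greatly weaken
    "weak-best": 0.9,    # slightly weaken
}

def adjust_incompatibles(accepted_tier, incompatibles):
    """Scale the plausibility of each incompatible hypothesis by a factor
    that reflects how confidently the rival hypothesis was accepted."""
    factor = WEAKEN[accepted_tier]
    for hyp in incompatibles:
        hyp["plausibility"] *= factor
    return incompatibles
```

With this scheme, ruling out becomes the limiting case (factor 0.0) of a more general plausibility adjustment, so a tenuous acceptance only tenuously suppresses its rivals.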
- Provide a more flexible control for layered abduction. As it stands now, there is no built-in mechanism for controlling multiple abducers in Peirce. Yet we have ideas about how abducers should interact with each other. In past and current layered abductive systems, the control between the multiple abducers must be provided at the time the systems are constructed. If an opportunistic control strategy can be provided (such as in [Punch, 1989]), more time can be spent on encoding the proper knowledge rather than on adjusting the processing between abducers. Punch [Punch et al., 1986] has developed an opportunistic abducer, also called Peirce2, which displays some of the flexibility that we would like out of a layered abduction model.
8.6 Future Work


The work presented here is based on using layered abduction to solve perceptual problems, but there are many other types of problems that can make use of layered abduction. My own research will continue the pursuit of layered abduction: I would like to continue to examine various abductive problems and recast them in a layered abductive framework. The perceptual research will continue as ARTREC is expanded along the various dimensions discussed earlier. ARTREC will also continue to be used in conjunction with the C/D model so that we can continue to aid in Fujimura's research.

I suggest here some further research for layered abduction in non-perceptual areas. We have implemented a small system to decide between two theories, evolution and creationism. This system uses static findings and hypotheses, and determines which theory has more explanatory power. There are several dimensions along which this decision-making system could be made more realistic. For one, none of the hypotheses are rated in terms of plausibility; such knowledge is not easily available. Another problem is that the system takes a shallow look at both theories. A more realistic implementation would include much deeper knowledge about the theories and what they can explain. One possible means of deeper representation is to use Functional Representation as a way of storing knowledge [Sembugamoorthy and Chandrasekaran, 1986]. A functional representation can be used to evoke hypotheses, determine the plausibility of hypotheses, and create expectations [Keuneke, 1988, Allemang, 1990, Punch, 1989].
2 Punch's Peirce was constructed here at the LAIR while he was working on his dissertation. His version of Peirce is similar in many ways to our Peirce; however, our Peirce was written in Common Lisp as part of the Integrated Generic Task Toolset. Many of the same ideas behind Punch's Peirce can be found in our Peirce. The greatest exception is the flexible control, which was not part of our Peirce.

8.5 Expansions to Peirce

Layered abduction is a mechanism which can be applied to numerous problems, and Peirce is a flexible tool that creates a rich programming environment. Even so, we envision some enhancements to the Peirce tool, including (but not limited to) the following:
- A greater number of levels to produce conclusions during the hypothesis composition (or assembly) cycle. Currently Peirce has confirmeds, essentials, clear bests, and weak bests. Essentials are those hypotheses which seem to be the only plausible means of explaining some datum. Essentials are an idealization; alternative explanations can always be found, no matter how implausible. Clear bests are superior explainers, and weak bests are mildly better explainers. However, we could have many in-between levels. Essentials might be rated as absolute (if there are no other explainers at all), reasonable (if all alternative explainers have been ruled out or rated low), or merely potential (if the search for alternative explainers was not exhaustive). Clear bests might come in different grades such as incredibly-clear-best, very-clear-best, mildly-clear-best, and marginally-clear-best. What is envisioned is a cycle in which the criterion for "bestness" is decremented by some preset amount on each pass through the algorithm. At first, only the most superior hypotheses are used. Then, if needed, we resort to less superior, but still very good, hypotheses, and so forth until, if necessary, we resort to using weak best hypotheses. The preset "decrement" amount, and the tradeoff between desired coverage and desired accuracy, can be set by the system builder or the system user either at system construction time or at run time.

- Changing the propagation of incompatibilities. In our current system, if a hypothesis is incompatible with a believed hypothesis, it is ruled out. However, if that believed hypothesis was a weak best, should we rule out the incompatible hypothesis? What if our decision to accept the weak best was erroneous? In ruling out an incompatible, the propagations of this action might be far-reaching, such as new essentials or clear bests being accepted into the composite. If the original choice to accept a hypothesis is tenuous (and this would be true in accepting a weak best), then any actions taken as a result of the acceptance should also be tenuous. Therefore, when we accept a hypothesis, we should not simply rule out incompatible hypotheses; we should adjust the plausibilities of the incompatible hypotheses depending upon how much confidence we have in the accepted hypothesis. If the accepted hypothesis was essential, we can rule out incompatible hypotheses. If the accepted hypothesis was a clear best, we could greatly weaken the plausibilities of incompatible hypotheses. If the
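The assembly cycle envisioned in the first enhancement above, where the criterion for "bestness" is relaxed by a preset decrement on each pass, might look roughly like this sketch. The margin values, the scoring scheme, and the data layout are all assumptions made for illustration, not Peirce's actual implementation.

```python
def assemble(hypotheses, data, margin=0.5, decrement=0.1, floor=0.0):
    """Greedy abductive assembly sketch: on each pass, accept any
    hypothesis whose plausibility exceeds its best rival (for some
    still-unexplained datum) by at least `margin`; when a pass makes
    no progress, relax `margin` by `decrement` and try again.
    Each hypothesis is a (name, plausibility, explains-set) triple."""
    composite, unexplained = [], set(data)
    while unexplained and margin >= floor:
        progress = False
        for name, plaus, explains in hypotheses:
            covered = explains & unexplained
            if not covered:
                continue
            rivals = [p for n, p, e in hypotheses
                      if n != name and e & covered]
            best_rival = max(rivals, default=0.0)
            if plaus - best_rival >= margin:   # a clear enough "best"
                composite.append(name)
                unexplained -= covered
                progress = True
        if not progress:
            margin -= decrement   # relax the bestness criterion
    return composite, unexplained
```

At a high margin only strongly superior hypotheses are accepted; as the margin decays, progressively weaker bests are admitted, trading accuracy for coverage exactly as described above.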

ARTREC, and using simple mathematical manipulations of prestored gesture templates, have ARTREC adapt these feature templates into more accurate versions. This process of adapting feature templates has thus far been accomplished by hand. However, we envision simple methods which will allow ARTREC to do this itself. Once this is accomplished, ARTREC will be able to teach itself new gesture forms with only a modest amount of guidance from the system designers. This learning method would be similar, in some ways, to neural network or HMM training. However, we would make the features explicit, which would allow us to better correct ARTREC when it errs. Also, explicit features are necessary for Peirce. We can gain the advantage of training without losing the flexibility or power of the abductive algorithm (unlike HMMs, which must use a dynamic programming algorithm guided by some form of beam search, or neural networks, which have no explicit algorithm at all).
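One simple mathematical manipulation of the kind alluded to above would be to nudge a stored template toward newly observed instances. This running-average sketch is our own illustration (the function name and rate parameter are assumed), not ARTREC's actual adaptation procedure.

```python
def adapt_template(template, observed, rate=0.2):
    """Move each sample of a stored gesture template a fraction `rate`
    of the way toward a newly observed gesture trajectory, so that
    repeated exposures gradually refine the template."""
    return [t + rate * (o - t) for t, o in zip(template, observed)]
```

Because the adapted template remains an explicit feature description, the system designer can still inspect and hand-correct it when ARTREC errs, which is the advantage claimed over opaque HMM or neural-network training.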
- Syntactic knowledge is highly useful in speech recognition. Using a grammar, we are able to rule out many words without giving them much consideration. While the original "Pine Street" data for ARTREC was simplistic and did not require the use of any grammar, future data might require this extra knowledge. We envision constructing an abductive parser to accomplish this. Research into abductive parsing has been done by [Dasigi, 1988, Venugopala and Reggia, 1988], and we are interested in using Peirce to implement such a parser. Thus, a syntactic level will be placed on top of the word level. This should be a fairly easy addition to ARTREC.

- Adding an acoustic signal to ARTREC is also envisioned. Eventually, we would like to construct a full-fledged speech recognition system in which an articulatory level is one part. To advance our research, we plan on adding an acoustic level to ARTREC. This level will take an acoustic signal as input and infer possible articulatory knowledge. While we will still make use of pellet data, it is hoped that at some point in the expansion of ARTREC the pellet data will no longer be needed. This last expansion is the most ambitious and certainly the most difficult.
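The grammar-based pruning described in the syntactic-knowledge item above could be as simple as the following sketch; the bigram grammar and all names here are invented for illustration, not a claim about the envisioned abductive parser.

```python
def grammar_filter(word_sequences, allowed_bigrams):
    """Rule out word-sequence hypotheses containing an adjacent word
    pair the grammar does not license, before any deeper abductive
    consideration is given to them."""
    def licensed(seq):
        return all((a, b) in allowed_bigrams
                   for a, b in zip(seq, seq[1:]))
    return [seq for seq in word_sequences if licensed(seq)]
```

A syntactic level built this way discards many candidate word strings cheaply, leaving the abducers to score only grammatically plausible hypotheses.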

We feel that the results from ARTREC are encouraging enough to continue examining articulatory recognition. Articulatory recognition is an important, and typically missing, element of speech recognition. The initial enhancements (more words and gestures, more abductive levels) will demonstrate that large-scale articulatory recognition is feasible. We have already shown layered abduction to be an efficient and powerful means of implementing articulatory recognition.

should top-down processing come into play during the problem solving? These questions show that I have only scratched the surface in this research. I have high hopes, and we are very encouraged by the success of ARTREC thus far. But I have many ideas which I have yet to implement. These ideas are expansions to both ARTREC and the Peirce tool.
8.4 Future Expansions of ARTREC

There are many dimensions along which ARTREC is to be expanded, and this work has already commenced. We have acquired new pellet data from the Madison Microbeam facility. This data is similar to the "Pine Street" data in that it consists of numbers and street names. However, we intentionally constructed the new data set to include most of the types of articulatory gestures found in the English language, although the gestures in the new data are not used in all possible conjunctions with each other. The lexicon size for the new data is around 65 words, about a quarter of which are polysyllabic. We hope to determine whether our method for finding syllable boundaries will work as well on polysyllabic words.

This new data also features several forms of non-linguistic effects in the pronunciation of speech. Besides emphasis, we have evoked other emotional responses such as boredom, humor, and annoyance. While we might not be able to identify the exact form of the effect, we can detect that such an effect has arisen and use this to determine how the effect will alter the pronunciation. We will continue to pursue the effects of prosody on articulation.

The first expansion is to make ARTREC run on the new data, which requires two changes. First, ARTREC must be able to handle new gesture types and new words. This means enlarging the gesture and syllable knowledge to include the new forms. The second change allows ARTREC to use polysyllabic words by adding a new level of problem solving, separating the syllable level from the word level. This new level will take strings of syllables and attempt to infer, using a lexicon, what word or words could have caused those syllable strings. ARTREC can also use the lexicon to infer word boundaries if they are not clear from the mandible motion. We know already that layered abduction is a useful means of deriving word boundaries from strings of phonetic units [Fox et al., 1991] without any knowledge other than a lexicon1.
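The proposed syllable-to-word level (inferring from a lexicon which words could have caused a string of syllables, and thereby proposing word boundaries) could be sketched as follows. The lexicon contents and function names are invented for illustration.

```python
def word_hypotheses(syllables, lexicon):
    """Generate word hypotheses that explain contiguous runs of a
    syllable string; each hypothesis records the span of syllables it
    would explain, implicitly proposing word boundaries at its edges."""
    hyps = []
    for start in range(len(syllables)):
        for end in range(start + 1, len(syllables) + 1):
            chunk = tuple(syllables[start:end])
            for word in lexicon.get(chunk, []):
                hyps.append({"word": word, "explains": (start, end)})
    return hyps
```

An abducer at this level would then assemble a composite of word hypotheses whose spans jointly cover the syllable string, which is exactly the explanatory-coverage pattern used at the lower levels.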
We feel that adding a new level will be relatively straightforward. We have several further enhancements in mind for ARTREC.
- Automated learning of gestures. This is a process whereby ARTREC will derive its own gesture templates from input data. We will present examples to


will require some means of determining word boundaries. Currently, ARTREC uses mandible motion to infer syllable boundaries. There is no reason to assume that this measure will be successful in determining all syllable boundaries or any word boundaries.

not need to rely on the "coarticulation" rules used in past systems based on linear phonology. Prosody could be more directly represented. While that research never reached fruition, the C/D model reinspired us. Our ideas for using articulation for speech recognition could be revived. There are three lines of motivation behind our research. We want to construct a layered abduction speech recognition system. A second motivation is to center our approach on articulatory knowledge, appealing to articulation in order to solve some of the problems found in speech recognition. Our third motivation is to help Fujimura in his creation of the C/D model. We hope to implement a reverse mapping of the C/D model by using layered abduction. These goals have not yet been reached. However, ARTREC was a good start in the right direction, having accomplished the initial step towards all three goals. We have learned a lot during the research on ARTREC and we hope to continue to have success as ARTREC grows. As far as contributions go, this work has shown the following. First, I have constructed a layered abduction system which works! I have demonstrated the usefulness of our abductive strategy in a series of experiments (both in speech recognition and outside of it). I have demonstrated the power of explanation as opposed to simple pattern matching. Next, I have constructed a system for articulatory recognition and shown its feasibility. I have argued (hopefully in a convincing manner) how this system could be used for a full-fledged speech recognition system. Finally, I have shown how ARTREC can be used to help in the creation and debugging of Fujimura's new theory of speech production.
8.3 Limitations of ARTREC

ARTREC is a modest attempt at an articulatory speech recognition system. It is a far cry from what is needed to solve speech recognition. As pointed out throughout this document, ARTREC makes no attempt whatsoever to solve the acoustic speech recognition problem. I have only suggested ways to alter or combine ARTREC with other mechanisms for the purpose of speech recognition (in chapter 7.2). While I believe in these suggestions, I have not proven them. This single limitation is staggering. I cannot prove the assumption that perception is abductive without first implementing a full speech recognition system with layered abduction. Further, I have only implemented a modest-sized layered abductive system. A three-level system using two abducers is not very complex. I have yet to answer questions pertaining to the overall control of a large layered abductive system. Can we bypass the control problems that Hearsay had to confront? Can we implement a layered abductive system in parallel? Can we implement the individual abducers so that they contribute to the overall solution rather than constantly offsetting each other? I have shown that layered abduction is feasible. In the fourth ARTREC experiment, I showed how top-down processing can improve overall system performance. But when

CHAPTER VIII

Future Directions and Conclusion



This dissertation has examined a problem solving task called layered abduction, and a particular method or strategy for solving this task, incorporated into a problem solving tool called Peirce. It is my thesis that layered abduction can be used to solve most (or all) types of interpretation tasks, and in particular, layered abduction can be used to solve the problem of speech recognition. This chapter reexamines the work presented in this dissertation. I will discuss the contributions of this work. I will discuss the limitations of the ARTREC system and discuss possible future enhancements. I will also entertain some possible expansions for the Peirce tool. Finally, I will give a future research agenda for advancing abduction and layered abduction in Artificial Intelligence.
8.2 The Intention of this Research

We originally began this research in hopes of constructing a full-blown layered abduction speech recognition system (see figure 3 in chapter 1). This research dates back over three years, to when a proposal was granted a small amount of initial research money. The results of that research (documented in [Fox and Josephson, 1991, Fox et al., 1991]) led to two prototype speech recognition systems. One of these prototypes was a consonant-vowel isolated syllable recognizer. It used a feature-based abductive approach to recognize the syllables. The other prototype was a small multilayered articulatory recognition system which took articulatory gestures as input and inferred words. See chapter 3.4.2 for more details on these systems. The research was far from complete when we had to terminate the project. However, we continued to pursue the idea of a layered abductive speech recognition system. We then met Professor Osamu Fujimura, who was attempting to define a new theory of speech production, one based on non-linear phonology. His new model, the Convertor/Distributor model, was to give new life to our own research. One novel idea involved in our original research was to use explicit articulatory knowledge. By appealing to such knowledge, a speech system would

any need for making use of less compiled knowledge in order to make sense of the ambiguities. In conclusion, then, I have described a very efficient method for solving the problem of abductive inference. This method is equally suitable for solving single-layered and multi-layered abductive problems. I have given a plausibility argument that most inferential problems are solved by abduction, and I have indicated, by argument and example, how a single method for abduction can be used to solve problems ranging from theory formation to perception. It seems reasonable to suggest that this highly efficient (and in principle, parallelizable) method could play a significant role in human speech recognition. It seems rather certain that this highly efficient method can play a significant role in automated speech recognition, as shown in this document.
7.7 Conclusion

In this chapter I have discussed the merits of using articulatory knowledge for speech recognition. In particular, I have discussed how ARTREC might be used in large acoustic speech recognition systems. I have argued that ARTREC is scalable to a much larger set of articulatory gestures and a larger lexicon. I have discussed in detail some problems with Hidden Markov Models. HMMs are not panaceas for speech recognition. They are useful ways to encode hard-to-obtain knowledge by using training. However, this training technique could also be used to obtain the knowledge in different formats. Whether or not to use probabilities is not really the concern. The concern is that current speech recognition research using HMMs is obscuring the problems associated with current linguistic theory behind the benefits of using a "learnable" mechanism which can find features for itself. Training has many uses. However, explicit knowledge of linguistics is also necessary to solve speech recognition problems. HMMs demonstrate the usefulness of training but obscure the need for feature knowledge. Without explicit feature-based knowledge, we lose the power of explanation (i.e., abduction). Without abduction, we lose many of the advantages for increasing the accuracy of speech recognition and many other types of problems. I have concluded this chapter with a brief restatement of my belief in the usefulness of layered abduction. As has been shown, layered abduction can be used to solve a variety of problem types. Peirce itself can be used to construct knowledge-based systems to solve these (inferential, explanatory) problems. In the next, and final, chapter, I will review the contributions of the work presented here. I will discuss limitations of our approach. I will also offer some possible extensions to the work.

understanding physics, simulation, and so forth, so that one can trace through a model in order to infer the consequences. This form of knowledge is used when the more compiled forms of knowledge are unavailable. For example, in theory formation, one might need to derive an entire new theory from first principles. A beginning diagnostician will need to form diagnostic conclusions from first-principles knowledge because the diagnostician has yet to compile his or her experience into more highly compiled knowledge. It seems reasonable to say that knowledge begins at a first-principles level, and that the more we use the knowledge, the more compiled it becomes. Diagnosis may occur at any of the four levels described above depending upon the experience of the diagnostician. Perception can similarly occur using knowledge at any of these levels; however, as time goes on in an individual's lifetime, he or she is more and more likely to use highly compiled knowledge for perceptual problems. And here is where the friction arises in my argument. I am endorsing a generic task solution of perception (by using layered abduction) and yet most of the perceptual knowledge seems to lie at the partial pattern matching or table look-up levels. Here is how this tension can be relieved. First, abduction can use highly compiled knowledge. Knowledge of explanatory coverage, evocation and instantiation of hypotheses, and hypothesis interactions can all be very highly compiled. This is, in fact, most likely, considering how much we use perception. And when hypothesis evocation brings forth a single hypothesis to explain some datum, then the abductive explanation is made very simply, by accepting the hypothesis (as an essential). In such a case, there is no need to make use of a full-blown abduction algorithm; we simply accept the hypothesis. When such a situation does not arise (that is, when more than a single hypothesis is suggested by evocation) then a decision is required.
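The essentials-first behavior just described can be caricatured in a few lines: accept at once any hypothesis that is the sole explainer of some datum, and set aside only the data with rival explainers for deeper deliberation. The datum and hypothesis names below are invented for illustration; this is not the Peirce implementation.

```python
# Minimal sketch of "essentials first" abduction.  `explainers` maps each datum
# to the set of hypotheses that could explain it (names are invented).

def explain(data, explainers):
    """Accept sole explainers immediately; return (accepted, still-ambiguous data)."""
    accepted, ambiguous = set(), []
    for datum in data:
        candidates = explainers[datum]
        if len(candidates) == 1:          # an essential: no rival explainer
            accepted |= candidates
        else:
            ambiguous.append(datum)
    # Second pass: an accepted essential may already cover an ambiguous datum.
    still_open = [d for d in ambiguous if not (explainers[d] & accepted)]
    return accepted, still_open

explainers = {
    "formant-dip": {"jaw-lowering"},                      # only one explainer
    "f0-rise":     {"emphasis", "question-intonation"},   # rivals: deliberate
    "long-vowel":  {"emphasis", "slow-speech"},
}
accepted, open_data = explain(explainers.keys(), explainers)
```

Here "jaw-lowering" is accepted without deliberation, while the two ambiguous data are left for a deeper strategy, mirroring the quick-versus-deliberative split described above.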
Therefore, the Peirce strategy, as described in chapter 3, is still sufficient for solving perceptual problems. The strategy will come to a very quick decision when a single hypothesis is plausible (which seems to be the case quite often in perception) and deeper strategies take over when more than a single hypothesis is deemed plausible. We can propose a method to solve a problem by using very efficient generic tasks while still using more highly compiled knowledge. If anything, by considering the perceptual tasks as abductive, we can determine the types of knowledge needed to solve the task (i.e., we require explanatory knowledge, knowledge of hypothesis interactions, etc.). And if the knowledge comes in a highly compiled form, abduction can still make use of it. Given some input to the auditory tract (i.e., the ears), we attempt to determine what was said. We do so by hypothesization, and by combining individual hypotheses (say, of phonemes, syllables or words). This is exactly what abduction accomplishes. The need for an explanation does not require any amount of deliberation; in fact, when highly compiled and efficient knowledge is used, the abductive mechanism will form a composite hypothesis very rapidly. Only when ambiguities arise will there be

In chapter 4, I showed a particular use for the strategy. This strategy, incorporated in the tool Peirce, allows for opportunistic, island-building problem solving. It is capable of solving nearly all problems which make use of explanatory knowledge, even in the face of uncertainty, noisy data, and a lack of clear plausibility values for hypotheses. I feel that Peirce is a tool which captures a very generic form of problem solving, and as such, Peirce is highly useful for constructing many types of problem solvers. The research that I have discussed has involved highly different forms of problem solving. There is still a tension in my argument that layered abduction can be used to solve perceptual problems. The tension lies in the rapidity with which humans can process speech (or vision) versus the seemingly deliberative aspect of abductive explanation. It does seem reasonable to suggest that a process of forming an explanation need not involve any deliberation. In perception, we as perceivers have a large amount of compiled knowledge. In [Patten et al., 1992, Chandrasekaran, 1991] it is suggested that problem solving knowledge is compiled across a wide spectrum of types which can be categorized into at least four distinct classes: table look-up, pattern matching, generic tasks and first principles. Table look-up is the most highly compiled form of knowledge, where we simply provide input (in the form of findings or stimuli) and our minds look up (in some large table of knowledge) what the solution is. This form of knowledge may be thought of as a chunk [Newell, 1990] of knowledge, or as a hard-wired piece of knowledge which is nearly instantly retrieved. It seems reasonable that some amount of perceptual processing uses such highly compiled knowledge. At this level, there is no need for deliberation of any kind. The next category is at the level of partial pattern matching.
In the past, partial pattern matching has taken the form of production systems or rule-based systems [Shortliffe, 1976]. At this level, input is provided to the problem solver, and small steps are taken in order to determine an answer. Again, little or no deliberation takes place. The next category is the Generic Task level [Chandrasekaran, 1987, Chandrasekaran, 1988, Chandrasekaran, 1986]. Generic tasks are at a level of problem solving where one subdivides the problem into subproblems, each of which can be solved by some generic method. Examples of generic tasks include hypothesis matching (which is a subtask of hypothesis instantiation as described in chapter 3), abduction, and hierarchical classification. These task-level solutions require particular types of knowledge (such as parent-child relationships in classification, features of interest in hypothesis matching, and explanatory knowledge in abduction). As these tasks require more complex steps, some generic tasks may occur at a more deliberative level of processing. The last category is first principles [Sembugamoorthy and Chandrasekaran, 1986, Reiter, 1987, Keuneke, 1988, Allemang, 1990, de Kleer, 1986]. At this level, knowledge is found in the form of models, and reasoning over this knowledge requires

can use this knowledge to aid the selection of explanatory choices in a feature-based approach to abduction. Rather than driving HMMs with a dynamic programming algorithm, create feature hypotheses and use the probabilities stored in the HMMs as some of the available knowledge. The layered abduction strategy can use probabilities just as easily as "confidence" values. Our abductive strategy takes advantage of a notion which is ignored by dynamic programming. When a hypothesis is much better than any other hypothesis for explaining some datum, we feel secure in accepting it. If alternative explainers are almost as good (or plausible), we would feel less secure in accepting a hypothesis. Dynamic programming does not consider this. Dynamic programming will find the most probable path through the search space no matter how close the "next best" path might be. Therefore, HMM systems have no means of stating how much certainty we could place in a conclusion formed by dynamic programming. This, to us, is unacceptable. HMMs are able to capture knowledge by training which might otherwise be unobtainable. This is the strength of HMMs: their ability to learn easily. We can use this learning method to store the knowledge needed in a feature-based abduction system. Thus, we combine the strengths of HMMs with the strengths of an abduction strategy.
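As a rough illustration of this combination, suppose HMM-derived probabilities serve as plausibility scores, and a hypothesis is accepted only when it clearly beats its best rival. The phoneme labels, scores and the margin parameter below are invented for the example; this is not any existing system's decision rule.

```python
# Sketch: accept the best-scoring hypothesis only if it is clearly better than
# the runner-up; otherwise defer the decision.  Contrast this with plain
# dynamic programming, which always commits to the argmax path no matter how
# close the second-best alternative is.

def accept_if_clear(scores, margin=2.0):
    """scores: hypothesis -> probability.  Accept the best hypothesis only if
    it is at least `margin` times as probable as the runner-up; else defer."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, p1), (_, p2) = ranked[0], ranked[1]
    return best if p1 >= margin * p2 else None

print(accept_if_clear({"/t/": 0.8, "/d/": 0.15, "/k/": 0.05}))   # clear winner: accepted
print(accept_if_clear({"/t/": 0.45, "/d/": 0.40, "/k/": 0.15}))  # too close: deferred
```

Deferred data could then be revisited with additional knowledge (expectations from accepted islands), which is exactly the leverage a pure argmax search gives up.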
7.6 Layered Abduction is the Proper Way to Solve Interpretation Tasks

This is quite a bold statement. But consider this: in reasoning about the world, we attempt to come to conclusions about the observations and findings that we face. Our whole perspective of the world around us is driven by perception. If perception is truly an inferential process, then what better way to solve such problems than by inference to the best explanation? I have argued in chapters 1 and 3 that layered abduction is a feasible and useful approach for solving cascaded inference problems. I have given a general indication of several types of problems, such as diagnosis, theory formation, story comprehension and perception. And even these problem types only scratch the surface. Others have used abduction to solve all types of identification problems (medical, mechanical, sensor, concept, etc.). Therefore, when a problem calls for inference, when a problem calls for coming to a conclusion, then abduction is a method capable of supplying a conclusion. When a problem calls for multiple sources of input (for instance, sensors or visual or auditory input, or a combination of these) and when a problem calls for disparate knowledge types, then layered abduction should be considered. In chapter 3, I discussed a particular strategy for solving abductive problems.

you inherently model linear phonology. It seems unreasonable to choose an implementation method (HMMs) which presupposes a theory of linguistics (linear phonology). HMMs make a commitment which, under ordinary implementations, does not need to be made. Moreover, HMMs are not capable (at least in a straightforward manner) of modeling non-linear phonology because HMMs cannot model features. That is, non-linear phonology describes units in terms of distinctive features. Sounds are made up of vocal tract features, not of phonetic units. HMMs are not capable of representing features of units, only of representing units themselves. Therefore, in choosing HMMs, a researcher is forced to model some form of phonetic unit. In the past, the choices have been to model phonemes or triphones. And this results in modeling a weaker theory: linear phonology rather than non-linear phonology. I should state here that HMMs only encode knowledge, and that knowledge can be used in an abductive system. However, systems such as Sphinx and Byblos are at best impoverished abductive systems. They are capable of explanation, but only by appealing to a very limited set of knowledge and by using a very weak method of explanation. All of the knowledge in the HMM systems is collapsed into probabilities, making it difficult to distinguish the types of knowledge needed in abduction (such as differentiating between explanatory knowledge, hypothesis interaction knowledge, pragmatic concerns of the importance of data, and so forth). The only method used to drive these HMM systems is a dynamic programming algorithm using some form of beam search, a far cry from the more sophisticated strategies discussed in chapter 2 (e.g., the strategies of Pearl, Reggia and Peng, Pople and Hobbs) or the Peirce strategy discussed in chapter 3. I have stated that HMMs capture knowledge only in the form of probabilities. HMMs collapse all knowledge into a single form.
This is done so that the efficiency and optimality of dynamic programming can be used to drive the probabilities. It is also done because training HMMs is possible. However, in doing so, many things are lost. Among the losses is the ability to use other, cleverer strategies to seek the best explanation. Also lost is the ability to directly incorporate other forms of knowledge. And again, HMMs make it very awkward to use disparate forms of knowledge; only by indirect means is this possible. One solution to this problem is to use layered HMMs in much the same way as our layered abduction. However, in layered abduction, we can use both upward-flowing and downward-flowing information to build on islands of certainty. It is unclear (at best) how HMMs could use any form of downward-flowing information or how islands of certainty could be used as leverage for further problem solving. These advantages of an abductive strategy are completely lost in the HMM paradigm. There is a way to use HMMs which can take advantage of hypothesis interactions, higher-level knowledge, and "intelligent" problem solving. This is to use the probabilities stored in HMMs as expectation knowledge and plausibilities. However, instead of using this knowledge in a dynamic programming beam searching algorithm, we

processing, as it did not allow the lower levels to query higher levels for additional help. It was strictly a one-way flow of communication. As Byblos shows, additional knowledge can be used by HMMs, but only in indirect ways. There is no way to use additional knowledge to affect the transition probabilities within models. The additional knowledge can only be used between models, in effect, to aid the beam search process. By using HMMs, a system restricts the problem solving so that additional knowledge cannot be brought directly to bear. This is very unlike our layered abduction method, where layers can directly interact by propagating expectations and other forms of information. Another criticism of HMMs pertains to training them. In training HMMs, an algorithm is used to update transition probabilities within a model based on how accurately that model matches the data in the training set. For example, if we are training a phoneme model on the sound /t/, and we have the phoneme /t/ in a training sentence, we adapt the /t/ model so that it is more able to match or recognize the /t/ sound. However, training sets are usually biased in some form. The bias can come from a limited set of sentences, a limited vocabulary, a limited usage of the phonemes (or other phonetic units), a limited amount of emphasis and stress, and so forth. The result is a biased set of HMMs. This problem of biased training sets is particularly relevant in the case of Sphinx and the Bigrams used there. The Bigrams are created from the training sets. Thus, if a speaker were to utter a series of words that did not occur at all in the training set, then Sphinx would have much greater difficulty in correctly identifying the utterance. There seems to be no way to avoid such bias in the knowledge of a speech system. One particular problem that we envision as the result of such bias would arise due to the effects of prosody, and in particular, stress and emphasis.
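The bigram bias noted above can be made concrete with a toy model: a bigram estimated from a small corpus assigns zero probability to any word pair it never saw, so a perfectly valid but unseen utterance is scored as impossible unless the counts are smoothed. The tiny corpus below is invented for the example and has nothing to do with Sphinx's actual training data.

```python
# Toy maximum-likelihood bigram model, to illustrate training-set bias:
# unseen word pairs get probability zero.

from collections import Counter

corpus = [("pine", "street"), ("sixty", "pine"), ("pine", "street")]
counts = Counter(corpus)                        # bigram counts
first_counts = Counter(w1 for w1, _ in corpus)  # unigram counts of first words

def bigram_prob(w1, w2):
    """P(w2 | w1) estimated by relative frequency; 0 for unseen contexts."""
    if first_counts[w1] == 0:
        return 0.0
    return counts[(w1, w2)] / first_counts[w1]

print(bigram_prob("pine", "street"))   # pair seen in training
print(bigram_prob("street", "pine"))   # valid pair, never seen: scored impossible
```

Smoothing techniques patch the zeros but cannot add the knowledge the corpus never contained, which is the substance of the bias complaint above.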
It is apparent from past speech recognition research that stress and emphasis are important concepts which should be used for accurate word identification. If a training set for HMMs contains "typical" pronunciations of words without stress, then the HMMs will be based on the "typical" pronunciation. If someone uses stress in their speech, an HMM speech recognition system would have difficulty in accurately identifying those stressed words. One approach would be to store both stressed and unstressed word models in such a speech recognition system, but this might only increase the confusion of the system. It is hard to tell, based simply on probability transitions, when a word is stressed. Further, stress and emphasis occur not as a binary decision (i.e., there is stress or there is not) but rather as a continuum from highly stressed to unstressed words. Thus, multiple sets of HMMs would be necessary to capture all variations of stressed and unstressed words. Another problem with HMMs is not with the models themselves, but with what they model. I have argued, along with others [Fujimura, 1992, Browman and Goldstein, 1986], that linear phonology is the wrong way to consider speech. Most past attempts at using HMMs have modeled phonemes. By modeling phonemes,

notions of hypothesis interactions are encoded within the probabilities themselves. We cannot say "this hypothesis becomes more likely because that hypothesis has been accepted". The only form of expectation lies in the transition probabilities between one word model and the next (as in, this model is the most probable one to follow from that model). And there is no explicit notion of incompatibilities. Expectation and incompatibility knowledge can be stored implicitly in the transition probabilities, but this knowledge cannot be used by mechanisms such as the hypothesis interactions shown in Peirce. Hard decisions are impossible to delay or avoid in HMMs. In our problem solving, we may not wish to come to a conclusion over some datum because we want to order more tests, or give more deliberate thought to it, or even just avoid the problem (and we are all good at this form of procrastination if the problem is hard enough). HMMs do not even acknowledge hard decisions. A hard decision is the same as an easy decision: simply find the most probable path. A dynamic programming algorithm will not stop to consider islands of certainty. It will not decide that an alternative solution is "nearly as plausible". Further, there is no way of knowing how much confidence we should place in an answer generated from an HMM system. We could continue along these lines, but it should be apparent that HMMs using dynamic programming search techniques fail to make use of any of the abductive strategies discussed in chapter 3. Second (and related to the first), since all knowledge is encoded in the form of probabilities, HMMs are not easily combined with other forms of knowledge. Speech recognition requires bringing to bear many forms of knowledge, from acoustic and articulation knowledge to knowledge of syntax, semantics and pragmatics. It seems rather difficult to capture many of these types of knowledge as probability transitions.
Therefore, this knowledge is not directly available for a system which uses HMMs. Sphinx could not use explicit syntactic knowledge. Instead, it forced the grammar into a probabilistic framework. This is unreasonable as it assumes the availability of grammar probabilities. Sphinx used Bigrams to capture the probability of one word following another. This form of knowledge is only useful in limited contexts. Sphinx could not use any higher level knowledge (semantics, pragmatics or discourse). While knowledge of syntax, semantics and pragmatics might be capturable in HMMs, it is not a natural way to store or use such knowledge. Byblos was able to use some higher level knowledge (syntactic knowledge) by encoding it into a more explicit, feature-based form. However, Byblos's grammatical knowledge and lexical knowledge were very loosely coupled. The only means of sharing knowledge between the two units was to use the grammar to prune away incorrect words. The problem with this is that the grammar could only make suggestions and could only determine when a word was not syntactically correct. Thus, the word space had to be searched for possible next words and only then did the grammatical module play a role by ruling some next words out. The grammar was limited in

- Some of the motions in the articulatory data can be ignored during the recognition task. Therefore, some of the motions in speech production are unintentionally produced. The theory must take into account how these occur. Many presumably occur because of various non-linguistic factors.

The growth of the C/D model and the growth of ARTREC are intertwined, as each can learn from the other. At the moment, both the C/D model and ARTREC are in their infancies. When the C/D model begins to produce specific impulse response functions for articulation, then ARTREC will be able to more closely match the C/D model. Currently, the two models are somewhat independent, as the specific values needed for ARTREC are not yet available from the C/D model. ARTREC is also lacking in depth, stopping at the actuator motions rather than the musculature. However, in the future, we expect that the C/D model will require ARTREC as an experimental implementation to determine where C/D goes wrong.
7.5 The Problems with Hidden Markov Models and how Abduction can Save the Day

Hidden Markov Models have been used to solve a large portion of the speech recognition problem. They have been shown to be very successful, whereas "feature-based" knowledge has led to failure. Still, there are several problems in using HMMs. These problems all exist because HMMs require the linguistic knowledge (of phonemes and words) to be represented in the form of probabilities and in no other form. First, all of the linguistic knowledge encoded within an HMM is implicit. There is no sense of "knowing" what knowledge exists in an HMM aside from probabilities. The explanatory knowledge is only capable of explaining segments (i.e., divisions within the acoustic utterance) and is incapable of explaining features that appear in the acoustic signal. Without making the explanatory knowledge more explicit, there is no easy way of using an algorithm like Peirce to drive the HMM knowledge1. The loss of any explicit knowledge makes HMMs unusable for most problem solving regimes (such as heuristic approaches, task approaches, and so forth). Instead, the only means of using HMMs lies in some form of search where the search criterion is simply the "most probable" path. This may not seem to be a flaw to those who use HMMs. But if we consider the many ideas that go into our abductive strategy, we can see that most of those are unusable by HMMs. There is clearly no notion of essentials and of leveraging essentials in further problem solving. The only
1 Peirce could be used as a control mechanism for search through HMMs. However, Peirce attempts to explain features or individual data rather than a segment of speech. Therefore, Peirce would not generate an explanation for each feature detected in the speech signal but rather would use chosen phoneme models to explain segmented regions of the signal.

recognition knowledge. The types of gestures needed by the C/D model can be dictated by ARTREC's needs. The timings and magnitudes of pulses can be determined empirically by seeing where ARTREC works and where ARTREC fails. In short, we can use ARTREC, or a system like ARTREC, to determine where the specifics of the C/D model fail and where they work. We have already begun by determining some of the needed gestures. We have determined what articulatory motions are important for gestures. We have determined, to some extent, how corrective emphasis will alter the motions of the articulators (in this case, just the jaw). Further research is needed to flesh out the intricacies of the C/D model. However, the two models, C/D and ARTREC (or variations of ARTREC), can grow symbiotically. Where ARTREC requires knowledge, we turn to the C/D model for guidance. Where ARTREC fails, we turn to the C/D model for corrections. Where the C/D model requires specifics, we turn to ARTREC to implement them and see how ARTREC reacts. Among the initial types of information that ARTREC has been able to determine for the C/D model are the following:
- Articulatory declination is salient in determining emphasis motions in the jaw. Articulatory declination was also noticed (less substantially) in the less enunciated words.

- Alveolar closure which occurs at the final position of a phrase includes some additional tongue blade motion.

- Alveolar closure motion differs when preceded by an /s/ sound (for instance, the first /t/ in "street" differs from the /t/ in "two" or the final /t/ in "street"). The difference lies in the amount of tongue tip height. We have found through empirical study that /str/ requires less height for the /t/ sound than a /t/ occurring at the beginning or end of a word.

- Tongue blade motions are less important than tongue tip and tongue dorsum motions (at least in our data). ARTREC achieved equal (or even better) accuracy on data which did not contain a tongue blade pellet, indicating that the ARTREC features which require tongue blade motions are not as important as other features.

- Extreme motions seem to be more "important" than the steady motions between the extremes (the transitions). It seems reasonable to assume that we can explain the transitions in terms of the extreme motions (that is, by saying that transitions are the motions which bridge the utterances being pronounced). For the Pine Street data, we found that we could ignore the transitional motions and explain only the extreme motions.


Figure 36: A Layered Abduction Description of the C/D Model Inverted

that articulator, the shape or the type of motion needed to carry out the sound. The magnitude of the pulse (to the distributor) is used as a parameter to each articulator's impulse response function. This parameter dictates the amount of muscle activity, which will yield the adequate amount of motion for that articulator. The result is parallel channels of motion in the vocal tract, each operating separately to create a unified sound. Without going into much detail about the C/D model, we can say that, as it is a new theory, it requires "tuning" and "debugging". We can model the process of speech production and simulate the speech mechanism in order to determine if the theory is accurate. However, many of the details, or specifics, of the theory are not yet available. We need a different means of debugging this theory. Layered abduction can help. Speech production is, abstractly, a planning or design problem. The task is one of planning a particular set of motions in the vocal tract to create an utterance (in the form of an acoustic signal). The plan requires determining the motions and degrees of motion of each part of the vocal tract. It also requires determining when each part of the vocal tract should begin its motion. The C/D model is basically a theory of planning, of planning the motions of the vocal tract. Abduction is a means of plan recognition [Ng and Mooney, 1989, Josephson, 1988, Josephson, 1987, Calistri, 1989, Lin, 1991]. We hope to implement an inversion of the C/D model by using layered abduction. The proposed inversion can be seen in figure 36. Given vocal tract motions, we can use layered abduction to determine the articulators responsible for those motions. We can use this information as data to determine the actions of the Distributor. In turn, we can use the actions of the Distributor to infer the actions of the Convertor. Finally, we can use the actions of the Convertor to infer the intended utterance.
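The inversion chain just described can be sketched as a stack of abducers, each explaining the findings accepted at the layer below it. The layer names and lookup tables below are invented placeholders; each layer here simply looks up which higher-level unit could explain each finding, whereas real layered abduction would score and select among rival explainers.

```python
def invert_cd_model(motions, layers):
    """Run layered abduction bottom-up: the hypotheses accepted at one
    layer become the findings to be explained at the next layer up,
    inverting one stage of the C/D model per layer."""
    findings = set(motions)
    for name, explain in layers:
        findings = {h for f in findings for h in explain.get(f, [])}
    return findings

# Hypothetical knowledge: which higher-level unit explains each finding.
layers = [
    ("articulators", {"jaw-lower": ["jaw"], "tip-raise": ["tongue-tip"]}),
    ("distributor",  {"jaw": ["syllable-pulse"], "tongue-tip": ["syllable-pulse"]}),
    ("convertor",    {"syllable-pulse": ["nine"]}),
]
print(invert_cd_model({"jaw-lower", "tip-raise"}, layers))  # {'nine'}
```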
This mapping is similar in description to what ARTREC attempts. In fact, the structure of ARTREC came from the C/D model. Currently, ARTREC maps from pellet motions to articulatory gestures to syllables. The Distributor is responsible for generating, from articulatory units, the articulator motions. Thus, the first level in ARTREC is the inverse of the Distributor-to-Articulator mapping. Groupings of gestures can be used to infer syllables. The Convertor takes phonetic units (such as syllables) and generates the articulatory units needed to create the pronunciation of those syllables. Thus, the second level in ARTREC is the inverse of the Convertor-to-Distributor mapping. It is, of course, not as simple as what is being presented here. The information that the C/D model requires is specific knowledge pertaining to the types of gestures needed, the exact motions of the articulators, the way that the Distributor will distribute impulses among the Articulators, the degree to which each muscle must respond to the impulse, and so forth. To obtain these types of knowledge, we can use a system like ARTREC. The equations specific to the articulators can be modeled in ARTREC and used as recog-


Figure 35: The Convertor/Distributor Model of Fujimura

number of findings to explain, because the previous level constrained the composite to include only confident members. Creating islands of certainty helps to constrain the problem no matter how many potential explainers are available. Because of these issues, and because the Peirce strategy takes advantage of many forms of knowledge, I believe that ARTREC is scalable. Because of the constraining that results from layered abduction, there will always be a minimal number of findings to be explained at each level, which in turn should generate a minimal number of potential explainers at each level. In the English language, there is a finite total number of articulatory gestures (perhaps 50 to 100). From combinations of these gestures, all English syllables can be generated. The scalability of ARTREC will be demonstrated if we can expand ARTREC's gesture lexicon to five times its present size. This is one of the enhancements planned for the near future.
7.4 Abduction and the C/D Model

The Convertor/Distributor model [Fujimura and Wilhelms, 1991] is a new theory of non-linear speech production. Fujimura has suggested a means of speech production whereby the elements of the vocal tract (the articulators) move independently of each other, but purposefully, in order to create speech. The C/D model is composed of three levels of processing: the Convertor, the Distributor, and the Articulators (or Actuators). See figure 35. In this model, a person has a thought to express. With this thought (in the form of a phrase or a sentence), there are additional non-linguistic factors such as excitement or boredom, or whether the speaker is in a hurry or under the influence of alcohol. Both the thought and these other factors are used as input to the C/D model. The output is the movements of the vocal tract needed to create the utterance in mind. The Convertor takes a phonological representation of the utterance and the relevant non-linguistic factors and "converts" them into a series of time-indexed "pulses". This is passed on to the Distributor. Each pulse has an identity (of the phonetic units involved) and a magnitude. This sequence of pulses is received by the Distributor. The Distributor then "distributes", or passes along, the magnitude and timing information to some of the articulators (or actuators). The Distributor knows, for each type of syllable (i.e., for each pulse type), which articulators are needed to generate that type of syllable. The magnitude is used to determine to what degree each articulator is needed (or to what degree the articulator should move). Finally, each articulator receives a signal indicating when and to what extent it should "fire" (or move). The firing of an articulator occurs in conjunction with the other articulators needed for the syllable. An impulse response function dictates, for
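The Convertor → Distributor → Articulator pipeline described above can be sketched as follows. The pulse identities, the articulator table, and the impulse response function are all invented placeholders; this illustrates only the flow of control, not Fujimura's actual formulation.

```python
import math

# Hypothetical Distributor knowledge: which articulators each pulse type
# needs, and how strongly each responds to the pulse magnitude.
ARTICULATORS_FOR = {
    "nine": {"jaw": 1.0, "tongue-tip": 0.8},
    "pine": {"jaw": 1.0, "lips": 0.9},
}

def impulse_response(magnitude, gain, t):
    """Placeholder impulse response: a damped rise scaled by the pulse."""
    return magnitude * gain * t * math.exp(-t)

def cd_model(pulses):
    """Convertor output (time-indexed pulses) -> parallel motion channels."""
    channels = {}
    for onset, identity, magnitude in pulses:                 # Convertor output
        for art, gain in ARTICULATORS_FOR[identity].items():  # Distributor step
            motion = [impulse_response(magnitude, gain, t / 4) for t in range(8)]
            channels.setdefault(art, []).append((onset, motion))
    return channels

channels = cd_model([(0.0, "nine", 1.0), (0.4, "pine", 1.4)])
print(sorted(channels))  # ['jaw', 'lips', 'tongue-tip']
```

The result, as in the text, is parallel channels of motion: the jaw channel receives two firings here, one per syllable pulse, each shaped by its own magnitude.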

7.2.5 Conclusions of Using Articulatory Knowledge

Since it is unlikely that pellet data will be widely available for future speech recognition systems, we must find a new means of using articulatory knowledge in a system like ARTREC. Mapping directly from auditory features to articulatory units has not been tried. It has been proposed by [Liberman and Mattingly, 1985, Glass and Zue, 1988, Browman and Goldstein, 1986] and others that such a mapping is not only possible but necessary. The knowledge used by ARTREC is a first approximation of the knowledge needed to get around problems of co-articulation, prosodic effects, determining place of articulation, and so forth.
7.3 Scalability of ARTREC

It is probably apparent to the reader that ARTREC solves only a small problem. Because only 9 words (syllables) and some 15 gestures are used in ARTREC, there might be concern over whether ARTREC can scale up to a larger lexicon. The high performance of ARTREC might be due to the very limited number of hypotheses. However, I am confident that ARTREC will scale, so that larger lexicons and larger numbers of gestures will not diminish ARTREC's performance. Here are several reasons which indicate that ARTREC is scalable.
- An efficient means of generating a small number of hypotheses is possible by using hierarchical classification. This ensures that most of the potential hypotheses are ruled out before abduction even begins, which in turn reduces the complexity of the task for ARTREC. Further, by using hierarchical classification, hypotheses can be generated at different levels of specificity, so that explainers can be given at first in a general form and refined later if necessary. This approach (of using hypotheses at different levels of specificity) has worked in medical diagnostic cases (for instance, see [Punch, 1989]).

- Because of the use of incompatibility handling and other forms of hypothesis interactions, ARTREC can handle making decisions where many potential explainers are present. A larger number of potential hypotheses might make the task more complex, but additional knowledge is readily available.

- Hard decisions are avoided, thus allowing ARTREC to build on islands of certainty. As long as some initial islands can be built, ARTREC can leverage these islands to continue building an explanation. ARTREC will only run into problems if it cannot build the initial islands of certainty.

- Layered abduction offers a means of constraining the possible combinatorial explosion of hypotheses. This is because, at one level, there will be a limited

ognizable. Past systems have gotten around this problem either by doing without place of articulation (as in Hearsay), or by using spectral profiles without specific features and using these profiles as matching templates (as in the case of Harpy), or by training models so that place of articulation does not necessarily need to be determined (Byblos, Sphinx). Identifying place of articulation is very important. As seen in Hearsay, finding manner of articulation is not sufficient to determine the phonetic units. Hearsay proposed too many phonetic units because of the ambiguities created by manner of articulation without place of articulation. There are some features available to aid in identifying the place of articulation (such as formant locations in the transitionary period between consonants and vowels, spectral tilt, and spectral burst profiles). However, as indicated in [Fox and Josephson, 1991] and elsewhere, such features do not work reliably. Articulatory context and non-linguistic effects can alter the expected values of such features. With articulatory knowledge directly available, a speech recognition system can take rough estimates from such features and appeal to the articulatory model to make further suggestions, inferences, and predictions about what findings should appear.
7.2.4 Using Prosodic Information

Another way that ARTREC can aid speech recognition is in reasoning about prosody. At the moment, ARTREC uses exaggerated motions of the mandible incisor to determine which word is being emphasized. This is only the first type of prosodic information that could be used in ARTREC. More sophisticated algorithms should be able to detect other types of prosody occurring because of excitement, boredom, and other non-linguistic effects. Deriving prosody from the acoustic signal is a very difficult problem [Waibel, 1990]. However, clues can be found in the acoustic signal. These clues can be passed along to the articulatory reasoner, which can make predictions about how these prosodic types can affect the acoustic signal. For example, emphasized words will be longer and more clearly uttered. If articulatory knowledge can be used to determine that a word is being emphasized, then factors such as lengthening and clarity can be taken into account. This would give the acoustic reasoner a better chance of correctly identifying the word. Similarly, if articulatory reasoning can infer that a sequence of words is being spoken rapidly, the acoustic reasoner can use this knowledge: vowels would be reduced. There are problems in determining whether a vowel is reduced or non-existent. By using knowledge of prosody, a system can know when vowels might be reduced or non-existent. Because prosody has been a problem in the past, a clear means of dealing with it requires first detecting where these non-linguistic effects come into play. Once this is detected, a system can use representations of how these effects alter the acoustic signal as additional knowledge.

erate syllables and words at higher levels. ARTREC can avoid the problems faced by many past speech recognition systems, which have had to make use of coarticulation rules and other such ad hoc techniques, because ARTREC can explicitly weigh the consequences of various articulatory interactions. While this is all speculation, I hope that the results of ARTREC are convincing in three ways. First, that the strategy for layered abduction is quite capable of pulling together disparate sources of knowledge in order to make clear decisions. Second, that top-down expectations are well suited to the auditory/articulatory mapping: even though auditory-to-articulatory mapping is not well understood, islands of certainty created at both levels will allow for progress in the speech recognition task. Third, that articulatory recognition is a clearly feasible task (as shown by the results of ARTREC). I will now discuss several uses for articulatory knowledge in a speech recognition system.
7.2.2 Making Speech Non-Linear

By allowing two separate representations for "phonetics" and "phonology", a system will be able to take advantage of both forms of knowledge. Past systems which have modeled only phonetics have used a linear theory of speech production. That is, in producing speech, one utters one unit (a phoneme) followed by another unit, followed by another. Speech production is thought of as concatenating units together. This view of speech production seems incorrect [Fujimura, 1992, Fujimura et al., 1991, Browman and Goldstein, 1988] and gives rise to several problems, including the need for coarticulation rules and word boundary rules. These two types of rules have been used in the past to get around problems induced by appealing to linear phonology. By explicitly modeling articulation and articulatory features (or primitives), we can get past these problems. Rather than using phonetic primitives and segmental concatenation, a system can use parallel articulatory features to make sense of articulatory dependencies. Articulatory features are fairly independent and can occur in parallel. Thus, an articulatory model is a much closer representation of the actual speech production mechanism than a model of phonetics.
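The contrast between the two representations can be sketched as data structures. The feature names and time spans below are invented for illustration; they are not ARTREC's gesture inventory.

```python
# Linear phonological view: an utterance is a string of segments.
linear = ["n", "ai", "n"]          # "nine" as concatenated phonemes

# Articulatory view: parallel, independently timed feature tracks.
# Each channel carries its own (start, end, gesture) intervals, which
# may overlap across channels -- no segmentation is imposed.
parallel = {
    "tongue-tip": [(0.00, 0.08, "alveolar-closure"),
                   (0.22, 0.30, "alveolar-closure")],
    "jaw":        [(0.05, 0.25, "open-low")],
    "velum":      [(0.00, 0.10, "lowered"), (0.20, 0.30, "lowered")],
}

# Coarticulation needs no special rules here: overlap between channels
# (e.g. the jaw opening while the velum is still lowered) is directly
# representable, where a segment string would need extra machinery.
overlapping = [(a, b) for a in parallel for b in parallel if a < b
               and any(s1 < e2 and s2 < e1
                       for (s1, e1, _) in parallel[a]
                       for (s2, e2, _) in parallel[b])]
print(len(overlapping))  # 3 overlapping channel pairs
```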
7.2.3 Place of Articulation

Place of articulation is the location within the vocal tract where some constriction occurred. This constriction helps "shape" the acoustic sound for consonants. For instance, labial closure (or lip closure) will produce the labial stop consonants /p/ and /b/. Alveolar closure and partial closure will produce the alveolar consonants such as /t/, /d/, /s/, //, /n/, and so forth. Finding place of articulation from the acoustic signal is extremely difficult. The acoustic signal does not contain features that make place of articulation easily rec-


Figure 34: Articulatory Layer in Speech Recognition

locations as mentioned above). Then, these articulatory hypotheses can be used as leverage for further articulatory and auditory hypotheses in order to infer phonetic units. ARTREC is capable of generating certain types of expectations of articulatory hypotheses based on vocal tract motions. These expectations can be used for leverage in coming to more complete conclusions at these lower levels. In this way, ARTREC becomes a module of an overall speech recognition system, in which ARTREC is used to generate articulatory hypotheses from auditory data, and auditory and articulatory expectations from articulatory hypotheses. As ARTREC builds on islands of certainty, the accepted articulatory hypotheses are used to gen-

and the interactions involved in articulation and the alterations that such interactions will have on the acoustic signal. Some of the major stumbling blocks of acoustic speech recognition have revolved around inadequate (or absent) explicit models of articulation, problems with identifying place of articulation, and problems with the effects that prosody has on the acoustic signal. By separating the two types of knowledge, acoustics and articulation, a speech system can use separate features of articulation rather than concatenating segments from auditory features. By concatenating segments, a system is forced into a linear view of speech production (and thus a linear view of speech recognition), which seems flawed [Fowler, 1986].
7.2.1 How ARTREC can be used to aid Acoustic Speech Recognition

ARTREC, as it exists now, is not useful for acoustic speech recognition. This is because ARTREC expects Microbeam pellet data as its input. To use ARTREC as a component within an acoustic speech recognition system, some changes are needed. Mainly, additional knowledge must be brought into ARTREC so that ARTREC can use acoustic data in addition to articulatory data. Figure 34 shows a detailed view of the articulatory layer which I propose for a large speech recognition system. Part of ARTREC's task is to map articulatory gestures to syllables. This component of ARTREC can be used in a speech recognition system with minimal changes. If we presuppose that we can infer articulatory gestures directly from an acoustic signal, then the higher level of ARTREC can be used without adaptation. Thus, ARTREC can play a specific role in speech recognition: that of mapping directly from articulatory gestures to syllables. However, what if articulatory gestures are not directly available (by inference) from the acoustic signal? Obviously, Microbeam pellet data will be unavailable in real-world situations. Will articulatory recognition be of any use? It is possible now to infer some articulatory information from the acoustic signal. This can be done by using features detected in formant transition locations, burst analyses, spectral profiles, and other auditory information. For example, one method for finding the place of articulation is back extrapolation of the formant tracks from the vowel regions to the consonantal burst. Formant locations are derived from the acoustic signal by signal processing and auditory analysis. By using articulatory knowledge, we can make certain inferences about the tongue location at the time of the consonantal utterance, which can inform us of place of articulation. By using a layered abduction strategy as shown in ARTREC, some initial islands of certainty can be created from auditory analysis.
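As a rough illustration of the back-extrapolation idea: fit a line to the F2 samples at the vowel onset and extrapolate it back to the burst time. The sample frequencies are invented, and real locus frequencies vary with speaker and vowel context; this is a sketch of the cue, not a reliable classifier.

```python
def extrapolate_formant(times, freqs, burst_time):
    """Least-squares line through (time, formant) samples, evaluated
    at the burst time.

    The extrapolated "locus" frequency is a classic (if unreliable)
    cue to place of articulation.
    """
    n = len(times)
    mt = sum(times) / n
    mf = sum(freqs) / n
    slope = sum((t - mt) * (f - mf) for t, f in zip(times, freqs)) / \
            sum((t - mt) ** 2 for t in times)
    return mf + slope * (burst_time - mt)

# Hypothetical F2 samples (Hz) over the first 30 ms of the vowel:
locus = extrapolate_formant([0.03, 0.04, 0.05], [1700, 1600, 1500], 0.0)
print(round(locus))  # 2000
```

An alveolar F2 locus is often placed near 1800 Hz, with labials much lower; an articulatory reasoner would treat an estimate like this only as weak evidence to be reconciled with other findings.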
These auditory hypotheses can then be used as leverage for further problem solving. First, these auditory hypotheses can be used to infer certain hypotheses about articulation (for instance, by using formant


Figure 33: Speech Recognition Task Decomposition Using Articulation


Chapter VII

Implications of this Work



I have described a system for articulatory recognition, ARTREC. I will now consider how ARTREC might be used to aid speech recognition. I will argue why I believe that ARTREC is scalable, and I will indicate uses of ARTREC in other research. I will discuss the use of Peirce, or a Peirce-like abductive strategy, as an enhancement for Hidden Markov Model speech recognition systems, and explain how such a combination can bypass the problems associated with HMMs. Layered abduction, and the Peirce strategy in particular, have far-reaching possibilities. I will conclude with a discussion of the potential uses of Peirce to solve a variety of layered abduction problems in both perceptual and non-perceptual areas.
7.2 Uses of Articulation in Speech Recognition

Let us first turn to articulation and how it might be useful in future speech recognition work. I have shown with ARTREC that articulatory recognition is a feasible task. What I have not shown is how this task can be used to aid speech recognition. As can be seen in figure 33, articulatory recognition can be thought of as one step in the overall speech recognition problem. In [Liberman and Mattingly, 1985], it was proposed that an articulatory representation is necessary for the task of speech recognition. In fact, by separating knowledge of acoustic phonetics and articulatory phonetics, a system can increase its ability to cope with the problems introduced by articulatory dependencies and prosody. Acoustic phonetics can be thought of as deriving phonetic units (such as phonemes or syllables) from the acoustic signal by using auditory features. Articulatory phonetics can be thought of as deriving primitive articulatory units (either articulatory features such as closures, openings, tongue movements, and so forth, or phonetic units such as phonemes or syllables) from the acoustic signal using knowledge of articulation. Making this distinction allows a speech recognition system to reason about both the features in the acoustic signal

the coverage decreased in two speakers' cases, the other three speakers had modest increases in coverage. As for the results, ARTREC has been shown to be a very capable articulatory recognition system. In experiment 4, ARTREC achieved a high accuracy of 96.8%. In experiments 2 and 3, I have shown that ARTREC without top-down processing can still achieve high accuracy. ARTREC is capable of detecting the prosodic effect of corrective emphasis, although for two of the speakers the accuracy was not very high. If more effort had been made to create an algorithm which relied on more features, I feel that ARTREC could have achieved a much higher accuracy. While ARTREC is very limited in scope, I will argue in the next chapter that ARTREC is scalable to a much larger lexicon size and capable of handling many more types of gestures than are currently implemented. In the next chapter I will also discuss how ARTREC can be used in an acoustic speech recognition system, as well as in solving the articulatory recognition problem.

a positive one and no corrective emphasis existed.¹ The other condition occurred if ARTREC did not commit to an answer for some of the words of the utterance. Because ARTREC was unsure of some of these words, ARTREC would not commit to an identification of the emphasized word. In any other condition, ARTREC would choose a word in the utterance as the emphasized one.

Table 14: Prosodic Emphasis Accuracy

    Data Set     Total Utterances   Emphasized Utterances   Correct Identification
    OFJW5p              80                   54                       39
    OFJW6pp             80                   44                       29
    OFJW7p              76                   31                       31
    OFJW11pp            78                   38                       35
    OFJW12ppp           77                   35                       32
    Totals             391                  202                      166

As can be seen in the table, ARTREC varied greatly in detecting the emphasized word. For case set OFJW7p, 100% accuracy was achieved; however, in case set OFJW6pp, only 66% accuracy was achieved. Overall, ARTREC correctly identified 82.18% of the emphasized words where the conditions described above were fulfilled. This is a far cry from "accurate" emphasis detection. I feel that, with some revision, ARTREC could locate most of the emphasized words; however, we have not had a chance to implement any revisions. The simplistic approach taken here is not sufficient alone, but it is a beginning.
6.5 Conclusion

I have tested ARTREC on a set of approximately 400 utterances from five different male and female speakers. In order to demonstrate the usefulness of abduction, we have shown the variations in accuracy and coverage across four sets of experiments. These experiments have shown that abduction is a useful means of pruning hypotheses which were optimistically generated. I also hope to have shown the tradeoff between accuracy and coverage, and that Peirce-built systems can range between the extremes of this tradeoff by simple changes to global threshold values. I have also shown how top-down processing can aid accuracy by building on islands of certainty, and although
¹ If ARTREC failed to detect the word "no" in the utterance, then ARTREC assumed that no emphasis occurred. However, this may have been a mistake, as ARTREC may have erred in not finding the word "no". This strategy was probably a mistake on my part as the system builder, but by the time I realized this problem, I had gone on to other concerns.

lay only in the slight improvement in accuracy between experiments 2 and 3, and the very slight drop in coverage between experiments 3 and 4. While the accuracy here is impressive, I must remind the reader that the lexicon size is very limited. A nine-word lexicon is only a minute fraction of the English language. However, to offset the small lexicon, ARTREC did not have a full set of knowledge or input to work with. In fact, ARTREC was severely handicapped, as it only had access to 4 or 5 points of the vocal tract. When considering just these points, many of the words in the lexicon begin to look very similar. Had ARTREC been given data for voicing, nasality, frequencies, formants, and so forth, and knowledge of syntax and semantics, the task would have become trivial. But the task was not trivial, due to the limited form of input. ARTREC achieved accuracies very respectable for speech recognition. In the past, very few systems have achieved over 95% word accuracy.
6.4 Prosodic Emphasis

Another dimension to investigate is how ARTREC performed on prosodic identification. A very simple algorithm was used to explain exaggerated jaw motion by hypothesizing corrective emphasis. This algorithm looked for the jaw opening which seemed most extreme. ARTREC took into account the effects of articulatory declination. Declination is the phenomenon whereby a speaker has less clarity in (or fullness of) pronunciation as time goes on. This is presumably due to physical factors such as tiredness and boredom. A result of articulatory declination is that an exaggerated motion towards the end of an utterance will look much like a non-exaggerated motion towards the middle or beginning of an utterance. Thus, ARTREC's algorithm for detecting exaggerated motion must take the problem of declination into account. I implemented the declination as a sloping line. ARTREC compared all of the valleys in the mandible incisor pellet data to this sloping line. The valley that penetrated the sloping line the most was assumed to be the syllable which contained the most exaggeration. This, in turn, was explained as emphasis in sentences where negative responses were given. In sentences where positive responses were given, no corrective emphasis was given, and so any exaggeration was thought to be unintentional. This simplistic algorithm worked fairly well on some of the speakers. However, I found that it did not work on all of the speakers equally. The speakers who tended to speak unclearly had much more problematical "exaggeration" in their emphasis. That is, it was much less certain which word was being emphasized. As a result, ARTREC did not fare very well in identifying the emphasized word. The results of the prosodic identification can be seen in table 14, which shows, for each data set independently, the number of utterances in which ARTREC looked for emphasis and the number of those utterances where ARTREC correctly found the emphasized word.
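The declination heuristic can be sketched as follows. The slope, intercept, and valley data are invented, and this is a reconstruction of the idea rather than ARTREC's actual routine.

```python
def emphasized_valley(valleys, slope, intercept):
    """Pick the jaw-opening valley that penetrates the declination
    baseline the most.

    valleys: list of (time, jaw_height) minima from the mandible
             incisor pellet track (lower height = wider opening).
    The baseline y = slope * t + intercept models declination, so a
    late valley must dip proportionally lower to count as exaggerated.
    """
    best, best_depth = None, 0.0
    for t, y in valleys:
        depth = (slope * t + intercept) - y   # penetration below the line
        if depth > best_depth:
            best, best_depth = (t, y), depth
    return best  # None if no valley penetrates the baseline

# Three jaw valleys; the middle one dips well below the sloping baseline.
valleys = [(0.2, 9.0), (0.6, 4.0), (1.0, 7.0)]
print(emphasized_valley(valleys, slope=-2.0, intercept=10.0))  # (0.6, 4.0)
```

The downward slope is what keeps a modest late opening from losing to an equally modest early one, which is exactly the declination problem described above.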
ARTREC did not seek out emphasis under two conditions. The first occurred if the word "no" was not found; in such a case, ARTREC assumed that the answer was

I was quite surprised by the results of this experiment. I feel that I was on the right path to increased coverage; however, the little bit of top-down processing was not enough to fully alter the results. Further forms of top-down processing would be required to allow this pessimistic ARTREC to achieve both high accuracy and high coverage.
6.3 Discussion of ARTREC Results

As I had hoped, ARTREC was able to achieve very high results in articulatory recognition. I was impressed with the high accuracies shown in experiments 2, 3, and 4. I had high hopes for this system, and I was not disappointed. Table 12 compares the four experiments in terms of the flow of processing, the number of abducers, and the "acceptance" criteria for hypotheses (i.e., pessimistic or normal). Table 13 shows the overall experimental results of ARTREC.

Table 12: Overall Experimental Results

    Experiment     Flow of Processing       Number of Abducers   Acceptance Criteria
    Experiment 1   Bottom-Up Only                   1                Normal
    Experiment 2   Bottom-Up Only                   2                Normal
    Experiment 3   Bottom-Up Only                   2                Pessimistic
    Experiment 4   Bottom-Up and Top-Down           2                Pessimistic

Table 13: Overall Experimental Results

    Experiment     Coverage   Accuracy
    Experiment 1    89.47%     89.61%
    Experiment 2    84.67%     94.87%
    Experiment 3    66.26%     96.07%
    Experiment 4    66.20%     96.80%

There was little surprise in our experimental results. The single-abduction system suffered where the layered abduction systems achieved higher accuracy. The pessimistic experiment was to demonstrate that there is a clear tradeoff between accuracy and coverage, and this was the result of experiment 3. The full abduction system surpassed the bottom-up layered abduction system in terms of accuracy, and the pessimistic abduction system in terms of coverage. If there were any surprises, it

accepting hypotheses and build upon islands of clear certainty, propagating higher-level expectations to the middle level. I had hoped that experiment 4 would yield high accuracy (as in experiment 3) but also high coverage (as in experiment 2). The results of experiment 4 are shown in table 11. As we can see, the coverage stayed almost the same as in experiment 3, at 66.20%, and the accuracy increased slightly, to 96.80%.

Table 11: Experiment 4 Results: Full Layered Abducer

    Data Set     Total Words   Committed   Correct   Coverage   Accuracy
    OFJW5p            933          609        588      65.27%     96.55%
    OFJW6pp           975          629        617      64.51%     98.09%
    OFJW7p            887          501        490      56.48%     97.80%
    OFJW11pp          961          710        681      73.88%     95.92%
    OFJW12ppp         966          677        650      70.08%     96.01%
    Overall          4722         3126       3026      66.20%     96.80%

To my surprise, the results of experiment 4 were not what I had expected. I had hoped to see further coverage with the same accuracy. Instead, the coverage remained almost the same (a very slight drop) while accuracy increased slightly. In two of the five data sets coverage actually went down, while in the other three coverage increased slightly. There was greater accuracy. For this experiment, ARTREC would make use of islands of certainty found at the syllable level to help clear up the gesture level. However, this clarification had two results. First, it made some hard decisions easier (that is, where ARTREC could not decide between two or more hypotheses in experiment 3, ARTREC could now make the decision). The reason that hard decisions were made easier was that the top-down expectations increased the plausibility of some hypotheses. However, this did not occur very often (i.e., the top-down processing did not clear up many hard decisions). The reason for the increased accuracy was also due to hard decisions.
In this case, previously wrong hypotheses which had been accepted were no longer accepted, because the top-down expectations increased some hypotheses' scores such that these hypotheses created hard decisions. That is, previously incorrect hypotheses now had plausible alternatives. Thus, ARTREC did not accept the wrong hypotheses at the gesture level, but left the decisions unresolved. What we can see from this experiment is that ARTREC gained some coverage because the certainty helped to remove some hard decisions, but also that the certainty helped to create new hard decisions. The overall result was a slight increase in accuracy and a very slight decrease in coverage.

Table 10: Experiment 3 Results: Pessimistic Bottom-Up Layered Abducer

    Data Set     Total Words   Committed   Correct   Coverage   Accuracy
    OFJW5p            933          603        577      64.63%     95.69%
    OFJW6pp           975          629        606      64.51%     96.34%
    OFJW7p            887          505        488      56.93%     96.63%
    OFJW11pp          961          699        671      72.74%     95.99%
    OFJW12ppp         966          693        664      71.74%     92.93%
    Totals           4722         3129       3006      66.26%     96.07%
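Coverage and accuracy throughout these tables appear to follow the natural definitions (committed/total and correct/committed, as percentages); a quick check against the OFJW5p row:

```python
def coverage_and_accuracy(total, committed, correct):
    """Coverage = fraction of words committed to; accuracy = fraction
    of committed words identified correctly. Both as percentages."""
    return round(100 * committed / total, 2), round(100 * correct / committed, 2)

# OFJW5p row of Table 10: 933 total words, 603 committed, 577 correct.
print(coverage_and_accuracy(933, 603, 577))  # (64.63, 95.69)
```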

total data. ARTREC-3 would not commit to any hypothesis which had reasonable competitors.
6.2.4 Experiment 4: Full Layered Abduction

For our last experiment, I expanded the system to include top-down processing. This processing was in the form of higher-level expectations. An expectation can be used to alter the plausibilities of lower-level hypotheses. ARTREC uses the "certainty" of the accepted hypothesis as leverage. In this case, an expectation is able to increase or decrease lower-level hypotheses' plausibilities. The result is that some hypotheses will become clear or weak bests because their plausibilities were raised, or because competing hypotheses had their plausibilities lowered. For experiment 4, I used the same system as experiment 3, but added top-down expectations and gave ARTREC an altered flow of processing to include two bottom-up passes. The top-down expectations were of the following form. If a syllable was concluded after the first bottom-up pass, expectations were created for the gestures which should appear in that syllable. For example, a "nine" syllable would expect the gestures:

    alveolar closure
    tongue low
    final alveolar closure

If these gestures occurred at the approximate time location (where the syllable hypothesis would expect each gesture to occur), then the gestures' scores were raised. After altering the gestures' scores, ARTREC reran both levels. The amount a score was raised depended on how confident ARTREC was in accepting the syllable hypothesis. More confident hypotheses were able to alter gesture hypothesis scores more dramatically than less confident ones. I used ARTREC-3 as a basis so that ARTREC-4 would be very pessimistic about

6.2.3 Experiment 3: Pessimistic Bottom-Up Layered Abduction

In this version of ARTREC, I used the same system as in Experiment 2, in that ARTREC had two abducers in a bottom-up flow of processing. However, an attempt was made to get better accuracy by sacrificing some explanatory coverage. I made it harder for ARTREC to accept hypotheses by requiring that any accepted hypothesis be very convincing. Again, there was no top-down processing used in this experiment. Three changes were made. First, essentials were accepted only if they had a minimum confidence of likely. This differs from the previous versions, in which essentials could be rated as low as neutral. Second, it is more difficult to be considered a clear best hypothesis. In the earlier ARTREC, a clear best was any hypothesis which was rated at least somewhat-likely and surpassed alternative explainers by two units of confidence (i.e., if the clear best was likely, all competitors had to be at best neutral). In this experiment, clear bests had to be rated at least likely and surpass alternatives by three units or more. I also made it harder for ARTREC to accept a weak best. In this experiment, weak bests must also have a value of likely or greater. Table 9 reflects the changes made between the "pessimistic" versions of ARTREC (both experiments 3 and 4 used the pessimistic version) and the "normal" version of ARTREC (from experiments 1 and 2).

Table 9: Pessimistic versus Normal Abduction Acceptance Criteria

Acceptance Criteria      Normal Mode                 Pessimistic Mode
Essentials               Neutral or better           Likely or better
Clear Best Threshold     Somewhat Likely or better   Likely or better
Clear Best Surpassing    2 units                     3 units
Weak Best Threshold      Neutral or better           Likely or better
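The acceptance criteria in Table 9 can be sketched as a single decision function. This is an illustrative sketch, not ARTREC's or Peirce's actual code: the integer confidence scale and the function name `accept` are my assumptions, while the thresholds and surpassing margins come from the table.

```python
# Assumed ordinal confidence scale, lowest to highest.
SCALE = ["ruled-out", "unlikely", "neutral",
         "somewhat-likely", "likely", "very-likely"]
NEUTRAL, SOMEWHAT_LIKELY, LIKELY = 2, 3, 4

def accept(best: int, runner_up: int, essential: bool,
           pessimistic: bool) -> bool:
    """Decide whether the best-rated hypothesis may be accepted."""
    if essential:
        # An essential is the only explainer for some finding; it still
        # needs a minimum confidence (likely in pessimistic mode).
        floor = LIKELY if pessimistic else NEUTRAL
        return best >= floor
    # Clear best: above the threshold AND surpassing the nearest
    # competitor by the required number of confidence units.
    threshold = LIKELY if pessimistic else SOMEWHAT_LIKELY
    margin = 3 if pessimistic else 2
    return best >= threshold and best - runner_up >= margin
```

Under these rules a likely hypothesis with a neutral competitor is accepted in normal mode but deferred as a hard decision in pessimistic mode, which is exactly the coverage-for-accuracy trade reported below.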

The results for this experiment are shown in table 10, where we can see that ARTREC-3 achieved a coverage of 66.26% with 96.07% accuracy. A slight improvement in accuracy was achieved. This tells us that most of ARTREC's errors came from occurrences where ARTREC was genuinely confused about the data, and not from selecting between hypotheses in the cases of "hard decisions". A 4% error rate is very slight, yet all of these errors occurred because ARTREC misinterpreted the data. That is, we know that ARTREC did not err in this experiment by choosing the wrong hypothesis during hard decisions. With the pessimistic control, ARTREC would not even attempt to make a hard decision. Therefore, each error occurred when hypotheses' scores or explanatory coverage were clearly incorrect. As can be seen, the coverage of experiment 3 is only about two-thirds of the

Table 7: Experiment 1 Results: Single Abducer Version

Data Set     Total Words   Committed   Correct   Coverage   Accuracy
OFJW5p            933          798        709      85.53      88.85
OFJW6pp           975          874        802      89.64      91.76
OFJW7p            877          810        740      92.36      92.36
OFJW11pp          961          851        768      88.55      90.25
OFJW12ppp         966          892        767      92.34      85.99
Overall          4722         4225       3786      89.47      89.61

potheses to the lower level so that the events in the pellet data could be explained either as articulatory gestures or noise (random or unintentional motions). This version of ARTREC would hypothesize the causes for the pellet motions, score these hypotheses, and then construct a composite explanation. The remainder of the system is the same as in Experiment 1. There was no top-down processing used in this experiment. This version of ARTREC performed better, indicating that abduction is a very useful means of pruning away unneeded hypotheses. It also shows that, where cueing is overly optimistic (that is, when too many hypotheses are cued), some other process is needed to decide which hypotheses are relevant. Abduction is perfectly suitable for this. The results of ARTREC-2 are shown in table 8. Here, we see that overall, ARTREC-2 achieved 84.67% coverage and 94.87% accuracy.

Table 8: Experiment 2 Results: Layered Abduction

Data Set     Total Words   Committed   Correct   Coverage   Accuracy
OFJW5p            933          744        697      79.74      93.68
OFJW6pp           975          851        813      87.28      95.53
OFJW7p            877          745        712      84.95      95.57
OFJW11pp          961          816        785      84.91      96.20
OFJW12ppp         966          842        786      87.16      93.35
Overall          4722         3998       3793      84.67      94.87

Comparing the results of experiment 1 and experiment 2 shows slightly less coverage (5%) but a slight improvement in accuracy (5%). A 95% accuracy is respectable for the speech recognition task.

Gestures were hypothesized based on cues and scored based on pattern matching. Syllable hypotheses were used to explain all of the gestures, and noise hypotheses were also used to explain those gestures which did not score very high based on pattern matching. Because there was no way to restrict the cueing process, we found that a much greater number of gestures were being suggested than were needed. For example, if an event cued three different gesture hypotheses, all three hypotheses were considered as true and required explaining. This is an unsophisticated approach to the problem. Abduction, among other things, is a means of pruning down the number of hypotheses one has to work with. The result of an abduction should be a small set of believed hypotheses rather than a large set of potential hypotheses. Without abduction, we are either stuck with the original set of generated hypotheses or must find some other means of pruning away some of the hypotheses. The former was the case in the first ARTREC. This version of ARTREC made many mistakes in attempting to explain gestures which did not really exist. While noise hypotheses at the syllable level were able to explain some of the poorly rated gestures, ARTREC usually had problems when noise hypotheses were not suggested or when gestures were on the borderline between "real" and "noise". The incompatibility handling also caused problems because ARTREC would often make mistakes with the initial islands of certainty, which would lead to ruling out possibly good hypotheses. The flow of processing was to generate gesture "data" from the motion data (events) and then explain the gesture "data" in terms of syllables. No noise hypotheses were generated from the motion data, which meant that erroneous gestures still needed to be explained by syllable hypotheses or noise hypotheses.
The syllable level was essentially the same as the current ARTREC's syllable level except that there were many more gestures to explain than in the two-abducer system. There was no top-down processing used in this experiment. The results from Experiment 1 are given in table 7, which shows that ARTREC-1 was able to achieve 89.47% coverage and 89.61% accuracy. The initial results were bothersome. I had anticipated high results. ARTREC did achieve a reasonable amount of coverage, but 90% accuracy is not satisfactory considering the limited lexicon. We can explain a good deal of the inaccuracy by the need to explain the gestures, many of which were not even present. These results led us to construct a second abducer for the lower level rather than just using hypothesization for the lower level.
6.2.2 Experiment 2: Bottom-Up Layered Abduction

In the updated system, I added a lower level abducer for forming a composite hypothesis of articulatory gestures in order to explain the events. Previously, any gesture that was hypothesized was automatically accepted as true. I also added noise hy-

Coverage is the number of "committed to" words divided by the number of total words. This percentage measures the amount of data that ARTREC explained. Accuracy is the number of words ARTREC identified correctly out of the words it "committed to". Each experiment's table will contain the Total Words, Committed, Correct, Coverage, and Accuracy. Committed is the number of words ARTREC committed to, and Correct is the number of words that ARTREC identified correctly. Coverage and Accuracy will be given as percentages. Later in this chapter we will look at the accuracy of the emphasis identification. Since emphasis identification was not dependent on the type of experiment, we will only discuss emphasis identification for experiment 2. This experiment was the only one for which we accumulated emphasis accuracy statistics.

Table 6 shows the 5 sets of speaker data, how many utterances were used for each data set, and the number of words contained within all of the utterances. There were between 80 and 82 total utterances per data set, but some of these were discarded due to tracking errors in the pellet data or signal processing errors which occurred during the preprocessing phase. Each utterance is an entire question/answer pair which consists of 14 words (e.g., "Is it five five nine pine street? No, it's NINE five nine pine street"). However, some of these words are not within the pellet data, as each pellet data record only contains up to 5 seconds and some of the words may not have been recorded.

Table 6: Data Set Description

Data Set     Number of Utterances   Number of Words
OFJW5p                80                  933
OFJW6pp               80                  975
OFJW7p                76                  877
OFJW11pp              78                  961
OFJW12ppp             77                  966
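The two measures defined above are simple ratios; a minimal sketch (function names are mine, the definitions are from the text):

```python
def coverage(committed: int, total_words: int) -> float:
    """Percentage of total words that ARTREC committed to (explained)."""
    return 100.0 * committed / total_words

def accuracy(correct: int, committed: int) -> float:
    """Percentage of committed words that were identified correctly."""
    return 100.0 * correct / committed

# Overall row of the Experiment 2 results: 3998 of 4722 words committed,
# 3793 of those identified correctly.
cov = coverage(3998, 4722)   # about 84.67
acc = accuracy(3793, 3998)   # about 94.87
```

Note that accuracy is conditioned on commitment: a system that defers every hard decision can raise accuracy while lowering coverage, which is exactly the trade explored in experiments 3 and 4.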


6.2.1 Experiment 1: Single Abduction

The initial version of ARTREC had a single abducer which would explain articulatory gestures in terms of syllables but would not attempt to explain the articulatory events at all. That is, the lower level of ARTREC was simple pattern matching, so all hypothesized articulatory gestures were considered as true (no matter how poorly rated they were or whether they were explanatorily superfluous). A single abducer would then explain all of the gestures in terms of syllable and noise hypotheses. This differs from the current version, which has two abducers: one to explain events in terms of gestures and one to explain gestures in terms of syllables.

Experimental Results



In this chapter, I will discuss four experiments conducted using the ARTREC system. I will give experimental results and offer some analyses of the results. I will also discuss how ARTREC fared on emphasis identification. I will then turn to some proposed enhancements for ARTREC.
6.2 ARTREC Experiments and Terminology

The initial implementation of ARTREC was a single-layer abduction system. I ran all five sets of speaker data. I was not satisfied with the accuracy of this version of ARTREC, so I then added another abducer in hopes of improving ARTREC's accuracy. This second abducer was placed in the lower level to improve the quality of the articulatory gestures used as findings for the second level. To further test layered abduction, and out of curiosity, I altered the newer version of the system by making it harder for a hypothesis to be accepted. This "pessimistic" ARTREC was created by altering global threshold values in the Peirce tool. I had hoped to achieve a higher accuracy at the expense of explaining less of the data. Finally, I added top-down processing to the pessimistic version in hopes of achieving both high accuracy and more coverage than the pessimistic version. The results of this experiment demonstrate a fuller layered abduction process combining several forms of knowledge using both data-driven and hypothesis-driven processing.

The results will be discussed in the following terms: how many total words were involved, how many words ARTREC committed to, how many words ARTREC correctly identified, ARTREC's coverage, and ARTREC's accuracy. The "committed to" value indicates how many words ARTREC actually came to a conclusion about. Remember that in Peirce, hard decisions are left open. If a Peirce abducer cannot clearly pick a hypothesis to explain a datum, that datum is left unexplained. In the best case, the datum would be explained by more processing (such as when hypotheses cause other hypotheses to be ruled out due to incompatibilities, thus removing the hard decision). In the worst case, the datum is left unexplained.

5.4 Conclusion

In this chapter, I have given a pictorial look at the processing of the ARTREC system. I have shown a typical sample case and described the complexity that ARTREC faces in terms of the numbers of findings and hypotheses at each level. By having two levels of abduction, the total number of hypotheses and findings is limited (as opposed to having syllables directly explain events, which would have generated many more hypotheses and a greater overall problem complexity). I have shown that ARTREC can quickly make decisions in order to come to a conclusion. In the next chapter, I will discuss four experiments undertaken by ARTREC and the experimental results.


Figure 32: Emphasis Detection

but the typical case requires only 4 or 5 passes through the main loop. If top-down processing is used, then expectations are generated at this point (based on the accepted syllable hypotheses) and ARTREC recompiles the lower-level composite followed by the higher-level composite. Typically, the most ARTREC will run through the main Peirce loop is 10 times (requiring between two and four minutes on a SPARCstation SS1).
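The main Peirce loop described here, which repeatedly tries the strongest acceptance criterion first (confirmed hypotheses, then essentials, then clear bests, then weak bests), can be sketched schematically. This is a sketch of the control structure only, under my own naming; the actual Peirce tool's interface surely differs.

```python
def peirce_loop(findings, hypotheses, find_confirmed, find_essentials,
                find_clear_bests, find_weak_bests, max_passes=10):
    """Schematic sketch of Peirce's main problem-solving loop: on each
    pass, take the hypotheses from the strongest criterion that yields
    any, and stop when no criterion yields anything."""
    accepted = []
    for _ in range(max_passes):
        for finder in (find_confirmed, find_essentials,
                       find_clear_bests, find_weak_bests):
            newly = finder(findings, hypotheses, accepted)
            if newly:
                accepted.extend(newly)
                break   # restart from the strongest criterion
        else:
            break       # no criterion yielded anything; done
    return accepted
```

Because accepting a hypothesis can rule out incompatible competitors, a later pass may find new essentials or clear bests that did not exist earlier, which is why the loop restarts from the strongest criterion after every acceptance.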

tempt to locate corrective emphasis. This should appear on only a single word (if any). Here, I will show, briefly, the emphasis detection. This will be from a different example than the previous one (the previous example had a positive reply, as in "yes, it's nine five nine pine street", and there is no emphasis in this case). Once ARTREC has inferred the syllables to explain the motions, ARTREC reexamines the mandible motion. If the word "no" was found, ARTREC examines the syllable locations for each word following "no". If most of these words have been found, then ARTREC will seek out corrective emphasis. Corrective emphasis (and most types of emphasis) can be recognized by exaggerated mandible motion. Therefore, ARTREC examines the depth of each syllable's valley (in the mandible channel). However, ARTREC takes into account the factor of articulatory declination. ARTREC examines the valleys and plots a shallow sloping line across them. The valley which penetrates the sloping line the most is considered to be the location of the emphasis. ARTREC explains this exaggerated motion as the syllable which received the corrective emphasis. Figure 32 shows the corrective emphasis on a typical example. In this example, the words are "nine FIVE nine pine street". You can see that ARTREC has found the valley which most penetrates the sloping line. This second valley represents the word "five".
5.3 Performance and Complexity

ARTREC is capable of coming to quick decisions using the Peirce algorithm. In this example, I did not show each of Peirce's problem-solving steps. A typical case will require some small number of iterations through Peirce's main problem-solving loop (i.e., the loop which first looks for confirmed hypotheses, then essentials, then clear bests, and then weak bests). For event identification, each pellet is examined one time and ARTREC identifies each peak and valley. Next, ARTREC considers each type of gesture (such as alveolar closure, labial closure, tongue retroflection) and scans the list of events for close "template" matches. Each match (whether very close or remote) is noted as a hypothesis. Those hypotheses which do not match well are immediately ruled out. All other hypotheses are scored and explanatory knowledge is generated. In addition, noise hypotheses are suggested at the same time. The first abducer generates an explanation of events in terms of gestures in one to two passes. The only form of hypothesis interaction is the incompatibilities between noise hypotheses and gesture hypotheses. ARTREC will come to a very quick explanation at this level. Next, gestures are used to cue syllable and noise hypotheses. After scoring, each hypothesis is assigned its explanatory coverage (again by examining the gestures). Then, the second abducer generates an explanation. Here, the abducer may run through its main loop many times (potentially as many times as there are hypotheses)


Figure 31: Best explanation of pellet data in terms of syllables


Figure 30: Syllable and noise hypotheses

5.1.3 Syllable Hypothesization

ARTREC now examines the accepted gesture hypotheses and generates syllable hypotheses to explain them. Figure 30 shows the syllable hypotheses and noise hypotheses offered to explain the gestures previously accepted. In the typical case, between 40 and 60 syllables are hypothesized to explain most of the gestures. In addition, between 25 and 50 noise hypotheses are generated to explain some of the more suspect gestures (note that there will be overlap between what syllable hypotheses and noise hypotheses can account for). In figure 30, you can see that there are many "five" and "nine" syllable hypotheses, whereas there is only one hypothesis for each of "is", "it" and "street". Typically, between two and five syllable hypotheses are generated for each syllable location. Since "five", "pine" and "nine" are all similar to a large extent, syllable hypotheses for these three syllables will typically be generated when the word is any one of five, nine or pine. Also, because of the /s/ gesture (partial alveolar closure), "street", "is", and "its" will often be hypothesized to explain occurrences of the /s/ gesture. The syllable and noise hypotheses will have been scored at this point by using Recognition Agents. Those hypotheses which were clearly implausible will have been ruled out before this point. The second abducer is now run, and it comes to the final explanation of the motions in terms of syllables. The best explanation for this example is shown in figure 31. Here, you can see the approximate syllable (word) locations and the syllables chosen for the best explanation. It should be noted that this example used only half of the utterance. The complexity of the overall problem can be thought of as twice what was displayed here. The second half is similar, but in some ways more difficult. The difficulty arises due to two independent factors. First, articulatory declination might occur, which would mean that the second half of the utterance is less clearly enunciated.
The result of the articulatory declination is less certain hypotheses asserted by ARTREC. The other problem lies in the words "yes its" and "no its". Especially in the case of "no its", ARTREC has problems because these two words are so rapidly spoken. ARTREC will sometimes suggest only a single hypothesis to explain the motions here, either "no" or "its". It should be noted that ARTREC has a further hypothesis for a single word "noits" which is sometimes used in such a case. In the case of "yes its", these two words are usually more clearly spoken and ARTREC has fewer problems.
5.2 Emphasis Detection

In the last section, I showed ARTREC running through half of a typical example. What was not shown was ARTREC's prosodic recognition. ARTREC will only at-


Figure 29: Best explanation of the events in terms of gestures


Figure 28: Noise hypotheses and what they can explain


Figure 27: Gestures hypotheses and what they can explain

explain are noted with small circles in the figure. You will notice that the mandible incisor data only requires the valleys to be explained. Valleys are explained by syllable hypotheses and prosodic emphasis.
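The peak-and-valley event extraction described in this chapter can be sketched minimally as a sign-change test on successive samples of one pellet channel. This is an assumed, simplified detector (ARTREC's actual detector presumably smooths the pellet data and applies amplitude thresholds):

```python
def find_events(samples):
    """Return (index, kind) pairs for local extrema in one pellet
    channel, kind being 'peak' or 'valley'."""
    events = []
    for i in range(1, len(samples) - 1):
        if samples[i - 1] < samples[i] > samples[i + 1]:
            events.append((i, "peak"))      # local maximum
        elif samples[i - 1] > samples[i] < samples[i + 1]:
            events.append((i, "valley"))    # local minimum
    return events
```

Applied to the tongue and lip channels, such extrema become the 150 to 250 events per utterance that the abducers must explain; on the mandible channel, only the valleys (jaw openings) matter.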




The next step in ARTREC is to hypothesize the cause for the events. The cause is in terms of articulatory gestures and noise. The typical ARTREC case will have between 75 and 150 gesture hypotheses and between 50 and 100 noise hypotheses to explain the articulatory events. Out of these, most hypotheses are eventually ruled out, leaving only 30 to 50 gesture hypotheses and 15 to 25 noise hypotheses to contribute to the composite hypotheses formed at this level. Figure 27 shows the gesture hypotheses and what they can explain of the events. Each gesture type is listed once; the number in brackets following the gesture notes how many different gesture hypotheses of the annotated type have been found for the example. You will see that most of these hypotheses are discarded after the first abduction. Figure 28 displays the noise hypotheses that were hypothesized to explain the events where either no gesture hypotheses were available, or where gesture hypotheses were available but were not very convincing (i.e., not highly confident). Each hypothesis has been scored at this point by using gesture templates. Clearly implausible hypotheses are ruled out before ARTREC reaches this point. The next step in ARTREC is to generate the first composite hypothesis. This is accomplished by the lower-level abducer. The input to this abducer is the sets of noise and gesture hypotheses and the set of articulatory events; the output is a small (hopefully) list of accepted gesture and noise hypotheses. Figure 29 shows the two types of hypotheses accepted into the composite hypothesis, and what each hypothesis can explain out of the events. It should be noted that figure 29 does not show explanations for all of the events. Some events are not important and are left unexplained. These events are those which fall outside of the utterance (i.e., before or after the utterance) and those which are caused by small amounts of motion.
These motions are more difficult to explain because there are fewer distinguishing factors to clearly promote one single hypothesis. Therefore, these lesser motions are left unexplained in the hope that they are unimportant. We have found that ARTREC can still come to a complete conclusion (in terms of the syllables) in spite of leaving some events unexplained. The composite from the lower level consists of noise and gesture hypotheses. ARTREC now discards the noise hypotheses and uses the gesture hypotheses. These are passed up to the next level to become findings to be explained.


Figure 26: Events indicated for ARTREC example


Figure 25: Pellet data for ARTREC example

are used as data to be explained by gestures (and later syllables). The valleys for the mandible pellet are used as data to be explained by syllable locations and prosodic emphasis hypotheses. A typical case for ARTREC has between 150 and 250 events to explain. Some of these can immediately be dismissed as falling outside of the actual utterance (either before the utterance commences or after the utterance ends). The remaining events are those which ARTREC will attempt to explain. ARTREC will attempt to explain as many as it can (although complete coverage is not necessary). Figure 26 shows, for the example utterance, the events to explain. The events to

An Example



This chapter will demonstrate ARTREC in action with a typical utterance. The example will be presented as a series of pictures showing the pellet data and the parts of the pellet data being explained. The example demonstrated here is from the OFJW12ppp data set, record number 30. It will demonstrate ARTREC version 2 (i.e., from Experiment 2 in chapter 6) using the flow of processing described in chapter 4. The top-down processing discussed in chapter 4 is a simple extension to the bottom-up processing, but will not be demonstrated in this example. Throughout this chapter, I will refer to "typical" numbers of data and hypotheses. These typical values refer to ARTREC version 2, and not versions 1, 3 or 4 discussed in the next chapter. By "typical", I mean the approximate number of hypotheses and data that ARTREC has to deal with for each new utterance. The example is of the utterance "Is it Nine Five Nine Pine Street?". For brevity, I will only focus on this portion of the utterance. The remainder of the utterance, "yes its Nine Five Nine Pine Street", is not shown here as part of the example. However, ARTREC works in the same way on both portions of the utterance. The utterance, as it is presented to ARTREC, is shown in figure 25. This figure displays the form of the data, Microbeam pellet data. You can see the 5 pellets (tongue tip denoted as TT, tongue blade denoted as TB, tongue dorsum denoted as TD, lower lip denoted as LL, and mandible incisor denoted as MAN I). All of the pellets are shown in both X and Y directions (denoted as TT X and TT Y, and so forth) with the exception of the mandible incisor, for which ARTREC only uses the Y direction. The acoustic signal is also given for the reader's reference only; ARTREC does not use the acoustic signal at all.
5.1.1 Events and Syllable Boundaries

The first step in ARTREC, after loading the pellet data, is to examine the pellet channels for peaks and valleys. The peaks and valleys for the tongue and lip pellets

system is to seek a word which seems to be emphasized by looking for exaggerated jaw motion. In the next chapter, I will present a detailed example of ARTREC running a typical case. The chapter after that will consider ARTREC's experimental results.

gesture, they are incompatible. This is because it is not possible for a gesture to be both intentional and unintentional.
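The incompatibility handling described above, where accepting one hypothesis immediately rules out everything marked incompatible with it, can be sketched as follows. The names are my own; this is an illustration of the mechanism, not the Peirce tool's actual interface.

```python
def accept_with_incompatibilities(accepted_name, candidates,
                                  incompatible_pairs):
    """Accept one hypothesis and rule out every candidate marked
    incompatible with it (pairs are unordered)."""
    ruled_out = {b for a, b in incompatible_pairs if a == accepted_name}
    ruled_out |= {a for a, b in incompatible_pairs if b == accepted_name}
    return [c for c in candidates if c not in ruled_out]
```

So if a noise hypothesis and a syllable hypothesis compete to explain the same gesture, accepting the syllable removes the noise hypothesis from all further consideration, which is what makes each acceptance an "island of certainty" that propagates.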
Emphasis Knowledge

The last type of knowledge that the system has is knowledge pertaining to prosody. However, in our current implementation of ARTREC, the only type of prosodic information is corrective emphasis. Corrective emphasis is a form of emphasis whereby one attempts to correct someone else's question or answer. In this case, the corrective emphasis will be on one single word (syllable) in the utterance. Corrective emphasis can be found by seeking exaggerated jaw motion. Therefore, the mandible incisor is examined. Each word is indicated by jaw opening and closing. An emphasized word will have an even greater jaw opening. If we only had to look for the largest jaw opening, then this would be a simple problem to solve. Unfortunately, that is not the case. There is a phenomenon in speech known as articulatory declination [Vayra and Fowler, 1992, Pierrehumbert, 1979], where a speaker will have more clarity in enunciation at the beginning of the utterance. As time goes on, the speaker will enunciate less and less. The emphasized word is always in the second half of the utterance (for our data). ARTREC must compare the jaw openings of the words in the second half of the utterance and, by taking into account the articulatory declination, decide which word was emphasized. This is implemented as a straightforward strategy of finding the jaw openings, computing a sloping line across the locations of the jaw openings, and determining which word penetrates furthest across the sloping line. The slope of the line is derived from empirical examination of pellet data.
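The declination-compensated strategy just described can be sketched as follows. This is a hedged illustration: the representation of valleys as (time, depth) pairs, the function name, and the placeholder slope value are my assumptions; the text only says that ARTREC's slope was derived empirically from pellet data.

```python
def find_emphasis(valleys, slope=0.5):
    """valleys: list of (time, depth) pairs, one per word's jaw opening,
    with lower depth meaning a deeper (wider) opening.
    Returns the index of the word judged to be emphasized."""
    t0, d0 = valleys[0]
    # Declination line anchored at the first valley and rising by
    # `slope` per unit time: later openings are expected to be shallower.
    penetration = [d - (d0 + slope * (t - t0)) for t, d in valleys]
    # The emphasized word is the valley falling furthest below the line
    # (most negative penetration), not simply the deepest valley.
    return penetration.index(min(penetration))
```

Without the sloping line, the first word would usually win by declination alone; with it, a later word whose opening is exaggerated relative to the expected decline is selected instead.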
4.4 Conclusion

In this chapter, I have described the ARTREC system, a layered abduction system solving the problem of articulatory recognition. The strategy used is a layered abductive strategy which attempts to explain vocal-tract motions in terms of words, noise and emphasis. The exact flow of processing is to explain vocal-tract motions in terms of articulatory gestures, and then to use the articulatory gestures as input to be explained by syllables. Syllables are the highest level currently implemented in the system (as all of the words in our lexicon are monosyllables). The knowledge types needed to solve this task pertain to articulatory events, articulatory gestures, syllables and corrective emphasis. The same basic strategy is used at both levels of processing: generate plausible hypotheses to explain the findings, score the hypotheses, set up explanatory information, and then use the abductive strategy in Peirce to generate a best explanation. The last step of the

Table 4: Syllable Recognition Features

Syllable   Initial            Middle       Final
Five       labiodental        tongue low   final labiodental
Nine       alveolar           tongue low   final alveolar
Pine       bilabial           tongue low   final alveolar
Its        tongue forward     alveolar     partial alveolar
Is         tongue forward     --           partial alveolar
No         alveolar           --           tongue backwards
Yes        tongue forward     --           partial alveolar
Street     partial alveolar   --           final alveolar

Table 5: Syllable Rule-Out Knowledge

Syllable   Rule-Out Gestures
Five       alveolar (I,F), bilabial (I), tongue backwards (M)
Nine       labiodental (I), bilabial (I), tongue backwards (M)
Pine       alveolar (F), labiodental (I,F), tongue backwards (M)
Its        str (F), tongue backwards (I), tongue low (I)
Is         alveolar (I), str (F), tongue backwards (I), tongue low (I)
No         bilabial (I), str (I), labiodental (I), tongue high (F), tongue low (F)
Yes        alveolar (I), str (I,F), tongue backward (M), tongue low (M)
Street     labiodental (I), bilabial (I), tongue forward (M), tongue low (M)

The noise hypotheses are rated based on how well the corresponding articulatory gesture scored. If the gesture seemed very plausible, the noise hypothesis is given less plausibility. If the gesture seemed implausible, the noise hypothesis is given a higher plausibility. Syllables which attempt to account for gestures located within the same boundary are made incompatible (or mutually exclusive). This is because it is impossible to speak two different syllables (or words) at the same time. Therefore, ARTREC will make them incompatible so that under no conditions will both syllable hypotheses be accepted into the composite. This is a beneficial addition to ARTREC, as incompatibility handling is very useful in solving problems using the Peirce strategy. If a syllable hypothesis is accepted during the hypothesis composition phase of abduction, any incompatible hypotheses are immediately ruled out from further consideration. Noise hypotheses are also incompatible with the syllable hypotheses that they are competing against. That is, if a noise hypothesis and a syllable hypothesis are attempting to explain the same


Figure 24: Syllable Boundaries

knowledge may pertain to only a portion of the syllable, and so each gesture is listed with a symbol indicating whether that rule-out knowledge should be applied to the initial section of the syllable (I), the middle section (M), or the final section (F). There are also noise hypotheses at this level. Some gestures may have been accepted at the previous level even though they are not true. These gestures may still have been accepted if there was evidence to support them. However, it would lessen the explanation's confidence if there were mistakes due to mistaking unintentional motions for articulatory gestures. To safeguard against this, noise hypotheses are offered at the syllable level. These hypotheses are used to explain single instances of gestures which do not seem coherent with other gestures in the area. For example, seeing both an alveolar closure and a labial closure in the same temporal proximity is not very likely, as one would not try to accomplish these two constrictions to utter one sound. Therefore, one is probably noise. However, since both gestures can be accepted at the previous level (because they would not compete to explain some motion), ARTREC will appeal to noise hypotheses to explain one of the two gestures.

Syllable Knowledge

The highest level of knowledge currently used in ARTREC is of syllables. We have chosen syllables as the primitive phonetic units, as opposed to phonemes or other units, because syllables seem more stable. Syllables are also the basic phonetic unit in Fujimura's C/D model, and so the choice was made for these reasons11. Since syllables are the largest units currently used in the system, the task for ARTREC is to generate the best set of syllables to explain the pellet data. In particular, the syllables will be used to explain the articulatory gestures. To infer syllables from gestures, ARTREC uses the same basic strategy as it does for inferring gestures from events: hypothesization, scoring and composition. Syllable hypotheses are generated based on cues, using gestures generated from the previous level. A future expansion of ARTREC will be to use a classification hierarchy of syllables and to classify possible syllables rather than to cue syllables from the gestures. Once syllable hypotheses are generated, they are scored using feature-based matching similar to how gestures are scored. Syllable hypotheses examine the area for appropriate gestures (necessary and expected gestures) and inappropriate gestures. The area of interest is the area denoted by the syllable boundaries, with a little overlap into the preceding and succeeding syllables. If the appropriate gestures appear and no inappropriate gestures appear within this region of interest, then a high score is assigned. This score is lessened if some wanted gestures do not appear and/or some unwanted gestures do appear. The gestures are reexamined to see what a hypothesis can explain. Syllable hypotheses are able to explain any gesture in the area that they expect. However, if a gesture is located well outside of the syllable's boundary, the syllable hypothesis is unable to explain it, because it would be impossible for the hypothesized syllable to have caused that gesture.
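The feature-based syllable scoring just described can be sketched as a simple penalty scheme: start from the best confidence, then lessen the score for each expected gesture that is missing and for each rule-out gesture that is present in the region of interest. The numeric penalties here are my assumptions; only the direction of the adjustments comes from the text.

```python
def score_syllable(expected, ruled_out, observed):
    """expected / ruled_out: sets of gesture names for this syllable type
    (cf. the recognition and rule-out knowledge tables);
    observed: gesture names found within the syllable's slightly
    widened boundary. Returns a confidence in 0..5 (5 = best)."""
    score = 5
    score -= len(expected - observed)        # wanted gestures missing
    score -= 2 * len(ruled_out & observed)   # unwanted gestures present
    return max(score, 0)
```

For instance, a "five" hypothesis that finds its labiodental and tongue-low gestures keeps a high score, while an intruding alveolar closure (rule-out knowledge for "five") pulls the score down sharply.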
Figure 24 indicates how syllable boundaries are located by examining mandible motion, and how syllable boundaries are used to determine which gestures might be part of a syllable and which gestures might be outside of that syllable's influence. Tables 4 and 5 display ARTREC's recognition knowledge for syllables. Table 4 indicates the "desired" or expected gestures that will appear in each syllable type. These are given as the gestures appearing in the (approximate) initial, middle, and final positions of the syllable. It should be noted that some syllables will have fewer than three features, and in such a case no middle gesture is given. ARTREC also looks for "undesired" gestures, which can be used as rule-out knowledge for a syllable hypothesis. This knowledge is shown in table 5. This rule-out

11. While we feel that phonemes are poor choices of phonetic units because of the linearity they impose, we are not trying to make any general statement about linear phonology in this case. We have decided to use syllables for convenience's sake.

future work will be able to adapt these templates during runtime so that the system can, in effect, "learn" new patterns of features. Tables 2 and 3 are two examples of gesture templates. They represent ARTREC's recognition knowledge for the gestures Final Alveolar Closure and Tongue Retroflection. The patterns are stored in order of confident identification, meaning that if the top row's pattern matches, then the hypothesis for "final alveolar closure" (or "tongue retroflection") is very confident (rated a 0, which stands for a very close match). If this top row does not match, then the next row is considered, and so on until no rows match. If none of the rows match, the hypothesis is automatically ruled out of consideration. Note that the values in tables 2 and 3 are not actual measurements, but are normalized values within a range of -25000 and 25000. A value of 25000 is the most extreme high value while -25000 is the most extreme low value. Also note that these two gesture templates make use of only absolute threshold values. More complex gestures make use of slopes and relative thresholds.

Table 2: Gesture Template for "final alveolar closure"

  Features        Threshold Values             Confidence
  TTY, TBY, TDX   > 20000, > 15000, > 15000    0
  TTY, TBY, TDX   > 18000, > 12000, > 12000    1
  TTY, TBY, TDX   > 15000, > 10000, > 10000    2
  TTY, TBY, TDX   > 15000, > 10000, > 8000     3
  TTY, TBY, TDX   > 12000, > 8000,  > 8000     4

Table 3: Gesture Template for "tongue retroflection"

  Features             Threshold Values                       Confidence
  TTY, TTX, TDX, TBX   > 23000, < -10000, > 10000, < -5000    0
  TTY, TTX, TDX, TBX   > 20000, < -6000,  > 8000,  < -5000    1
  TTY, TTX, TDX, TBX   > 18000, < -2000,  > 6000,  < -3000    2
  TTY, TTX, TDX, TBX   > 18000, < 0,      > 4000,  < 0        3
  TTY, TBX             > 17000, < 0                           4
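The ordered-row matching that these tables encode might be sketched as follows, using the Table 2 thresholds. The function names are ours, and the relative-threshold helper at the end is an illustrative assumption (the two tables above use only absolute thresholds).

```python
# Minimal sketch of the ordered-row template match for "final alveolar
# closure": rows are tried top to bottom; the first row whose thresholds
# are all exceeded yields that row's confidence (0 = best), and a
# hypothesis matching no row is ruled out. Values are in the normalized
# -25000..25000 units from the table.

FINAL_ALVEOLAR_CLOSURE = [
    # (TTY, TBY, TDX) minimum thresholds; confidence = row index
    (20000, 15000, 15000),
    (18000, 12000, 12000),
    (15000, 10000, 10000),
    (15000, 10000, 8000),
    (12000, 8000, 8000),
]

def match_template(template, tty, tby, tdx):
    for confidence, (t1, t2, t3) in enumerate(template):
        if tty > t1 and tby > t2 and tdx > t3:
            return confidence
    return None  # ruled out

def passes_relative(start_pos, end_pos, min_displacement):
    """Relative threshold (used by more complex gestures): the articulator
    must move at least min_displacement between two points of the motion."""
    return abs(start_pos - end_pos) >= min_displacement
```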

Noise hypotheses are used to explain events where gesture hypotheses seem implausible. That is, in cases where gesture hypotheses do not fit the templates well at all, noise hypotheses are also considered.


Figure 23: Abstract Gesture Template

data channels, rough slopes of the peaks and valleys and the space between peaks and valleys, rough durations, and the height or depth of peaks and valleys. Peaks, valleys and slopes are stored as threshold values that must be surpassed. Both absolute thresholds and relative thresholds are used. Absolute thresholds state that the motion must surpass a value referring to a position in the mouth (for example, labial closure means that the lower lip must reach the upper lip, and therefore an absolute threshold is used). Relative thresholds state that a motion must move a certain amount relative to another motion (for example, in /str/, the tongue tip must slide from a forward location in the mouth to a location behind the alveolar ridge; this requires a motion with a relative distance). Gesture templates were written during a process of examination of pellet data while the ARTREC system was being written. These templates are stable although

Table 1: Gesture Types

  Gesture                    Type             Features
  Bilabial Closure           /p/              LLX (HP), LLY (P)
  Labiodental Closure        /v/, /f/         LLY (HP), and upward slope > 1
  Alveolar Closure           /n/, /t/         TTY (HP)
  Final Alveolar Closure     /n/, /t/         TTY (MP to HP), TBY (HP)
  Alveolar Partial Closure   /s/              TTY (P), TTX (HP), TBX (P)
  Alveolar Glide             /y/              TTY (P), TBY (P), TDY (P)
  Tongue Retroflection       /r/              TTY (VHP), TTX, TBX (LV)
  Tongue Raising             /s/, /n/, /t/    TTY (P), TBY (P)
  Tongue Lowering            /ay/             TTY (V)
  Tongue Forward             /i/, E10         TTX (HP), TBX (P)
  Tongue Backward            /ow/             TTX (VLV), TBX (V), TDX (V)
  Closure-Retroflection      /str/            Alv. Part. Clos. + Tongue Retroflec.

Several articulators can be used in conjunction to create a gesture. Gestures can persist in duration from a single motion (such as labial closure8) or through many motions (such as a fricative-alveolar-glide motion9). ARTREC attempts to explain the pellet data events in terms of articulatory gestures. The process for this is one of hypothesization, scoring and composition. Table 1 indicates the features sought for each gesture currently present in ARTREC. In this table, P stands for peak, HP for high peak, VHP for very high peak, MP for mid peak, V for valley, LV for low valley and VLV for very low valley. The "type" refers to the type of utterance (phoneme or syllable) that the gesture is typically found in. Figure 23 shows an abstract gesture template. Gesture templates are stored as RAs. The output of a gesture template is the confidence that ARTREC has that a given set of motions is actually the gesture in question. Hypothesization of gestures is accomplished by a cueing process. A peak or valley (a pellet data event) in one channel is caused by a gesture. ARTREC uses knowledge of which gestures cause which motions and hypothesizes the possible gestures that could be responsible for the event being examined. These gestures are then scored, based on local match knowledge. This knowledge is based on a feature template, where features are types of motions in various channels. The gesture score is based on how closely the data matches the gesture template. The types of motions that make up features are peaks and valleys of the various
8. Labial closure occurs when the lower lip touches the upper lip.

9. A fricative-alveolar-glide motion is responsible for the sound /str/ as in the word "street".

Event Knowledge

Events are motions of interest in the pellet data. The events are the data items that we want explained. Consonants are formed by constrictions in the vocal-tract. Vowels are formed by lengthy releases of air while the vocal-tract is relatively stable. In examining pellet data, we have found that both consonants and vowels are typically indicated by peaks and valleys in the pellet data. Therefore, we restrict our explanation of pellet data to only the events of interest, peaks and valleys. A simple algorithm follows each pellet and generates a list of peaks and valleys. A smoothing algorithm is used so that slight bumps or depressions around peaks and valleys are not generated. That is, if a peak is found right next to another, lesser peak, the lesser peak is ignored. The channels used in ARTREC refer to the following articulators:
• Tongue Tip in the X and Y direction
• Tongue Blade in the X and Y direction
• Tongue Dorsum in the X and Y direction
• Lower Lip in the X and Y direction
• Mandible Incisor in the Y direction

X motions refer to the articulator moving forward and backward in the mouth. A peak in the X direction is the articulator moving forward in the mouth to an extreme point, and then moving backward. A peak in the lower lip's X channel refers to the lips pursing. A valley in the X direction is the articulator moving backward in the mouth to an extreme and then moving forward again. Y motions refer to the articulator moving up and down in the mouth. A peak in the Y direction refers to the articulator moving up in the mouth, reaching an extreme and then moving downward. For the tongue tip, a peak is usually the tongue tip touching the alveolar ridge, or coming close to it. A peak in the lower lip is usually the lower lip touching the upper teeth or the upper lip. A valley in the Y direction refers to the articulator moving downward, reaching an extreme and moving upward. A valley in the tongue tip means that the tongue has moved low in the mouth. A valley in the lower lip or mandible incisor means that the mouth is opening. These event types are used to generate possible articulatory gesture hypotheses.
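The peak-and-valley extraction described above, including the smoothing step that drops a lesser extremum next to a larger one, might be sketched like this. The sample format and window size are illustrative assumptions.

```python
# Rough sketch of event generation: walk one channel, mark local
# maxima/minima, then smooth by ignoring a lesser peak (or valley) that
# sits within a few samples of a larger one of the same kind.

def find_events(samples, window=2):
    events = []
    for i in range(1, len(samples) - 1):
        if samples[i - 1] < samples[i] > samples[i + 1]:
            events.append(("peak", i, samples[i]))
        elif samples[i - 1] > samples[i] < samples[i + 1]:
            events.append(("valley", i, samples[i]))
    kept = []
    for ev in events:
        kind, i, v = ev
        rivals = [e for e in events
                  if e[0] == kind and e is not ev and abs(e[1] - i) <= window]
        # drop this extremum if a nearby rival of the same kind is larger
        if any((r[2] > v if kind == "peak" else r[2] < v) for r in rivals):
            continue
        kept.append(ev)
    return kept
```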
Gesture Knowledge

Articulatory gestures are qualitative descriptions of vocal-tract motions. They are groupings of events which, together, perform some useful act toward an utterance.

Note that we do not need to automatically accept confirmed hypotheses if we feel our local match knowledge might lead us astray, but in ARTREC's second level, confirmed hypotheses are automatically included. Ruled-out hypotheses receive the lowest possible rating; they are hypotheses that either found none of the expected features among the data (or findings) being examined, or found data or findings which were impossible for that hypothesis. Hypotheses which are rated as ruled-out are dismissed from any future consideration.

The 5-valued set uses the integers 0, 1, 2, 3, and 4. These values are assigned to gesture hypotheses depending on how closely the hypothesis matches its template. A 0 is a very close match, close enough to be convincing that the gesture did actually occur (like confirmeds above). A 4 is a very low score, similar to an "unlikely" rating, and is given to any gesture which barely matched the template. Intermediate scores of 1, 2, and 3 are given to hypotheses which matched to some intermediate degree. Any potential gesture hypothesis which did not match even with a score of 4 is immediately discarded as "ruled-out". A 5-valued set seems sufficient, at least at present, for the gesture plausibilities. Typically, an abducer will only want to accept highly rated hypotheses, such as any hypothesis given a rating higher than somewhat-likely. However, in the case of ARTREC, neutral hypotheses can be accepted as "weak bests", and somewhat-unlikely hypotheses can also be accepted under some circumstances. It should be noted, though, that in accepting low-rated hypotheses, our confidence in the composite is weakened.
4.3.5 Types of Articulatory Knowledge

ARTREC models certain types of knowledge to accomplish the task of articulatory recognition. These kinds of knowledge are articulatory events, articulatory gestures, syllables, and prosodic information.

Events are the motions that we wish to explain. Events are articulator movements. We only wish to explain extrema (peaks and valleys).

Gestures are qualitative descriptions of vocal-tract motions. Gestures are groupings of events, all located in some close proximity (in time), and are interrelated.

Syllables are "phonetic primitives". Syllables are composed of groups of gestures (one or more gestures). Syllables can then be composed into words.

Corrective emphasis is the single form of prosody currently being recognized. Corrective emphasis appears as exaggerated mandible (jaw) motion. This form of emphasis (and other forms of emphasis) can be detected by examining the jaw motions for large valleys.

4.3.4 Confidences

ARTREC uses what are known as the 9-valued confidence set and the 5-valued confidence set. These are the vocabularies that ARTREC uses to indicate how plausible a hypothesis is. These plausibilities can be thought of as qualitative probabilities. They are not detailed numerical values derived from Bayesian equations or statistical matching methods. They are, instead, derived based on how closely the hypothesis matches the situation at hand, but given as abstract statements of confidence (such as "likely" or "unlikely"). A tool known as an RA (for Recognition Agent) is used to capture local match knowledge. An RA is designed for each concept or hypothesis that the system uses. When faced with a decision about the relevancy (or plausibility) of a hypothesis, the RA for that hypothesis is invoked. The RA returns a confidence value. For instance, an RA for the hypothesis "apical closure" will examine the events in the area and make a determination of how likely "apical closure" is. An abstract RA can be found in figure 22. In the next section, RA information from ARTREC will be given.

Figure 22: Recognition Agent

The 9-valued set uses the values: confirmed, very-likely, likely, somewhat-likely, neutral, somewhat-unlikely, unlikely, very-unlikely and ruled-out. Confirmed hypotheses receive the highest possible rating and are considered hypotheses which match so well that they should automatically be placed in the composite hypothesis.

Expectations can take on two forms: positive expectations (i.e., those hypotheses we expect to be true) and negative expectations (i.e., those hypotheses we expect to be false). Currently, ARTREC only uses positive expectations. These expectations are then used to clarify the middle level. It is possible that several gesture hypotheses were neither believed (accepted) nor disbelieved (ruled out); the top-down expectations can help further determine their status. Expected gestures should be more believed, while unexpected gestures should be less believed. After propagating these expectations downward, the first-level abducer can be rerun to reexamine the gesture hypotheses and the events in the light of the conclusions already reached. After coming to a new conclusion at the middle level, the entire composite is passed up to the next level, and syllables are rehypothesized to account for the newly accepted composite. The higher level is then rerun. Hopefully, this additional processing will be able to further the explanation. It should be noted that other forms of top-down processing can occur. Currently, the only form is what has been described here. However, another form of top-down processing uses hard decisions. If a choice for explaining some gestures is difficult because only a couple of syllables are suggested, but they are equally (or nearly equally) plausible, then differentiation knowledge can be passed down to the lower levels. This form of processing has been discussed in our research group but not yet implemented. If the necessity arises for this additional form of top-down processing, it will be fairly simple to implement.
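The positive-expectation pass described above might be sketched as follows. The data shapes, the syllable-to-gesture expectation map, and the size of the adjustment are illustrative assumptions; the rating scale (0 best to 4 worst) follows the 5-valued set.

```python
# Sketch of top-down expectation propagation: gestures predicted by an
# accepted syllable get their plausibility raised (score lowered, since
# 0 is best) before the lower-level abducer is rerun.

def propagate_expectations(accepted_syllables, gesture_hyps, expects):
    """expects: syllable name -> gesture names it predicts."""
    expected = set()
    for syl in accepted_syllables:
        expected.update(expects.get(syl, ()))
    for g in gesture_hyps:
        if g["name"] in expected and g["score"] > 0:
            g["score"] -= 1  # expected gestures become more believed
    return gesture_hyps
```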



The output of ARTREC will be a sequence of syllables (words) and ARTREC's choice for the emphasized word. Each syllable (word) will be accompanied by its temporal location within the utterance (i.e., the approximate time that the syllable was uttered). Also, one nice feature of Peirce's strategy is that of maintaining the acceptance criteria (or justification) of an accepted hypothesis. Each syllable is displayed with its plausibility and its justification (confirmed, essential, clear best or weak best). This allows us to examine the output of ARTREC and determine not only where ARTREC erred, but also why ARTREC erred. If ARTREC makes errors when accepting essentials or confirmeds, this alerts us to the fact that ARTREC is not correctly judging (scoring) hypotheses7. A detailed example of ARTREC, demonstrating the bottom-up processing described here, is given later in this chapter.
7. Too many incorrect essentials led us to implement noise hypotheses.

gesture to explain some event. Noise hypotheses are scored based on how many gestures present in one area seem implausible. And again, noise hypotheses are made to be incompatible with syllable hypotheses which are attempting to explain the same gestures.
High-Level Abduction

The second-level abducer is now run to generate a best explanation, in terms of syllables, for the gestures. Again, the abducer will run through a cycle going from very confident local abductions to less confident ones. In this abducer, the first step is to look for confirmed hypotheses. Confirmed hypotheses are those hypotheses which were rated with the highest possible plausibility in local scoring. The plausibility of "confirmed" is only given if a hypothesis finds all of the features it is seeking and none of the impossible features. After seeking confirmeds, the abducer goes on to find essentials, clear bests and weak bests, propagating the effects of any believed hypotheses and starting the cycle over with confirmeds whenever new hypotheses are added to the growing composite. The explanation generated from this abducer represents the set of syllables, in order, which ARTREC has determined are the causes of the original motions in the pellet data. Currently, ARTREC only has monosyllabic words. Therefore, once syllables have been generated, there is no need to continue through higher levels of abductive problem solving. Words are equivalent to syllables.
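The acceptance cycle described above (confirmeds, then essentials, then bests, restarting after each acceptance) can be sketched abstractly as below. This is a toy illustration: "clear best" versus "weak best" is collapsed into a single best-candidate step, and the hypothesis format is our invention, not the actual Peirce machinery.

```python
# Toy sketch of the abducer's acceptance cycle. Hypotheses are given as
# name -> (score 0..4, set of findings the hypothesis can explain).

def pick_confirmed(hyps, remaining):
    # a confirmed hypothesis (score 0) that still explains something
    for name, (score, covers) in hyps.items():
        if score == 0 and covers & remaining:
            return name
    return None

def pick_essential(hyps, remaining):
    # a hypothesis that is the sole explainer of some unexplained finding
    for f in sorted(remaining):
        explainers = [n for n, (s, c) in hyps.items() if f in c]
        if len(explainers) == 1:
            return explainers[0]
    return None

def pick_best(hyps, remaining):
    # the best-rated hypothesis that still explains something
    candidates = sorted((s, n) for n, (s, c) in hyps.items() if c & remaining)
    return candidates[0][1] if candidates else None

def abduce(hyps, findings):
    accepted, remaining = [], set(findings)
    changed = True
    while changed and remaining:
        changed = False
        for picker in (pick_confirmed, pick_essential, pick_best):
            h = picker(hyps, remaining)
            if h is not None:
                accepted.append(h)
                remaining -= hyps[h][1]  # propagate: findings now explained
                changed = True
                break                    # restart the cycle from confirmeds
    return accepted, remaining
```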
Emphasis Detection

The last step in the bottom-up processing is to explain the jaw motion (mandible motion). This is accomplished in two phases. First, jaw openings are considered as syllables (or words). After finding the syllable boundaries from the earlier phase, this knowledge is set aside until actual syllables have been identified. The best explanation from ARTREC explains both the pellet data thus far examined (i.e., tongue motions, labial motions) and the jaw motion. However, some of the jaw motion might be exaggerated. The most exaggerated motion will probably be due to emphasis. To explain this exaggeration, ARTREC uses an emphasis hypothesis.
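Picking out the most exaggerated jaw motion could be sketched as below. The data layout and the "notably deeper than the others" factor are assumptions for illustration, not ARTREC's actual criterion.

```python
# Minimal sketch of emphasis detection: among the jaw (mandible Y)
# valleys aligned with accepted syllables, the most exaggerated opening
# is explained by an emphasis hypothesis.

def find_emphasis(syllables, factor=1.3):
    """syllables: list of (word, jaw_valley_value); more negative = wider
    opening. Returns the emphasized word, or None if no opening stands out."""
    depths = {w: -d for w, d in syllables}  # opening magnitude
    word = max(depths, key=depths.get)
    others = [v for w, v in depths.items() if w != word]
    mean = sum(others) / len(others)
    return word if depths[word] >= factor * mean else None
```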
Top-Down Processing

Top-down processing may now commence. Top-down processing will use the \islands of certainty" generated by the bottom-up processing to drive expectations to the lower levels. This happens as follows. Given an accepted syllable (word) hypothesis at the top level, expectations of gestures are generated. These expectations are passed down to the articulatory gestures level.

The next step in the process is to explain the articulatory gestures. The system now has a set of semi-plausible and very plausible articulatory gestures to explain. The noise hypotheses will not need explaining, as they have already been explained (as noise, unintentional motions). As the first level used articulatory events to generate hypotheses, the second level will use articulatory gestures to generate syllable hypotheses. The system uses knowledge of which syllables can cause which gestures. This information is straightforward, as a syllable is composed of some number of gestures. To utter the syllable requires using most or all of those gestures. Given the set of gestures, ARTREC generates the list of syllables which could have caused those gestures. Currently in ARTREC, the set of syllables is very small, reflecting the small lexicon (in actuality, the syllables are the words of the system, since the lexicon is composed only of monosyllabic words). Once syllable hypotheses have been generated, they are scored in terms of plausibility in a similar fashion as the gestures. Syllable hypotheses examine the gestures in the area and generate a plausibility based on the presence and absence of gestures in the area. A syllable will require certain gestures (known as necessary gestures), expect certain gestures (known as expected gestures) and expect to not see other gestures (known as impossible gestures, which means that if such a gesture appears, it is impossible or unlikely for the syllable to have occurred). Plausibility scores are based on locating or not locating these types of gestures. Naturally, if the necessary and expected gestures appear and the impossible gestures are not found, a syllable hypothesis is given a high plausibility. If the necessary gestures are not found or impossible gestures appear, then a hypothesis is given a low score (or ruled out).
Finally, if some combination of necessary gestures appears, but not all of them, or some impossible gestures appear even though all of the necessary gestures are found, then the plausibility is set to a neutral, or middle, value. This indicates that, based on the accepted gesture hypotheses, the analysis was very uncertain. After scoring the syllable hypotheses, each hypothesis considers the gestures and decides which gestures it could potentially explain. A syllable hypothesis can explain either a necessary gesture or an expected gesture which is found in the region. Further, a syllable hypothesis cannot explain a gesture which is found too far away, one which seems to be within an adjacent syllable. It should be noted that syllables can overlap each other to some extent, in that one syllable may attempt to explain a gesture located within the boundaries of a bordering syllable, but only if that gesture will not interfere with the pronunciation of that bordering syllable.
Additional Noise Hypothesization

Again, noise hypotheses are generated. These hypotheses are used to explain articulatory gestures which seem out of place. Such gestures occur either as random (unintentional) motions, or because the lower level mistakenly accepted an incorrect

examined to generate gesture hypotheses, another type of hypothesis is created when necessary: a noise hypothesis. Noise hypotheses can be thought of as alternative explainers for events which do not seem intentional. However, determining whether a motion was intended or unintended is difficult, at least when examining individual motions. It is easier to detect whether a motion was intentional when examining many motions. Fortunately, the very nature of the Peirce strategy encourages this behavior, making it easy to implement. An event which does not closely match the idealized gesture template can be considered as noise, and a noise hypothesis is assigned. The noise hypothesis is given a plausibility depending upon how far the match was from the gesture template. Events which closely match templates will generate implausible noise hypotheses, while events which do not closely match templates will generate very plausible noise hypotheses. A noise hypothesis will only account for a single event, which is substantially different from a gesture hypothesis, which could potentially explain many events. Also, noise hypotheses are made to be incompatible with gesture hypotheses which attempt to explain the same articulatory event.
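The inverse relation just described (poor template match means plausible noise) might be captured as simply as this. The 0-to-4 scale follows the 5-valued set; the particular mapping is our assumption.

```python
# Illustration: the worse an event fits its best gesture template, the
# more plausible the noise hypothesis covering that event.

def noise_plausibility(best_gesture_score):
    """best_gesture_score: 0 (close template match) .. 4 (poor match),
    or None when every template was ruled out."""
    if best_gesture_score is None:
        return 0  # highly plausible noise
    return 4 - best_gesture_score
```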
Lower-Level Abduction

The next step in the process is to explain the events in terms of gesture and noise hypotheses. The abducer generated from Peirce will do this. As described in the previous chapter, the abducer will run through cycles of explanation going from the most certain and easiest explanations to less certain and harder explanations. The cycle will seek out essential hypotheses, followed by clear best hypotheses and finally weak best hypotheses. If hypotheses are accepted as believed at any time, the effects of believing the hypotheses are propagated and the cycle starts again as a search for essentials. One nice feature of Peirce is that the system builder has the ability to adjust the degree of confidence versus explanatory coverage. That is, if we would like to generate a more confident answer, the result would most likely be less coverage, whereas a more fully explained set of data would generate some less confident conclusions. This ability of Peirce was used in an experiment to determine how accurate ARTREC could be. This variation is discussed in the next chapter.
Syllable Hypothesization

After the first abductive explanation has been generated, the system has composed a set of plausible gesture hypotheses (and noise hypotheses) to account for the pellet data motions. This is obviously an insufficient level of detail for us. We would prefer to recognize words out of the vocal-tract motion instead of gestures. Therefore, we need additional inference.

Word Boundary Location

After generating the events, ARTREC examines the mandible incisor events. ARTREC tries to find opening/closing motions in this channel (the mandible Y direction). ARTREC infers that each opening occurs at the beginning of a new syllable6. ARTREC will then attempt to explain these syllable locations in terms of syllable hypotheses generated later.
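The opening-detection step could be sketched as follows. The minimum drop that counts as an opening, and the sample format, are illustrative assumptions.

```python
# Rough sketch of boundary location from the mandible-Y channel: each
# opening (the jaw starting a sustained downward move) is taken as the
# start of a new syllable.

def syllable_starts(jaw_y, drop=1000):
    """jaw_y: list of mandible-incisor Y samples. Returns the indices
    where the jaw begins to open."""
    starts = []
    falling = False
    for i in range(1, len(jaw_y)):
        if jaw_y[i] <= jaw_y[i - 1] - drop:
            if not falling:
                starts.append(i - 1)
            falling = True
        elif jaw_y[i] > jaw_y[i - 1]:
            falling = False
    return starts
```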
Articulatory Gesture Hypothesization

Given the events (extrema) found in the pellet data, the system must then explain these individual motions in terms of the articulatory actions of which the events are a part. Articulatory gestures are used to explain events. Articulatory gestures are qualitative descriptions of vocal-tract events. Such gestures include apical closure, dorsal closure, tongue lowering, labial closure, and so forth. Articulatory gestures can be thought of as primitive motions required for speech sounds. These motions occur within words or syllables and can be described as a set of features along parallel channels. Gestures will be described in fuller detail later in this chapter. Articulatory gestures are hypothesized based on the needs of the data. Each event is examined, and a list of possible causes (gestures) is generated. Each gesture hypothesis is assigned a plausibility (confidence score) by local match knowledge. This score is based on how close the gesture comes to that gesture's feature template. After assigning plausibility scores to the gesture hypotheses, each hypothesis examines the event data and determines exactly which events it would be able to explain. This entails finding peaks or valleys that the gesture would typically expect. The events would be located nearby in time to the temporal location of the gesture. However, a gesture can also attempt to explain nearby events which might have resulted from the influence of the gesture. Therefore, a gesture hypothesis will potentially be able to explain many events in the temporal region around the gesture hypothesis.
Noise Hypothesization

Because some motions are not created by the utterance of a sound, but by accident or due to non-linguistic effects, the system needs a way to determine whether an event should be accounted for by a gesture (as a meaningful vocal-tract motion) or as noise (some random or unintentional motion). Therefore, while events are being
6. It is not strictly true that each new syllable is reflected by a jaw opening in general, but it is suitable for the monosyllabic words in our current lexicon. In fact, each word of our current lexicon is noted by jaw motion, and therefore this is a very stable feature. When ARTREC is expanded, we will need to determine whether this feature still holds true.


Figure 21: ARTREC's flow of processing

direction, although sometimes as peaks in the X direction and sometimes as valleys). Vowel-like sounds are typically realized by steady positions of all pellets. However, these will also appear as peaks or valleys, as the vowel sounds are surrounded by other motions, so that the vowel sounds appear at one end of the extremes while the consonantal sounds appear at the other end. Because of this, we need only explain the extreme motions found in the pellet data5. The motions in between are the articulators moving from a peak to a valley or from a valley to a peak, and can be explained as the transition from one sound to the next.

5. It should be stated that this strategy of only explaining peaks and valleys is sufficient for the Pine Street data. We may have to reconsider this strategy when we move on to new sets of data.

pellets throughout the utterance.
4.3.2 Lexicon for Pine Street

The lexicon for ARTREC is currently very small: 9 monosyllabic words. The number of articulatory motions is limited by the words in the lexicon. The only form of prosodic information currently sought is corrective emphasis. The small lexicon size reflects the limited number of words in the Pine Street data. Since we have only tuned ARTREC for the Pine Street data, only those words currently appear in ARTREC's lexicon. Those words are "is", "it", "five", "nine", "pine", "street", "yes", "no", and "its". While this makes the problem easier, this set of words is ideal for an initial implementation because the words are similar in many respects, especially considering that the only source of input is four or five points in the vocal-tract. The words "five", "nine" and "pine" are similar in that they all contain the vowel /ay/. "Nine" and "pine" are extremely similar, differing only in the initial sound (initial alveolar closure versus initial bilabial closure). "Is", "it" and "its" are extremely similar and gave ARTREC a very hard time in differentiating between them.
4.3.3 Detailed Description of the Flow of Processing

ARTREC's flow of processing is described in detail here. Figure 21 shows ARTREC's overall flow of processing. Each step is described in its own subsection.
Event Generation

ARTREC first inputs the pellet data. Each pellet's data is stored in two channels of information, one representing the X direction and one representing the Y direction. Some of the channels have different sampling rates (i.e., a different number of pellet position values are taken for different channels). All channels are made to have the same number of scans by extrapolating points in the channels where fewer scans were available. ARTREC then considers each channel individually. ARTREC follows the data, detecting peaks and valleys. Some smoothing is done so that "bumps" are ignored4. The peaks and valleys found are then considered the vocal-tract "events" to be explained. We could attempt to explain all of the data; however, it is sufficient to explain only the peaks and valleys. Consonants are created by some form of constriction in the vocal-tract. These will appear as peaks or valleys (usually as peaks in the Y

4. A bump can be considered a location where a peak occurs next to a larger peak, or a valley occurs next to a larger valley (within 10 or 20 milliseconds of each other). Only the larger peak or valley is identified as an event; the lesser peaks or valleys are ignored as unimportant.
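The channel-length normalization mentioned above (bringing every channel to the same number of scans) might be sketched as linear interpolation between known points; the thesis says "extrapolating", but for interior points interpolation is the natural reading, so this sketch is an assumption.

```python
# Sketch of channel-length normalization: resample a channel to a target
# number of scans by linear interpolation between its known samples.

def resample(channel, target_len):
    if len(channel) == target_len:
        return list(channel)
    out = []
    step = (len(channel) - 1) / (target_len - 1)
    for i in range(target_len):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(channel) - 1)
        frac = pos - lo
        out.append(channel[lo] * (1 - frac) + channel[hi] * frac)
    return out
```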


Figure 20: Task Description of ARTREC

acquired new pellet data for initial enhancements. In our current system, we make use of 4 or 5 pellets representing the tongue tip, tongue blade, tongue dorsum, lower lip and mandible incisor (which can be considered as a jaw pellet)3. No other input is used such as acoustic signal, voicing, nasality (which we could only get through a velum pellet), vocal-tract size, positioning of pellets on the articulators, or speaker-dependent information (sex, dialect, age). Instead, the input to the system consists solely of the motion for the four or ve
3. The data we have consists of 5 pellets for 3 speakers and 4 pellets for 2 speakers. In the cases where only 4 pellets were used, no tongue blade pellet was used.

4.3 ARTREC: A Knowledge-based System for Articulatory Recognition of Speech

Using the Peirce tool, we have constructed a multilayered abduction system for the task of articulatory recognition called ARTREC (ART for articulatory, REC for recognition). The overall task is described as follows:
The Input: Vocal-tract motions.

The Output: An English sentence consisting of ARTREC's best explanation for the input. Included in this output is ARTREC's confidence in each word hypothesis and, if possible, identification of the word which is (most) emphasized. ARTREC's task description can be found in figure 20.




The initial set of data run on ARTREC has been dubbed the "Pine Street" data. This is because all of the data is of the form: "Is it five five nine pine street?" "No, its five NINE nine pine street." The data contains five sets of speaker data. Each set consists of approximately 80 files, each containing an utterance like the one above. All combinations of fives and nines were used (i.e., combinations are 559, 599, 999, 955, 959, and so forth). Both positive and negative answers were given (a positive answer being "Yes, its five five nine pine street"). In the case of negative answers, one of the numbers was emphasized. This is known as corrective emphasis. In correcting someone's error, we tend to emphasize a word or words in order to indicate where the speaker erred. And so, in the above example utterance, the first "nine" in the reply is being emphasized to correct the speaker who thought the number was "five". The reason for accumulating this data was to determine what effects corrective emphasis might have on the articulation of the words. This data was perfectly suitable for initial attempts at articulatory recognition for the following reasons. First, we wanted a small but tricky set of words. Second, since we were not using any acoustic signal in our system, we made the problem hard but achievable. Third, while there are only 9 words in the lexicon, there is a sufficient number of articulatory gestures present that have interesting features. This was a sufficient number to adequately test the layered abduction mechanism. We envision enhancements, and have already


Figure 19: Microbeam Pellet Data from the Pine Street data set

Y direction represent top-to-bottom motions in the mouth. Each pellet is placed approximately in the mid-sagittal plane, except for the reference pellet placed on one of the lower molar teeth. There are only two Microbeam facilities in the world: the first one constructed is in Japan; the other is located at the University of Wisconsin in Madison. The data we use for ARTREC has come from Madison. The Microbeam facility in Madison does not use a velum pellet, and none of our current data has a voicing channel².
² Although we can obtain voicing by analyzing the acoustic signal that comes with the pellet data, for the time being we do not make use of the acoustic channel at all, and so we make do without any kind of voicing information.

sample points of the subject's vocal-tract during the course of a single utterance. Because this technique is limited in its abilities, only certain locations of the vocal-tract can be measured. Motions of interest in the vocal-tract are primarily limited to the following:

- Tongue Motion: tongue motion consists of between two and four somewhat independent regions of the tongue, including the tongue tip, the tongue blade and the tongue dorsum. In the case of two pellets, the tongue blade pellet might be omitted. In the case of four pellets, two tongue blade pellets might be used.

- Labial Motion: labial motion is for the lower lip, or both lower lip and upper lip.

- Mandible Motion: the mandible incisor is one of the front teeth. This motion defines the motion of the jaw as the incisor moves with the lower jaw. Another pellet can be placed on the mandible molar so that a complete picture of jaw motion can be derived from the two points.

- Voicing: voicing is caused by vocal fold vibration, which occurs when the folds' positions bring them close to each other. For the sake of brevity, the vocal fold vibration (in the glottis) will be considered as voicing rather than the acoustic consequence of this vibration (which is also called voicing; i.e. voicing is a term which is identified in the acoustic signal as well as the act which creates the voicing). To acquire voicing data, EGG readings can be taken, or one can detect voicing from the acoustic signal.

- Velar: velar motions occur behind the nasal cavity and are responsible for nasal sounds (e.g. /n/). These motions can be x-rayed by placing a pellet on the velum, reached through the nose.

These articulators are the (primary) elements of the vocal-tract responsible for shaping sounds. The combined motions, along with releases of air, cause sounds which are then captured in acoustic signals. The output of the Microbeam facility is not a complete examination of the vocal-tract, but rather sample points within the vocal-tract (between four and ten) showing the vocal-tract motion across time. An example of the pellet data for an utterance can be found in figure 19. You will note that the pellets are listed in both X and Y directions. LL stands for lower lip, TT for tongue tip, TB for tongue blade, TD for tongue dorsum and MAN I for mandible incisor. The x-ray data is stored as a collection of coordinates. Each articulator (i.e. pellet position) is stored as two separate files, one for the X direction and one for the Y direction. The X-Y coordinate plane can be thought of as running across a profile of a face. There is no data pertaining to the width of the face. Motions in the X direction represent front-to-back motions in the mouth while motions in the

a large multi-layered system. Articulatory recognition is a perfect problem to try to implement as layered abduction because it is rich in interacting hypotheses and explanatory knowledge is readily available. Also, articulatory recognition offers the same "layered-ness" as acoustic speech recognition. More important, as motivation, is the attempt to aid new speech science research. Osamu Fujimura is attempting to develop a new, non-linear¹ theory of speech production. This Convertor/Distributor model [Fujimura and Wilhelms, 1991] is a prime motivation for our research. It has been hoped that the model of a layered abduction system can incorporate the C/D model as its basic outline or backbone. We can use such a system to "debug" the C/D model and the C/D model to help construct the system. In this way, both sets of research can symbiotically aid each other. A discussion of the C/D model and how abduction might aid the C/D model is presented in chapter 7.4.
4.2.2 Articulatory Input: Microbeam Pellet Data

The best lip readers only gain access to two or three channels of articulatory information, namely the lip motions, jaw motion, and glimpses of the tongue. This is not sufficient for articulatory recognition without other information or clues (such as syntax, discourse, or semantic understanding). Fortunately, there is other, more complete data available in the form of Microbeam pellet data. X-ray Microbeam pellet data [Westbury and Fujimura, 1989] is obtained by the following process. The subject speaker has small gold pellets (2-3 millimeters in diameter) placed on various articulators in the vocal-tract. These pellets are the targets to be x-rayed. The Microbeam machine then x-rays the subject's vocal-tract while the subject speaks. The result is a digital record of vocal-tract motions. Since x-rays are dangerous, a clever tracking algorithm is used by the Microbeam machine so that it does not randomly beam x-rays at the subject but rather searches for each pellet based on its last location and its velocity. This minimizes the subject's exposure time to the x-rays. The Microbeam machine will scan the regions where the pellets were last scanned, taking x-rays at these locations. The scanning rates for the pellets differ depending upon each pellet's location within the vocal-tract. For example, since the tongue tip is very active, the Microbeam will scan the tongue tip more often than the other articulators such as the jaw or lips. Typically, tens of thousands of brief (10 microsecond) exposures are taken for each
¹ The term "non-linear" refers to the aspect of the theory in which speech is not caused by a linear concatenation of phonemes. Rather, speech is created by independent motions of the speech organs interacting to create parallel strings of features. These features are the components or primitives of speech. Most past theories of speech production have used phonemes as the primitive units, and have concatenated them together to represent a string of output. Even syllable concatenation has flaws because it still restricts speech to a linear ordering of parts.
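The tracking scheme described above, in which the beam is aimed only where a pellet is expected based on its last position and velocity, can be sketched as a simple constant-velocity predictor. The class and method names below are illustrative assumptions for exposition, not the actual Microbeam software.

```python
# Sketch of predictive pellet tracking: the beam is aimed at the
# pellet's predicted position rather than scanning the whole field,
# minimizing the subject's x-ray exposure. Names are illustrative.

class PelletTracker:
    def __init__(self, x, y):
        self.x, self.y = float(x), float(y)   # last known position
        self.vx, self.vy = 0.0, 0.0           # estimated velocity per scan

    def predict(self):
        """Where to aim the next brief x-ray exposure."""
        return (self.x + self.vx, self.y + self.vy)

    def update(self, x, y):
        """Pellet found at (x, y): refresh position and velocity."""
        self.vx, self.vy = x - self.x, y - self.y
        self.x, self.y = float(x), float(y)

tracker = PelletTracker(10.0, 5.0)
tracker.update(12.0, 5.5)    # pellet moved; velocity estimate is now (2.0, 0.5)
aim = tracker.predict()      # (14.0, 6.0): aim point for the next exposure
```

A fast articulator such as the tongue tip would simply be updated (and therefore predicted) more often than a slow one such as the jaw.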

of linguistic and non-linguistic factors which affect the production (or articulation) of speech. Because prosody affects the way we speak, it affects the acoustic signal. Prosodic effects arise out of "real-world" situations such as speaking quickly because you are excited, speaking with a cold, speaking with emphasis on some of the words (corrective emphasis or teaching, for instance), speaking loudly to be heard, speaking under the influence of alcohol, or being bored or tired. All of these non-linguistic "conditions" affect the way we articulate, which results in significant changes to the acoustic signal.

One example of prosodic control causing problems in speech recognition is due to emphasis. It has long been a problem with speech recognition systems that emphasis and stress will alter the spectral characteristics of sounds, making them difficult to recognize unless the emphasis is somehow captured in recognition knowledge [Waibel, 1990]. However, it seems unreasonable to maintain a second set of "sound templates" to be used for emphasized words. In fact, emphasis comes in varying degrees and manners, and therefore multiple sets of emphasized "sound templates" would be required. This would complicate any speech recognition system's task. To understand how prosody can affect the acoustic signal, we must see how these situations (such as emphasis, speaking quickly, etc.) affect articulation. One aspect of our research is to not only recognize lexical items from the articulatory input, but to also attempt to recognize when and where prosodic effects come into play. We have several ideas of how prosody will affect articulation, and in order to fully determine these, implementing prosodic recognition via articulation is a good starting place.

Articulation is the process of creating sound through motions of the vocal-tract. There are dependencies which arise in articulation that cause substantial changes to the acoustic signal.
In the past, these dependencies have been ignored except by implementing some form of coarticulation rules. However, these rules have never been sufficient because they are ad hoc: the rules are designed to fix the problems that arise due to insufficient modeling of articulation [Lowerre and Reddy, 1980, Erman, 1980, Fujimura and Lovins, 1978]. Because articulatory dependencies are poorly understood, it is difficult to know what knowledge is appropriate for recognition [Fujimura, 1991]. It is our hope that by explicitly modeling articulation and reasoning over this model, a speech recognition system can handle the effects of articulatory interactions and prosody on the acoustic signal. Our primary motivation is to investigate how to model knowledge pertaining to articulation and prosody. We hope that the end result is not only a better understanding of these forms of knowledge, but also knowledge of how dependencies caused by articulation and prosody will affect the acoustic signal. The result of this research will be better knowledge for acoustic speech recognition.

We have two other motivations for our research. One is the desire to construct layered abductive problem solvers in the problem area of perception. We have attempted modest speech recognition systems but never got far enough to construct

The overall task of articulatory recognition can be broken into subtasks, just as acoustic speech recognition is broken into layers of subtasks. However, articulatory recognition is simpler, not requiring some of the lower-level details of auditory analysis or speech signal processing. The articulatory recognition subtasks are:
- Input the data (vocal-tract motions)
- Examine the data for articulatory events
- Infer articulatory gestures out of these events
- Infer phonetic units from articulatory gestures (where phonetic units can be phonemes, demisyllables or syllables)
- Combine phonetic units into lexical units (using a lexicon), deriving word boundaries as a result
- Use syntactic, semantic, and other high-level knowledge to critique the string of words
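The subtasks above form a pipeline in which each stage's conclusions become the next stage's findings. A minimal sketch follows; all of the stage functions, the gesture and phonetic-unit names, and the toy lexicon are illustrative assumptions, not ARTREC's actual knowledge or interfaces.

```python
# Sketch of the articulatory-recognition subtask pipeline: each stage
# maps the previous stage's output to more abstract units. The stage
# bodies here are toy stand-ins for real abductive inference.

def find_events(motions):
    # e.g. detect onsets/peaks in pellet trajectories
    return ["tongue-tip-raise" if m > 0 else "tongue-tip-lower" for m in motions]

def infer_gestures(events):
    # interpret raw events as articulatory gestures
    return [e.replace("-raise", "-closure") for e in events]

def infer_phonetic_units(gestures):
    # map gestures to phonetic units (phonemes, here)
    return ["/n/" if "closure" in g else "/ai/" for g in gestures]

def infer_words(units, lexicon):
    # crude lookup of a phonetic-unit sequence in a lexicon
    return [lexicon.get(tuple(units), "?")]

lexicon = {("/n/", "/ai/", "/n/"): "nine"}
motions = [1, -1, 1]
words = infer_words(infer_phonetic_units(infer_gestures(find_events(motions))), lexicon)
```

In ARTREC proper, each of these mappings is an abductive inference rather than a lookup, and the final critique stage would apply syntactic and semantic knowledge to the word string.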

It should be noted that articulatory recognition is very novel. There are only a couple of examples of similar systems: Nelson [Nelson, 1979] and later Greenewald et al. [Greenewald, 1990] constructed articulatory annotation systems in order to automatically annotate Microbeam pellet data. Greenewald's system, ArticTool, used both articulatory and acoustic input as well as a phonetic description of the utterance. Its task was, by using simple pattern matching rules, to determine where in the utterance each word and each articulatory gesture occurred. This annotation problem was much simpler than the recognition task, as the utterance was given as part of the input.
4.2.1 Why Articulation

Articulatory recognition is not necessarily a perceptual task. Humans have no way of directly observing most articulatory motions, the exceptions being lip motion and jaw motion. Other motions which are needed to infer articulation are not always visible. A natural question arises: can articulatory recognition be of any use during acoustic speech recognition? To answer this question, let us consider the problems with acoustic speech recognition. The speech signal is inherently variant. This has been a problem for all past speech recognition systems. Much of the variation is because of the effects that prosodic control has on the acoustic signal. Prosody is, traditionally, the study of rhythm and meter in speech. Phonetic discussions of prosodic aspects of speech signals have focused primarily on the fundamental frequency contour (F0, or voice pitch). However, prosody occurs because


Figure 18: Acoustic Recognition versus Articulatory Recognition

ARTREC - A Recognition System Layered Using Abduction (Peirce): Articulation as Input



In the last chapter, I described a domain-independent strategy for both abduction and layered abduction, and a tool, Peirce, for constructing abductive problem solving agents. In this chapter I will describe a layered-abduction articulatory-recognition system, ARTREC, constructed from the Peirce tool. I will first describe the task of articulatory recognition and attempt to motivate our research. I will then discuss a source of input for this task. I will then turn to ARTREC, describe the experimental data currently being used by ARTREC, give a detailed description of ARTREC's flow of processing, and take a detailed look at the knowledge of ARTREC. An example of ARTREC will be given in the next chapter.
4.2 Articulatory Recognition

Articulatory recognition is the task of identifying lexical units from articulatory input. This is similar to acoustic speech recognition (or just speech recognition); however, the input differs, and to some extent, the knowledge content differs. Acoustic speech recognition uses an acoustic signal as input. Articulatory input consists of vocal-tract motions. Vocal-tract motions are the actions that create the speech signal. Figure 18 shows the comparison between the two tasks. You can see that much of the task is abstractly the same: inferring words from spoken utterances by using many types of knowledge. The differences lie in the input and the lower-level knowledge. Acoustic speech recognition uses knowledge pertaining to auditory concepts, using acoustic features and spectral characteristics of the signal in order to derive (infer) the auditory concepts. Articulatory recognition must make use of knowledge of how sounds arise through the shaping of those sounds in the vocal-tract. The end result of either task is to infer or derive lexical units (words). And both tasks can make use of other types of knowledge (phonetic, syntactic).

The usefulness of Peirce cannot be overstated. Peirce is one of several generic problem solving tools that have been researched by the LAIR in the past. In conjunction with other tools such as CSRL for hierarchical classification and RA for hypothesis matching (scoring), powerful problem solving systems can be constructed which range in task from explanatory systems, diagnostic systems, design systems, perceptual systems and more. With a collection of such tools, many types of problem solving can be captured, provided the domain knowledge is available. In the next chapter, I will demonstrate a particular system constructed from Peirce and other tools to solve a perceptual problem.

- Uninterpretable data can be doubted and rescored. That is, if a finding is unaccountable at one level and it was passed up from a lower level, then this finding (which is a hypothesis at the lower level) can be reconsidered. It can be rescored and, if necessary, removed from the composite hypothesis at the lower level.

- Jointly uninterpretable data can be considered incompatible, and recomputation of the composite can be forced. This is a situation where two incompatible hypotheses are considered essential. Obviously, there is an error somewhere. The error may reside in the data that they are attempting to explain. In such a case, the data can be discarded and the composite reformed without one or both of the hypotheses.
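The first of these failure-handling moves, doubting a finding that nothing at the current level can explain, can be sketched as follows. The data structures, the rescoring function, and the threshold are illustrative assumptions, not Peirce's internals.

```python
# Sketch of doubting uninterpretable data: a finding that no hypothesis
# at this level can explain is sent back down to be rescored; if its new
# score falls below a threshold, it is dropped from the lower-level
# composite. Threshold and structures are illustrative assumptions.

DOUBT_THRESHOLD = 0.3

def explainable(finding, hypotheses):
    """True if some hypothesis at this level accounts for the finding."""
    return any(finding in h["explains"] for h in hypotheses)

def doubt_unaccounted(findings, hypotheses, rescore):
    kept = []
    for f in findings:
        if explainable(f, hypotheses):
            kept.append(f)
        elif rescore(f) >= DOUBT_THRESHOLD:
            kept.append(f)   # doubted and rescored, but retained
        # else: removed from the lower-level composite
    return kept

hyps = [{"name": "H1", "explains": {"f1", "f2"}}]
rescore = lambda f: {"f3": 0.1, "f4": 0.9}.get(f, 1.0)
kept = doubt_unaccounted(["f1", "f3", "f4"], hyps, rescore)
# f1 is explained; f3 rescores low and is dropped; f4 rescores high and is kept
```

The second move, discarding a datum that two jointly essential but incompatible hypotheses both try to explain, amounts to removing that datum from the findings and re-running composition at the lower level.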

Top-down knowledge is used to aid the overall problem solving. This comes in the form of opportunistic problem solving, so that partial solutions can be passed around, or so that problematic solutions can be caught and dealt with.
3.6.4 Implementation Issues

As stated above, Peirce is well-suited for layered abduction. In order to construct a layered abduction system, one need only define an abducer for each level, then connect the levels together by having one level look to the next lower level abducer's output for its input. This can be thought of as a series of cascaded abducers (in the same way that a ripple adder propagates results to the next adder). Communication from higher levels to lower levels is allowed by means of expectations and reinvocations of lower level abducers as the problem state changes. In addition, each abducer will require a means of hypothesis evocation and hypothesis instantiation. Any number of methods can be used to tackle these two subtasks.
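The cascade just described, with one abducer per level and each level reading the level below's best explanation as its findings, might be sketched as follows. The Abducer class and the toy explain functions are hypothetical stand-ins for Peirce, not its actual interface.

```python
# Sketch of cascaded abducers: each level's best explanation becomes
# the findings of the next level up, like a ripple adder propagating
# carries. The explain() bodies are toy stand-ins for real abducers.

class Abducer:
    def __init__(self, name, explain):
        self.name = name
        self.explain = explain   # maps findings -> best explanation

def run_cascade(levels, raw_findings):
    """One strictly bottom-up pass through the levels, lowest first."""
    data = raw_findings
    for level in levels:
        data = level.explain(data)
    return data

gestures = Abducer("gestural", lambda d: [x.upper() for x in d])
lexical  = Abducer("lexical",  lambda d: ["-".join(d)])
result = run_cascade([gestures, lexical], ["a", "b"])
```

Hypothesis evocation and instantiation would live inside each level's explain step; the cascade itself only wires outputs to inputs.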
3.7 Conclusion

In this chapter I have shown a particular strategy for abduction consisting of hypothesis evocation, hypothesis instantiation and hypothesis composition. Past research in the LAIR has produced many useful ideas pertaining to abduction. These ideas can be used to construct an opportunistic, island-building strategy for hypothesis composition. Peirce is exactly such a tool. I have shown the domain independence of Peirce by considering several systems constructed from Peirce. Finally, I have shown how Peirce is perfectly suitable for solving problems in both abduction and layered abduction settings.


Figure 17: Parallel Peirce

Finally, the layered abduction strategy accommodates a parallel implementation. Initial processing would occur at the lowest level until a partial solution is found. This partial solution can then be propagated upwards while processing continues at the lowest level. As other levels reach partial conclusions, their explanations are passed upwards while their expectations are passed downwards. If each level is implemented as an abducer on a separate computer processor, then this version of the control is simple to implement, and communication occurs as message passing between processors. Peirce is well-suited for this form of processing. One additional feature necessary to implement a parallel layered abduction strategy is that an abducer would go into a suspended mode if findings or expectations are not yet available. An example of parallel abducers is shown in figure 17. It should be noted that parallel layered abduction is especially suited for speech recognition. A portion of the acoustic signal (or articulatory signal, if we consider ARTREC) can be given to the lowest level abducer. This abducer will generate a partial (or complete) explanation for the portion of the acoustic signal. This solution is passed along to higher level abducers and the lowest level abducer begins to operate on the next portion of the acoustic signal. As the lowest level abducer comes to conclusions, they are passed along to other levels. The other levels will continue to work until they also achieve a conclusion which they can pass along to higher level abducers, and/or generate expectations to pass along to the lower level abducer. Very little control is necessary in this picture because all of the intermediate abducers have the same processing task: to achieve a (partial) conclusion for any input, and if a conclusion cannot be reached, then to query the lower level abducers with expectations or questions.
In principle, a parallel abduction system is not only possible, but potentially very efficient.
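The parallel scheme can be sketched with one worker thread per level connected by queues: a level is "suspended" simply by blocking on its input queue until findings arrive from below. Queue-based threading is an illustrative assumption about how the message passing between processors might be realized; it is not part of Peirce.

```python
# Sketch of parallel layered abduction: one thread per level, connected
# by queues. A level blocks until findings arrive from below, explains
# them, and passes its conclusion upward. None is an end-of-input
# sentinel. Illustrative only; not Peirce itself.
import queue
import threading

def level_worker(name, explain, inbox, outbox):
    while True:
        findings = inbox.get()     # suspend until findings are available
        if findings is None:       # sentinel: shut down and propagate
            outbox.put(None)
            return
        outbox.put(explain(findings))

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=level_worker,
                     args=("gestural", lambda d: d.upper(), q0, q1)),
    threading.Thread(target=level_worker,
                     args=("lexical", lambda d: d + "!", q1, q2)),
]
for t in threads:
    t.start()

for chunk in ["five", "nine"]:     # portions of the signal, fed in order
    q0.put(chunk)
q0.put(None)

results = []
while (r := q2.get()) is not None:
    results.append(r)
for t in threads:
    t.join()
```

While the lexical level works on one chunk, the gestural level is already free to begin the next, which is exactly the pipelining the text describes. Downward-flowing expectations would need a second set of queues running in the opposite direction.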
3.6.3 Uses of Downward-Flowing Processing

Downward-flowing processing comes in the form of expectations and problem solving guidance. There are many reasons to use downward-flowing processing.
- Data-seeking needs, which may arise if a hypothesis at one level needs to determine whether data actually exists at a lower level. This need will arise if a hypothesis' plausibility is in question. In such a case, the higher level can prompt the lower level to reexamine the data, to see if the data exists or not. This may require test ordering (in a diagnostic situation), reexamining the visual or acoustic input (in perceptual situations), or seeking new means of gathering data (in theory formation situations).

- Expectations based on firmly established hypotheses can be passed down to lower levels to prune away irrelevant hypotheses or to aid in hard decisions.

avoided. In theory formation, hypotheses are typically tested by the expectations of the theories. In such a case, much of the processing occurs in a top-down fashion, where expectations help score lower level hypotheses. It is unfortunate that there is no built-in control, however, as this is additional work for the system builder. But in spite of this problem, there is a great deal of flexibility in using Peirce for layered abduction, as any form of control is possible, whether the control is to allow individual abducers to run in parallel, serially in a bottom-up fashion, in some complex pattern of bottom-up and top-down processing, or in a middle-out fashion. The flexibility in Peirce comes from the fact that each level's abducer is independent of all other levels' abducers. There exist dependencies, as one level's conclusion becomes the next level's findings; however, partial solutions can be propagated upwards or downwards while other abducers work on partial solutions. Higher level abducers that come to partial conclusions can attempt to aid lower level abducers by generating expectations. Lower level abducers can pass along partial solutions and await either expectations from above, further solutions from below, or some form of prompting for guidance or verification.
3.6.2 Flows of Processing

There are many possible forms of processing control that can be used in a layered abduction system. They are all variations of the same theme: bottom-up and top-down. The simplest control form is strictly bottom-up. Here, findings are introduced and explained at one level. The explanation is then offered as findings for the next level to explain. This process continues until the highest level reaches an explanation. A variation of this control introduces top-down processing. When one level (other than the lowest level) reaches a conclusion, its conclusion is sent upwards to the next level and expectations are generated to be sent to the next lower level. These expectations are based on what should be found in the data given the hypotheses that were concluded. The lower level which receives expectations can use these expectations to aid problem solving if some portion of the findings have yet to be explained. Or, verification can take place, where the expectations can confirm whether the lower level explanation is consistent with the expectations. Another form of processing is a middle-out method. HWIM [Wolf and Woods, 1980] used a method where islands of certainty were found initially, and this allowed some initial conclusions at a higher level (relating to the lexical level), although not at the highest levels (relating to syntax and semantics). HWIM would then use upward and downward flowing processing to advance all levels. Communication would then be based on the completion of solutions, flowing upwards from the low levels and downwards from the high levels.

to a more abstract conclusion, or a conclusion using a different form of knowledge. This is highly useful in many types of reasoning because the world is a complicated place in which there is a wide range of knowledge types, and abstractions are found everywhere. For example, in speech recognition, we can come to a conclusion about the speech signal: it is composed of bursts of sound, sonorant regions, noisy regions, and so forth. However, this conclusion is insufficient. It is at an entirely inappropriate level of detail. We could also conclude phonetic features such as velar pinch, voicing and formant frequency. This too is inappropriate for recognizing speech. We could even make a conclusion about the phonetic units involved and offer a string of such units. However, without word boundaries, this too would be an inappropriate conclusion. If instead we could infer lexical items from this data (i.e. words), we would feel better about accepting a conclusion. Similarly, in diagnosis, it is not always sufficient to explain symptoms in terms of diseases or malfunctions. We would feel better if we understood causally how the symptoms and malfunctions came about. In diagnosing why a nuclear power plant had to be shut down, we don't want to simply know that there was a rise in core temperature; we want to know how this malfunction came about. Finding deeper and deeper causes, deriving causal stories, and explaining hypotheses in terms of new theories or laws all require a fuller problem solving environment than a simple one-layered mapping. All of these problems will require multiple inferences across, potentially, many kinds of knowledge. Our problem solvers require the ability to reason in such a way. Layered abduction is a method for solving these types of tasks.
3.6.1 Using Peirce for Layered Abduction

The strategy discussed in the Peirce section is perfectly adaptable for accomplishing layered abduction tasks. To use Peirce for layered abduction, one would create an abducer for each level in the problem. Then, one would link abducers together in an upward-flowing fashion, so that the best explanation generated by one abducer would become the findings to be accounted for by the next abducer. Each level would need a separate source of instantiated hypotheses. The hypothesis composition process is taken care of by the Peirce abducer. Peirce does not have a built-in control strategy for layered abduction. This is both fortunate and unfortunate. It is fortunate because the control strategy for layered abduction is, in part, domain specific. For instance, in perceptual domains, little top-down processing seems to occur at the lower levels. These levels are so highly compiled that most of the processing presumably occurs only in a bottom-up manner. Top-down processing seems to come into play only when ambiguities are found to exist or enough high level knowledge is available that some lower level processing can be

3.5.2 Is Peirce a Plausible Mechanism for Abduction?

Several questions arise with the discussion of the Peirce algorithm. How realistic is it? Do humans really solve abductive problems by such a mechanism? How should this strategy be changed for problems which make use of highly compiled knowledge (such as perceptual problems)? Many of the ideas behind Peirce come from analyzing human problem solving. Hard decisions are typically delayed in interpretation problems. Essentials and islands of certainty are used to find purchase in difficult problems. Further, the ability to adjust a solution based on the need for coverage versus the need for accuracy is clearly exhibited in problems such as diagnosis. Because of these factors, Peirce seems perfectly reasonable as a solution. However, I am not trying to say that we all have little Peirce algorithms running in our heads as we solve problems, but rather that the types of strategies used by Peirce seem similar to the ways that we solve abductive problems. In perception, however, an algorithm like Peirce seems too deliberative. Is there sufficient time for a Peirce-like strategy, one which requires us to propose hypotheses, score hypotheses and then compose a single composite explanation to account for the findings we face? The answer to this is very controversial. It has been posited by many that perception is inferential, and argued by others that perception cannot allow for any form of deliberation. If the answer is "no, there is no time for inference", then obviously a Peirce-like algorithm is not appropriate. Yet the answer to this question remains unknown. It seems plausible to suggest that perceptual tasks require some inference, which brings about a second question: is the algorithm of Peirce suitable for perception? The thesis of this document is that perceptual problems can be solved by using layered abduction, and the Peirce strategy in particular, as a method.
As will be shown in the next chapter, we can solve portions of the speech recognition problem by using layered abduction. It remains to be seen whether the entire speech recognition problem can be solved in such a way, but the results discussed here are encouraging.
3.6 Layered Abduction
