Below is the unedited penultimate draft of:
Norris D., McQueen J. M., Cutler A., (2000) Merging information in speech
recognition: Feedback is never necessary
Behavioral and Brain Sciences 23 (3): XXX-XXX.
This is the unedited penultimate draft of a BBS target article that has been accepted for publication (Copyright 1999: Cambridge University Press U.K./U.S. -- publication date provisional) and is currently being circulated for Open Peer Commentary. This preprint is for inspection only, to help prospective commentators decide whether or not they wish to prepare a formal commentary. Please do not prepare a commentary unless you have received the hard copy, invitation, instructions and deadline information.
Merging information in speech recognition:
Feedback is never necessary
Dennis Norris
Medical Research Council Cognition and Brain Sciences Unit,
15, Chaucer Rd.,
Cambridge, CB2 2EF,
U.K.
Dennis.Norris@mrc-cbu.cam.ac.uk
http://www.mrc-cbu.cam.ac.uk/
James M. McQueen and Anne Cutler
Max-Planck-Institute for Psycholinguistics,
Wundtlaan 1, 6525 XD Nijmegen,
The Netherlands
James.McQueen@mpi.nl
and Anne.Cutler@mpi.nl
http://www.mpi.nl
Abstract:
Top-down feedback does not benefit speech recognition; on the contrary, it can hinder it. No experimental data imply that feedback loops are required for speech recognition. Feedback is accordingly unnecessary and spoken word recognition is modular. To defend this thesis we analyse lexical involvement in phonemic decision-making. TRACE (McClelland & Elman 1986), a model with feedback from the lexicon to prelexical processes, is unable to account for all the available data on phonemic decision-making. The modular Race model (Cutler & Norris 1979) is likewise challenged by some recent results, however. We therefore present a new modular model of phonemic decision-making, the Merge model. In Merge, information flows from prelexical processes to the lexicon without feedback. Because phonemic decisions are based on the merging of prelexical and lexical information, Merge correctly predicts lexical involvement in phonemic decisions in both words and nonwords. Computer simulations show how Merge is able to account for the data through a process of competition between lexical hypotheses. We discuss the issue of feedback in other areas of language processing, and conclude that modular models are particularly well suited to the problems and constraints of speech recognition.
Keywords: feedback, modularity, phonemic decisions, lexical processing, computational modelling, word recognition, speech recognition, reading

Dennis Norris is a member of the senior scientific staff of the Medical Research Council Cognition and Brain Sciences Unit, Cambridge, UK. James McQueen is a member of the scientific staff of the Max-Planck-Institute for Psycholinguistics, Nijmegen, The Netherlands. Anne Cutler is director (language comprehension) of the Max-Planck-Institute for Psycholinguistics, and professor of comparative psycholinguistics at the University of Nijmegen. All three work on the comprehension of spoken language, in particular spoken-word recognition. Between them they have authored over 220 publications in this area, of which 47 are joint publications of at least two of them; this is the tenth paper on which all three have collaborated.
Psychological processing involves converting information from one form to another. In speech recognition - the focus of this target article - sounds uttered by a speaker are converted to a sequence of words recognized by a listener. The logic of the process requires information to flow in one direction: from sounds to words. This direction of information flow is unavoidable and necessary for a speech recognition model to function.
Our article addresses the question of whether output from word recognition is fed back to earlier stages of processing such as acoustic or phonemic analysis. Such feedback entails information flow in the opposite direction - from words to sounds. Information flow from word processing to these earlier stages is not required by the logic of speech recognition, and cannot replace the necessary flow of information from sounds to words. Thus it could only be included in models of speech recognition as an additional component.
Occam's razor instructs theorists never to multiply entities unnecessarily. Applied to the design of processing models, this precept excludes any feature which is not absolutely necessary to enable the model to account for observed data. We argue that models without the feedback we have just described can account for all the known data on speech recognition. Models with feedback from words to earlier processes therefore violate Occam's razor.
Nevertheless many theorists have proposed such models; the question of whether there is feedback from word recognition to earlier acoustic and phonemic processing has even become one of the central debates in the psychology of speech recognition. In this article we consider the arguments that have been proposed in support of this feedback, and show that they are ill-founded. We further examine the relevant empirical data - studies of how listeners make judgements about speech sounds - and show that the evidence from these studies is inconsistent with feedback and strongly supportive of models without this feature.
However, we argue that no existing model, with or without feedback, adequately explains the available data. Therefore we also propose a new model of how listeners make judgements about speech sounds. This model demonstrates how information from word processing and from earlier processes can be merged in making such judgements without feedback from the later level to the earlier.
In the psychological literature on speech recognition, models without feedback are generally referred to as ``autonomous'' and models with feedback as ``interactive''. In autonomous models, each stage proceeds independently of the results of subsequent processing. In interactive models, feedback between any two stages makes these stages interdependent. (Strictly speaking it is not models in their entirety, but stages of each model, to which the terms should be applied; Norris, 1993. A model may have a mixture of autonomous and interactive stages.) Models with only autonomous stages have only feedforward flow of information, and this is also referred to as ``bottom-up'' processing, while feedback in interactive models is also referred to as ``top-down'' processing. Note that the only type of feedback which is at issue in the speech recognition literature, and in the present article, is feedback from later stages which actually alters the way in which an earlier stage processes its input. The specific question is: does information resulting from word (lexical) processing feed back to alter the immediate operation of prelexical processes (all the processes which intervene between the reception of the input and contact with lexical representations)?
The debate about feedback in psychological models of speech recognition has centred on evidence from experimental tasks in which listeners are required to make phonemic decisions (judgements about categories of speech sounds). Such tasks include (1) phoneme monitoring, in which listeners detect a specified target phoneme in spoken input, e.g. respond as soon as they hear the sound /b/ in ``brain'' (see Connine & Titone, 1996, for a review); (2) phonetic categorization, in which listeners assign spoken input to phoneme categories (see McQueen, 1996, for a review); and (3) phoneme restoration, in which listeners hear speech input in which portions of the signal corresponding to individual phonemes have been masked or replaced by noise (see Samuel, 1996, for a review).
This section describes two example models in detail, to illustrate the structural differences we described above. The two models we have chosen represent the extreme positions of the feedback debate in this area. They are the interactive theory TRACE (McClelland & Elman 1986) and the autonomous Race model (Cutler & Norris 1979).
The evidence on the relationship between lexical and prelexical processes seems at first glance to support the case for feedback. Many studies show convincingly that there are lexical influences on tasks involving phonemic decisions or phoneme identification.
In a wide range of tasks, phoneme identification is influenced by lexical context. For example, phoneme monitoring is faster in words than in nonwords (Cutler, Mehler, Norris & Seguí 1987; Rubin, Turvey & van Gelder 1976). In sentences, phoneme monitoring is faster in words that are more predictable (Foss & Blank 1980; Mehler & Seguí 1987; Morton & Long 1976). Lexical biases are also observed in phonetic categorization tasks. Ganong (1980) generated ambiguous phonemes on a continuum between a word and a nonword (e.g. type-dype), and found that subjects are biased toward classifying phonemes in the middle of this range so as to be consistent with a word (type) rather than a nonword (dype). In phoneme restoration, listeners' ability to determine whether phonemes are replaced by noise or simply have noise added is worse in words than in nonwords (Samuel 1981; 1987; 1996).
An apparently simple and appealing explanation of these results is that lexical influences come about as a direct result of lexical processes exerting top-down control over a prior process of phonemic analysis. This is exactly what happens in the TRACE model of McClelland and Elman (1986). TRACE has three levels of processing. Activation spreads from the feature level to the phoneme level, and from there to the word level. In addition, activation of each word node feeds back to its constituent phonemes in the phoneme layer. (The relationship between lexical and phonemic processing in TRACE is thus directly analogous to the relationship between word and letter processing in Rumelhart and McClelland's (1982) Interactive Activation Model [IAM] of reading.)
Because of the top-down connections in TRACE, phonemes in words are part of a feedback loop which increases their activation faster than that of phonemes in nonwords. TRACE can thus readily account for the finding that phoneme monitoring responses to targets in words tend to be faster than those to targets in nonwords. Phonemes in nonwords will also receive some feedback from words that they partially activate. Therefore even phonemes in word-like nonwords will tend to be activated more quickly than phonemes in nonwords which are not similar to words. TRACE explains lexical effects in phonetic categorization in the same way. When listeners are presented with input containing ambiguous phonemes, top-down activation from the lexicon will act to bias the interpretation of the ambiguous phoneme so that it is consistent with a word rather than with a nonword, exactly as observed by Ganong (1980). Likewise, the top-down connections in TRACE can explain why phonemic restoration is more likely in words than in nonwords (Samuel 1996).
An important feature of TRACE is that it is what Cutler et al. (1987) have described as a single-outlet model. The only way TRACE can make a phoneme identification response is by read-out from a phoneme node in the phoneme layer. One immediate consequence of this single-outlet architecture is that when presented with a mispronunciation, TRACE is unable to identify the nature of this mispronunciation. Although a mispronounced word would activate the wrong set of input phonemes, there is no way that the error can be detected. The mispronunciation will reduce the overall activation of the word node, but the system will be unable to tell at the lexical level which phoneme was mispronounced, because it has no independent representation of the expected phonological form against which the input can be compared. Because top-down feedback will act to correct the errorful information at the phoneme level, the only representation of the input phonology which is available in the model will also be removed. Indeed, a similar reduction in activation of a word's lexical node coupled with top-down activation of the phonemes corresponding to the word could arise from unclear articulation or noise in the input. Somewhat ironically, the feature which thus allows TRACE to fill in missing phoneme information is precisely what prevents it from detecting mispronunciations. In contrast, the Race model, to which we now turn, does have independent lexical representations of the phonological forms of words. The presence of these representations makes it possible to detect mispronunciations, freeing the model from the need for feedback.
The findings described above are consistent with the idea of interaction between lexical and prelexical processes. But although the assumption of interaction fits with many people's intuitions about the nature and complexity of the speech recognition process, it is certainly not forced by the data. In itself, the fact that lexical information can affect the speed of, say, phoneme detection is neutral with respect to the issue of whether lexical and prelexical processes interact; it is simply evidence that both lexical and phonemic information can influence a response. Information from two separate sources can influence a response without the processes delivering that information necessarily influencing each other.
This is exactly what happens in the Race model of Cutler and Norris (1979), and, as will be discussed later, in other models without top-down feedback. According to the Race model, there are two sources of information which can be used to identify phonemes: identification can occur via a prelexical analysis of the input or, in the case of words, phonemic information can be read from the lexical entry. Thus, in contrast to the single-outlet assumption of TRACE, the Race model is a multiple-outlet model.
Responses in phoneme monitoring are the result of a race between these two processes. As in all first-past-the-post race models, the response is determined solely by the result of the first route to produce an output. The mean winning time of a race between two processes with overlapping distributions of completion times will always be faster than the mean of either process alone. So phoneme monitoring will be faster for targets in words than for targets in nonwords because responses made to targets in words benefit from the possibility that they will sometimes be made on the basis of the outcome of the lexical process, whereas for targets in nonwords, responses must always depend on the prelexical process. The Race model offers a similar explanation for the lexical bias observed in phonetic categorization. An ambiguous phoneme in a type-dype continuum, for example, will sometimes be identified by the lexical route and sometimes by the phonemic route. The contribution of the lexical route will lead to an overall bias to produce a lexically consistent response. Hence the repeated simple demonstration of lexical involvement in a range of phonemic judgement tasks does not distinguish between interactive and autonomous models; both TRACE and the Race model can account for all of the basic effects.
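The statistical point about races can be illustrated with a small simulation. The following is only a sketch under assumed values: the normal finishing-time distributions, their means and standard deviations, and the function name `simulate_race` are hypothetical choices made for illustration, not parameters of the Race model.

```python
import random

def simulate_race(n_trials=20000, seed=1):
    """Toy first-past-the-post race between two routes (illustrative numbers only)."""
    rng = random.Random(seed)
    prelexical, lexical, winner = [], [], []
    for _ in range(n_trials):
        t_pre = rng.gauss(500, 80)  # hypothetical prelexical finishing time (ms)
        t_lex = rng.gauss(520, 80)  # hypothetical lexical finishing time (ms)
        prelexical.append(t_pre)
        lexical.append(t_lex)
        # For a word, the response is made by whichever route finishes first.
        winner.append(min(t_pre, t_lex))

    def mean(values):
        return sum(values) / len(values)

    return mean(prelexical), mean(lexical), mean(winner)

# Nonword targets can only use the prelexical route; word targets use the race.
mean_prelexical, mean_lexical, mean_word = simulate_race()
print(f"prelexical only: {mean_prelexical:.0f} ms")
print(f"lexical only:    {mean_lexical:.0f} ms")
print(f"race (words):    {mean_word:.0f} ms")
```

In this sketch the race is faster on average than either route alone even though the lexical route is, on average, the slower of the two, which is the sense in which targets in words gain an advantage without any interaction between the routes.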
Discussions of bottom-up versus interactive models frequently seem to draw on an implicit assumption that top-down interaction will help performance, but does it? In this section we examine this assumption, and conclude that the feedback in models we discuss is not beneficial.
In models like TRACE (or Morton's 1969 logogen model), the main effect of feedback is simply to alter the tendency of the model to emit particular responses. The top-down flow of information does not help these models to perform lexical processing more accurately.
Consider first of all the issue of how feedback from lexical to prelexical processes might facilitate word recognition. The best performance that could possibly be expected from a word recognition system is to reliably identify the word whose lexical representation best matches the input representation. This may sound trivially obvious, but it highlights the fact that a recognition system that simply matched the perceptual input against each lexical entry, and then selected the entry with the best fit, would provide an optimal means of performing isolated word recognition (independent of any higher level contextual constraints), limited only by the accuracy of the representations. Adding activation feedback from lexical nodes to the input nodes (whether phonemic or featural) could not possibly improve recognition accuracy at the lexical level.
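The best-match argument can be made concrete with a toy recognizer. This is a minimal sketch, not an implementation of any published model: the phoneme labels, the evidence values, the three-word lexicon and the helper `add_feedback` are all assumptions introduced for illustration.

```python
def match_score(evidence, word):
    """Sum the bottom-up support for each phoneme of `word`, slot by slot."""
    return sum(slot.get(phoneme, 0.0) for slot, phoneme in zip(evidence, word))

def recognize(evidence, lexicon):
    """Return the same-length lexical entry that best fits the input."""
    candidates = [w for w in lexicon if len(w) == len(evidence)]
    return max(candidates, key=lambda w: match_score(evidence, w))

def add_feedback(evidence, word, gain):
    """Boost the evidence for every phoneme of `word` (a crude stand-in for feedback)."""
    return [{**slot, phoneme: slot.get(phoneme, 0.0) + gain}
            for slot, phoneme in zip(evidence, word)]

# Hypothetical lexicon and input: an onset that is ambiguous between /t/ and /d/.
LEXICON = [("t", "ai", "p"), ("t", "ai", "m"), ("d", "ai", "m")]  # type, time, dime
EVIDENCE = [{"t": 0.55, "d": 0.45}, {"ai": 1.0}, {"p": 0.9, "m": 0.1}]

first_pass = recognize(EVIDENCE, LEXICON)
# Feeding the winner back to the input level and matching again.
second_pass = recognize(add_feedback(EVIDENCE, first_pass, gain=0.5), LEXICON)
print(first_pass, second_pass)
```

Because the feedback here only adds support to the phonemes of the entry that already won, the second pass cannot select a different word; in this sketch it changes the input representation without changing, or improving, the lexical decision.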
To benefit word recognition, feedback would have to enable the system to improve on the initial perceptual representation of the input. A better representation would in turn improve the accuracy of the matching process. For example, having isolated a set of candidate words, the lexical level might instruct the prelexical level to utilize a specialized, more accurate, set of phoneme detectors rather than general-purpose detectors used in the first pass. But note that the plausibility of such a theory depends on the questionable assumption that the system performs a sub-optimal prelexical analysis on the first pass. If it does the best it can first time round, there is no need for feedback.
An interactive model that improved input representations would operate in a manner analogous to the verification models which have been proposed for visual word recognition (e.g. Becker 1980; Paap, Newsome, McDonald & Schvaneveldt 1982). However, to our knowledge, no such model exists for spoken-word recognition. All models with feedback involve flow of activation from lexical to phoneme nodes, as in TRACE. In models of this kind, which we will call interactive bias models, interaction does not have a general beneficial effect on word recognition, although it can influence phoneme recognition. This is confirmed by Frauenfelder and Peeters' (1998) TRACE simulations, which showed that the performance of TRACE does not get worse when the top-down connections are removed (i.e. approximately as many words were recognized better after the connections were removed as were recognized less well).
In general, although interactive bias cannot assist word recognition, it can help phoneme recognition, especially when the input consists entirely of words. If the /n/ in the middle of phoneme cannot be distinguished clearly by the phoneme level alone, interactive bias from the lexical level can boost the activation for the /n/ and make it more likely that the phoneme will be identified. Of course, if the input were the nonword phomeme instead, the biasing effect would impair performance rather than help it, in that the mispronunciation would be overlooked. That is, interactive bias models run the risk of hallucinating. Particularly when the input is degraded, the information in the speech input will tend to be discarded, and phonemic decisions may then be based mainly on lexical knowledge. This is because top-down activation can act to distort the prelexical representation of the speech input (Massaro 1989a). If there is strong top-down feedback, the evidence that there was a medial /m/ in phomeme may be lost as the lexicon causes the phoneme level to settle on a medial /n/ instead. In fact, repeated empirical tests have shown that mispronunciations are not overlooked, but have a measurable adverse effect on phoneme detection performance (Koster 1987; Otake, Yoneyama, Cutler & van der Lugt 1996; Gaskell & Marslen-Wilson 1998).
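The risk of overwriting the input can be shown with a deliberately simplified feedback loop. The sketch below is a cartoon and does not use TRACE's actual update equations: the function `settle`, the activation values, the feedback gain and the normalisation step are all assumptions chosen only to make the logic visible.

```python
def settle(medial_evidence, feedback_gain, n_cycles=10):
    """Toy interactive-bias loop for the medial consonant of the input 'phomeme'.

    medial_evidence: bottom-up support for /m/ and /n/ (hypothetical values).
    feedback_gain:   strength of word-to-phoneme feedback (0 means no feedback).
    """
    act = dict(medial_evidence)
    for _ in range(n_cycles):
        # The word node for "phoneme" is supported mostly by the other,
        # unambiguous phonemes (the constant 0.8) and partly by /n/.
        word_act = 0.8 + 0.2 * act["n"]
        # Feedback boosts only the lexically consistent phoneme.
        act["n"] += feedback_gain * word_act
        # Normalisation stands in for competition between phoneme nodes.
        total = act["m"] + act["n"]
        act = {phoneme: a / total for phoneme, a in act.items()}
    return act

bottom_up = {"m": 0.7, "n": 0.3}  # the signal favours /m/
without_feedback = settle(bottom_up, feedback_gain=0.0)
with_feedback = settle(bottom_up, feedback_gain=0.3)
print(without_feedback, with_feedback)
```

Without feedback the phoneme level continues to reflect the signal, which favoured /m/. With feedback the same level settles on /n/, and because this is the only prelexical representation in the sketch, the evidence for the mispronunciation is no longer available to any later decision.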
It is important to note that the biasing activation makes the lexically consistent phoneme more likely as a response. One might argue that this is a useful property of the model designed to deal with normal speech where the input does, of course, consist almost exclusively of words in the listener's lexicon. However, given that the interaction does not help word recognition, it is not clear what advantage is to be gained by making the prelexical representations concur with decisions already made at the lexical level. Once a decision has been reached at the word level, there is no obvious reason why the representations which served as input to the word level should then be modified. The only reason for feedback would be to improve explicit phoneme identification. But even this reason has little force since autonomous models offer an alternative way for lexical information to influence phonemic decisions, one which does not suffer from the disadvantages caused by introducing a feedback loop into the recognition process.
In interactive bias models, therefore, lexical information can sometimes improve phoneme identification and sometimes impair it. In verification models, on the other hand, lexical information should always improve phoneme identification in words. As Samuel (1996) has demonstrated in the phoneme restoration task, subjects' ability to detect whether or not phonemes in noise are present is poorer in words than nonwords. Moreover, this lexical disadvantage increases throughout the word. That is, in this task, lexical information introduces a bias which reduces the performance of the phoneme level. Samuel (1996) points out that this finding is exactly what TRACE would predict. However, it is also exactly what the Race model would predict. According to the Race model, if the lexical route wins when the phoneme has been replaced by noise, this will produce an error, but the lexical route should never win for nonwords.
Despite the fact that this result is exactly what would be expected on the basis of the Race model, Samuel (1996) interprets it as evidence against the Race model, arguing that lexical effects in the Race model can only be facilitatory. He reasons that because the lexical effect in the phoneme restoration study is the reduction of the discriminability of phonemes and noise, the lexical effect is not facilitatory, and hence contradicts the Race model. This is incorrect, however. Because lexical information races with phonemic information, lexical effects must certainly always have a facilitatory effect on phoneme monitoring latencies to targets in words, but the race will not facilitate all aspects of phoneme perception. If the lexical route produces a result even when the phoneme has been replaced by noise, the listener will have difficulty determining whether there really was a phoneme in the input, or just noise. The lexical route facilitates identification of the underlying phoneme, but this in turn impairs the listener's ability to discriminate the phoneme from the noise.
Hence Samuel's (1996) study does not discriminate between TRACE and the Race model, or between interactive and autonomous models in general. The real significance of these restoration results is that they appear inconsistent with more active forms of interaction, such as the one discussed above, where feedback would act to improve input representations (as in the verification model of Becker 1980). Such models incorrectly predict that lexical influences will always increase perceptibility. This in turn suggests that if the recognition system were interactive, it would be more likely to have the characteristics of an interactive bias model than of a verification model. As we have argued, however, interaction in bias models cannot improve word recognition and can cause misidentifications at the phonemic level.
We have argued that, in general, feedback of lexical activation can only bias phoneme identification, without actually improving sensitivity (Note 1). Indeed, Samuel's (1996) phoneme restoration results were consistent with both TRACE and the Race model in showing that lexical information biased subjects toward lexically consistent responses instead of improving their sensitivity in discriminating between phonemes and noise. Nevertheless, if one could devise an experiment in which lexical information were shown to improve listeners' sensitivity in phoneme identification this would prove particularly problematic for autonomous models. The standard way to investigate sensitivity and bias in perceptual experiments is to use Signal Detection Theory (SDT).
To many authors, SDT seems to offer a simple technique for distinguishing between interactive and autonomous theories. The decision to use SDT has generally been based on the idea that changes in sensitivity, as measured by d', reflect changes in perceptual processes (cf. Farah 1989). For example, if context provided by lexical information influences sensitivity in phoneme identification, this is taken as evidence that the contextual factor is interacting with the perceptual processes of phoneme identification. Although this is an appealing notion, applying SDT here is far from straightforward. In general, SDT studies of interaction have either used inappropriate analyses or drawn invalid conclusions from them.
One of the central pitfalls of applying SDT analyses is evident in work on the influence of context on visual word recognition. Rhodes, Parkin and Tremewan (1993) used SDT to study context effects in lexical identification. They applied standard unidimensional signal detection theory to data from visual word recognition and reported that semantic priming did indeed alter d'. From this they concluded that context was influencing the perceptual analysis of the words, in violation of modularity (Fodor 1983; 1985). Norris (1995), however, pointed out that the standard unidimensional SDT model was inappropriate under these circumstances because its assumptions do not map onto those of any current or even any plausible model of visual word recognition. Norris also showed that the unidimensional measure of sensitivity (d') cannot even account for the basic effect of semantic priming and that a multidimensional version of SDT, embodying more plausible assumptions about visual word recognition (similar to the logogen model, Morton 1969, or the checking model, Norris 1986), could account for the same data purely in terms of response bias, with no need for context to influence earlier perceptual processes. The lesson here is that the choice of SDT model must be guided by plausible psychological task models. If the underlying assumptions of SDT are not satisfied in the psychological model, the results of the SDT analysis will be meaningless.
Confusion over the interpretation of the results of SDT analysis can lead authors to make claims which are not justified by the data even when the technical application of SDT seems appropriate. In a study of assimilation effects in speech perception (e.g. freight bearer may be produced as frayp bearer but heard as freight bearer), Gaskell and Marslen-Wilson (1998) found that subjects were less able to perceive the assimilation in words than nonwords. From this they concluded that the effect was `perceptual'; their line of reasoning seems to be similar to Farah's argument that sensitivity effects must be due to perceptual processes. However, Gaskell and Marslen-Wilson (1998) also assume that the sensitivity effects tell them about the locus of the perceptual effect, namely, that it was not based on the output of lexical processing. Conclusions about the locus of the effect cannot be supported by this kind of data. The SDT analysis simply informs us that the discrimination performance of the system under observation is worse in words than in nonwords. Such data are perfectly consistent with the idea that the change in sensitivity arises after lexical processing. For example, if lexically derived phonemic information always determined responses when it was available, then detection of assimilation would always be much worse in words than nonwords even though the locus of the effect was at a late stage where lexical and prelexical information was combined. That is, a late bias to respond lexically will be manifest as a decrease in sensitivity when the overall performance of the system is subject to SDT analysis.
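A short calculation shows how a late lexical override lowers measured d' while prelexical sensitivity stays fixed. It is a sketch under stated assumptions: the equal-variance d' formula is standard, but the hit and false-alarm rates, the override proportion `p_lexical`, and the mixing rule in `observed_rates` are hypothetical and are not taken from Gaskell and Marslen-Wilson's data.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Equal-variance sensitivity index: z(hits) - z(false alarms)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

def observed_rates(hit_rate, false_alarm_rate, p_lexical):
    """Mix prelexical responses with a late lexical override.

    On a proportion `p_lexical` of trials the response is read from the
    lexical entry, which always reports the canonical (unassimilated) form,
    so "change detected" responses can only come from the remaining trials.
    """
    keep = 1.0 - p_lexical
    return keep * hit_rate, keep * false_alarm_rate

# Hypothetical prelexical performance, identical for words and nonwords.
PRELEXICAL_HIT, PRELEXICAL_FA = 0.85, 0.15

d_nonword = d_prime(*observed_rates(PRELEXICAL_HIT, PRELEXICAL_FA, p_lexical=0.0))
d_word = d_prime(*observed_rates(PRELEXICAL_HIT, PRELEXICAL_FA, p_lexical=0.5))
print(f"d' nonwords: {d_nonword:.2f}, d' words: {d_word:.2f}")
```

The measured d' is lower for words although the prelexical hit and false-alarm rates were set to be the same in both conditions; the drop arises entirely at the stage where the two sources of information are combined.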
A similar problem arises in a phonetic categorization study. Pitt (1995) used SDT to analyze the influence of lexical information on categorization. He concluded that lexical information influenced the perceptual analysis of phonemes and that his data supported interactive theories over autonomous ones. Pitt showed a lexical effect on the categorization of the phonemes /g/ and /k/ in the continua gift-kift and giss-kiss. He then transformed his data into d' values by converting each subject's proportion of /g/ responses for each step of both continua into z scores and then calculating d' by subtracting adjacent z scores. When plotted in this fashion the two continua differed in the location of their peak d' score. Pitt concluded that since lexical information shifted the d' measure it must have been having an effect on phoneme perception, and that this was evidence of interaction.
This shift in the peak of the d' function, however, is simply a direct reflection of the shift in the conventional identification function. Lexical information has not increased the observer's overall ability to discriminate between the two phonemes; it has just shifted the category boundary. As is usual in categorical perception studies, the category boundary corresponds to the peak in the discrimination function and the maximum slope of the identification function. Lexical information has not enabled the listener to extract more information from the signal; it has just shifted the point of maximum sensitivity. Lexical information has indeed altered the pattern of sensitivity, but it is the position not the amount of sensitivity that has been changed. Exactly this kind of lexically induced boundary shift can emerge from an autonomous bias model (as Pitt 1995 in fact admits). Indeed, Massaro and Oden (1995) showed that the autonomous Fuzzy Logical Model of Perception (FLMP; Massaro 1987; 1989b; 1997; Oden & Massaro 1978) could fit Pitt's data very accurately.
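The relation between a boundary shift and the peak of the adjacent-step d' function can be checked numerically. The sketch below assumes a logistic identification function with an arbitrary slope and an eight-step continuum; these are illustrative choices, not Pitt's data, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def identification(step, boundary, slope=1.5):
    """Hypothetical proportion of responses in one phoneme category at a given step."""
    return 1.0 / (1.0 + math.exp(-slope * (step - boundary)))

def adjacent_d_primes(boundary, n_steps=8):
    """d' for each pair of neighbouring steps: the difference of their z scores."""
    z = NormalDist().inv_cdf
    z_scores = [z(identification(step, boundary)) for step in range(1, n_steps + 1)]
    return [later - earlier for earlier, later in zip(z_scores, z_scores[1:])]

def peak(values):
    """Return (index, value) of the largest element."""
    index = max(range(len(values)), key=values.__getitem__)
    return index, values[index]

# Two continua that differ only in where the category boundary falls.
peak_a = peak(adjacent_d_primes(boundary=3.5))
peak_b = peak(adjacent_d_primes(boundary=4.5))
print(peak_a, peak_b)
```

Shifting the boundary by one step moves the peak by one step and leaves its height unchanged: the position of maximum sensitivity changes, but the amount of sensitivity does not.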
Pitt (1995) makes much of the fact that the change in identification functions induced by lexical information is different from that induced by a monetary payoff designed to bias subjects toward one response over the other. Following Connine and Clifton (1987), he argues that these differences could be evidence that the lexical effects are due to feedback from the lexicon. However, there are two quite distinct ways in which bias can influence phoneme identification. Monetary payoff and lexical information appear to operate in these different ways, but neither requires top-down feedback.
Monetary payoff tends to shift the functions vertically, while lexical bias produces a horizontal shift. The simple interpretation of the vertical shift is that subjects have a general bias to respond with the financially favored phoneme on some proportion of trials. Massaro and Cowan (1993) call this ``decision bias'' and distinguish it from ``belief bias''; an example of the pattern can be clearly seen in the fast RTs in Pitt's Figure 6 (p. 1046). The lexical shift, on the other hand, reflects a bias toward lexically favored responses on the basis of less bottom-up support than lexically unfavored responses. This leads to the horizontal shift of the boundary illustrated in Pitt's Figures 1 and 2 (pp. 1040, 1041). Two different patterns of data, but both are the result of biases.
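The two kinds of bias can be written as two different transformations of the same identification function. This is a sketch only: the logistic baseline, the guessing rate and the size of the boundary shift are assumed values, and the function names are hypothetical.

```python
import math

def baseline(step, boundary=4.5, slope=1.5):
    """Hypothetical unbiased identification function for the favored response."""
    return 1.0 / (1.0 + math.exp(-slope * (step - boundary)))

def with_decision_bias(step, guess_rate=0.2):
    """Favored response given outright on a fixed proportion of trials.

    This raises the whole function, most visibly at the unfavored end:
    a vertical shift.
    """
    return guess_rate + (1.0 - guess_rate) * baseline(step)

def with_lexical_bias(step, boundary_shift=1.0):
    """Favored response accepted on less bottom-up support.

    This moves the category boundary and leaves the endpoints almost
    untouched: a horizontal shift.
    """
    return baseline(step, boundary=4.5 - boundary_shift)

for step in range(1, 9):
    print(step,
          round(baseline(step), 3),
          round(with_decision_bias(step), 3),
          round(with_lexical_bias(step), 3))
```

Both transformations are applied to the output of an unchanged bottom-up analysis, so neither pattern in this sketch requires information to flow back to the prelexical level.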
To begin to make a case against autonomous models on the issue of sensitivity, one would need to demonstrate that lexical information could actually improve phoneme discriminability. That is, lexical information in a word-nonword continuum should result in improved discriminability (greater accuracy in paired-alternative discrimination or an increase in the number of Just Noticeable Differences [JNDs] between the two ends of the continuum) relative to a nonword-nonword continuum. But no study has yet shown evidence that lexical information can produce any increase in the sensitivity of phoneme discrimination. So, although SDT may seem to offer a simple method of distinguishing between autonomous and interactive models, its use is far from straightforward and there are no SDT studies to date which allow us to distinguish between the models.
Even if lexical feedback can't improve the accuracy of recognition, it might help speed the recognition process. But consider what would happen in a model like TRACE if we optimized all connections to perform both phoneme and word recognition as quickly and accurately as possible in the absence of top-down feedback. Feedback could definitely speed recognition of both words and phonemes (in exactly the same way as context or frequency could speed word recognition in the logogen model), but the effect of this speed-up would be for the system to respond on the basis of less perceptual information than before. As feedback cannot improve the accuracy of word recognition, however, faster responses made on the basis of less perceptual information must also be less accurate. Also, following the arguments in the previous sections, it is only when the top-down information is completely reliable that there can be an increase in phoneme recognition speed without a decrease in accuracy. Furthermore, autonomous models such as FLMP (Massaro 1987) and Merge (which will be presented in section 5 below) can engage in a similar speed-accuracy trade-off by reducing the recognition criterion for word identification or choosing to place more emphasis on lexical than phonemic information when performing phoneme identification. Thus lexical information can speed recognition in interactive models, but no more than the same lexical information can speed recognition in bottom-up models.
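The link between responding earlier and responding less accurately can be illustrated with a textbook signal detection calculation. The sketch below is not part of TRACE, FLMP or Merge. It assumes a two-alternative decision based on the sum of independent Gaussian evidence samples, with hypothetical per-sample mean `mu` and standard deviation `sigma`.

```python
import math

def accuracy_after(n_samples, mu=0.2, sigma=1.0):
    """Probability of a correct two-alternative decision after summing
    n_samples independent Gaussian evidence samples (assumed model).
    The sum has mean n*mu and standard deviation sigma*sqrt(n), so
    P(correct) = Phi(mu * sqrt(n) / sigma)."""
    z = mu * math.sqrt(n_samples) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Responding earlier means responding on fewer samples.
for n in [4, 16, 64]:
    print(n, round(accuracy_after(n), 3))
```

In this simplified setting, any mechanism that makes the system respond after fewer samples, whether feedback or a lowered criterion, lowers accuracy unless the extra information it supplies is itself reliable.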
The case we have made against feedback in spoken word recognition has a direct parallel in visual word recognition. If we replace 'phoneme' with 'letter', then essentially the same arguments apply to reading as to speech. A review of empirical work in reading reveals a state of affairs closely analogous to that in speech: there is very solid evidence that lexical factors can influence letter recognition, but little evidence that this must be achieved via feedback. In visual word recognition research the interaction debate has concentrated on the proper explanation of the Word Superiority Effect (WSE). With brief presentations and backward masking, letters can be more readily identified in words than in nonwords, and more readily identified in words than when presented alone (Reicher 1969). According to McClelland and Rumelhart (1981), this lexical advantage is to be explained by top-down feedback from lexical nodes to letter nodes in their Interactive Activation Model (IAM). In this respect the explanation of the WSE given by the IAM is exactly the same as the explanation of lexical effects given by TRACE: feedback from the lexical level activates letter nodes and makes letters which are consistent with words more readily identifiable. McClelland and Rumelhart's interactive account is also consistent with the results of many studies which have shown that letters in pronounceable nonwords are easier to identify than letters in unpronounceable nonwords (e.g. Aderman & Smith 1971) and that letters in pseudowords are easier to identify than letters alone (McClelland & Johnston 1977). Rumelhart and McClelland (1982) also found an advantage for letters in ``word-like'' all-consonant strings like SPCT over ``nonword-like'' strings such as SLQJ. However, although these data are consistent with interactive theories, the WSE can also be explained without interaction. This has been clear for more than two decades (see e.g. Massaro 1978).
Two more recent models have succeeded in capturing the crucial empirical findings on the
WSE without recourse to feedback. First, the Activation-Verification Model (AVM) of Paap
et al. (1982) has provision for a top-down verification process; however, its explanation
of the WSE involves no feedback at all. In the AVM, visual input first activates a set of
letters, which in turn activate a set of candidate words. With verification in operation,
words in the candidate set are verified against the input. However, the verification
process is assumed to be unable to operate in the case of the brief pattern-masked
displays used to study the WSE. The AVM account of the WSE is therefore bottom-up. Under
these circumstances, letter identification decisions can be made by pooling the letter
identity information with information from any lexical candidates activated above a given
threshold. If the total lexical information exceeds a second threshold, then letter
identification decisions are made on the basis of the lexical information alone. As in the
Race model, therefore, lexical effects on letter identification come about because letter
information is available from the lexicon. However, in contrast to the Race model, the
lexical information can be derived from more than one word candidate, and letter and word
information can be pooled together. As in the Race model, the letter identity information
read out from the lexicon is not fed back to the initial stage of letter identification.
Because the decision process can pool lexical information from a number of candidates, the
model can account for the fact that letters in pseudowords are better identified than in
unpronounceable letter strings. A pseudoword is likely to activate words containing some
of the same letters in the same positions. An unpronounceable letter string is unlikely to
activate many words and those words are unlikely to contain the same letters in the same
positions as the pseudoword.
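The bottom-up decision rule just described can be sketched in a few lines. This is a minimal illustration of the verbal description above, not the Paap et al. (1982) implementation: the additive pooling, the function name `avm_letter_decision`, the threshold values and the activation numbers are all assumptions made for the example.

```python
def avm_letter_decision(letter_evidence, word_activations, position,
                        word_threshold=0.3, lexical_only_threshold=1.0):
    """Choose a letter for one position without any feedback.

    letter_evidence: dict letter -> bottom-up evidence at this position
    word_activations: dict word -> activation of that lexical candidate
    Both thresholds are hypothetical values, not taken from the AVM.
    """
    # Only candidates activated above the first threshold contribute.
    candidates = {w: a for w, a in word_activations.items() if a >= word_threshold}

    # Each candidate votes for the letter it contains at this position.
    lexical_evidence = {}
    for word, activation in candidates.items():
        letter = word[position]
        lexical_evidence[letter] = lexical_evidence.get(letter, 0.0) + activation

    total_lexical = sum(candidates.values())
    if total_lexical > lexical_only_threshold:
        # Strong lexical information: decide on lexical information alone.
        pooled = lexical_evidence
    else:
        # Otherwise pool letter-level and lexical information.
        pooled = dict(letter_evidence)
        for letter, evidence in lexical_evidence.items():
            pooled[letter] = pooled.get(letter, 0.0) + evidence
    return max(pooled, key=pooled.get)

# Third letter of the pseudoword MAVE, with slightly misleading letter evidence.
letters = {"V": 0.40, "Y": 0.45}
print(avm_letter_decision(letters, {"HAVE": 0.35, "GAVE": 0.35, "MAKE": 0.2}, 2))  # V
print(avm_letter_decision(letters, {}, 2))                                         # Y
```

In the example, word candidates sharing the letter in the same position rescue the correct response, while a string that activates no candidates is decided on letter evidence alone. Nothing is written back to `letter_evidence`, which is the sense in which the account is bottom-up.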
Second, the Dual Read Out Model (DROM) of Grainger and Jacobs (1994) can also give a feedback-free account of the WSE. Architecturally, this model is similar to the IAM of Rumelhart and McClelland. Both models are IAMs with levels corresponding to features, letters and words. The main difference is that, in the DROM, subjects can base their decisions either on letter-level activation (the only possibility in the IAM) or by reading out orthographic information from the most activated word (as opposed to a set of words in the AVM). Grainger and Jacobs examined the behaviour of their model both with and without top-down feedback. Without top-down feedback the DROM is essentially the visual equivalent of the Race model, and there is none of the pooling of letter and lexical information that takes place in the AVM. Although Grainger and Jacobs suggest that the DROM slightly underestimates the size of the pseudoword advantage in the absence of feedback, they point out that this problem could be overcome if there were an extra level of orthographic representation between the letter and word.
In contrast to the IAM, both the AVM and DROM are what Cutler et al. (1987) term multiple-outlet models. Both lexical effects and pseudoword effects can be explained by permitting decisions to be made on the basis of either letter or lexical information, without any need for the processes delivering this evidence to interact. In the case of visual word recognition there appears to be no sign of an imminent resolution of the interaction/autonomy debate. Although the WSE and related findings give the appearance of being evidence for feedback, as we have shown, a number of bottom-up explanations are also available. By Occam's principle then, the bottom-up theories should be preferred to the interactive theories.
We have argued that there is no need for the results of word recognition to be made known to earlier stages. Stages of phoneme or letter recognition simply do their best and pass that information on. Nothing they do later depends on whether their outputs agree with decisions reached at the lexical level. However, this relationship between levels does not necessarily hold throughout language processing; there may be cases in which feedback could indeed confer advantages. In research on syntactic processing, for example, there has been a lively debate as to whether syntactic analysis is independent of higher level processes such as semantics, and this debate is by no means resolved. Note, however, that terminology in this area can contrast with that used in word recognition. In parsing there are theories referred to as ``autonomous'' which allow some feedback from semantics to syntax, and theories called ``interactive'' which maintain autonomous bottom-up generation of syntactic parses! As we shall show, however, examining the characteristics of the models makes it possible to compare them in the framework that we have been using throughout this discussion.
An important early model in this field was the ``garden-path'' theory of Frazier (1979, 1987). In Frazier's model, the parser generates a single syntactically determined parse. Initial choice of the parse is entirely unaffected by higher level processes, and Frazier laid great emphasis on this aspect of her model: syntax maintained autonomy from semantics. However, the model also needs to explain what will happen if this unique initial choice of parse turns out to be wrong. In a classical garden path sentence like Tom said that Bill will take the cleaning out yesterday, for example, the initial parse leads to the semantically implausible interpretation that Bill is going to perform a future action in the past. Worse still, sentences like The horse raced past the barn fell can lead to the situation where no successful analysis at all is produced; although they are grammatical sentences of English, the syntactic processor simply fails to produce a complete output, in that The horse raced past the barn is assigned a full analysis which leaves no possible attachment for the subsequent word fell.
Frazier's model assumes that in such cases the system will need to reanalyse the input and to generate an alternative parse. But the fact that this must be a different parse from the one that was first generated compromises the autonomy of the model. That is, the parser needs to be told that the earlier interpretation was unsatisfactory and that another parse should be attempted, since otherwise it will simply produce the same output once again. To this end, information must be conveyed from the interpretive mechanism to the parser. This feedback simply takes the form of an error message: produce another parse, but not the same one as last time. Higher level processes still have no direct control over the internal operation of the parser. Nevertheless, in order to account for successful eventual resolution of garden paths, Frazier's ``autonomous'' model must incorporate some degree of informational feedback.
Alternative models of syntactic processing include fully interactive theories in which semantic, or other higher level information, can directly constrain the operation of the syntactic parser (McClelland, St. John & Taraban 1989; Taraban & McClelland 1988). However, there is also the approach of Altmann, Steedman and colleagues (Altmann & Steedman 1988; Crain & Steedman 1985; Steedman & Altmann 1989), which the authors termed ``weakly interactive''. In this approach the syntactic processor is held to generate potential parses in a fully autonomous manner, but in parallel: the alternative candidate parses are then evaluated, again in parallel, against the semantic or discourse context. The interpretations are constructed incrementally and continually revised and updated, such that most alternatives can be quickly discarded; indeed, it was assumed that strict time limits applied on the maintenance of parallel candidates, and that these time limits explained why the wrong parse could triumph in a garden path sentence. Reanalysis of garden paths requires, in this approach, no constraining feedback from higher-level to syntactic processing, since it can be achieved by repeating the same syntactic generation of alternative parses, but relaxing the time limits applying at the selection stage.
Although termed ``interactive'' by its authors, this model allows no feedback from higher-level processing to influence which parse is generated. This renders it effectively more autonomous than Frazier's model. Probably the leading current models of syntactic processing are found among the class of constraint satisfaction models (e.g. Boland 1997; MacDonald, Pearlmutter & Seidenberg 1994; Trueswell & Tanenhaus 1994); these models differ in their details, but in general share with the ``weak interaction'' approach the feature that syntactic analyses are computed in parallel and that higher-level information, though it is used early in processing, constrains selection of syntactic structure but not initial generation.
Boland and Cutler (1996) compared the way the labels ``autonomous'' and ``interactive'' were used in the word recognition and parsing literature, and concluded that these terms were not adequate to capture the true dimensions of difference between the models. The two research areas differed, they pointed out, in whether debate about the influence of higher-level processing concerned principally the generation of outputs by the lower-level process, or selection between generated outputs. In word recognition, there is debate about the autonomy of the initial generation process, but relative unanimity about the availability of higher-level information to inform the final selection between generated candidates. In parsing, in contrast, there is comparative agreement among models that the initial generation of syntactic structure is autonomous, but lively debate about whether selection of the correct parse takes higher-level information into account or not. What is notable, however, is that to argue for the strictest autonomy of initial syntactic processing, with the processor producing only a single output, necessarily implies allowing for at least a minimal form of feedback to account for the known facts about the processing of garden path sentences.
Of course, a system which avoided all feedback between semantics and syntax could be achieved if the parser were to have no capacity limitations, so that it could pursue all parses in parallel. In this case syntactic garden paths would never arise (for further discussion of this point, see Norris 1987); but they do arise, so this system cannot be the correct one. Here we begin to get a crucial insight into the factors that determine the value of feedback. Our model, like all models of word recognition, embodies the assumption that prelexical processing can consider the full set of prelexical units - phonemes or letters - in parallel. But consider what might happen if phoneme recognition were a serial process, in which each of the 40-plus phonemes of English had to be tested against the input in sequence. In such circumstances, an advantage might accrue if lexical information were allowed to determine the order in which phonemes were tested, so that lexically more probable phonemes were tested first. Testing the most probable phoneme first could confer a considerable advantage on the speed with which that phoneme could be identified, at only a marginal cost in the recognition of other phonemes if the lexical information proved inaccurate. So, our argument against feedback in word recognition can now be seen to rest on the important assumption that phoneme recognition is a parallel process. Note that this assumption also covers our earlier comments about verification models. If the system is parallel, and not resource-limited, then all phonemes should be fully analysed to the best of the system's capability. That is, there is no advantage in producing an initial low-quality analysis which is then improved on instruction from higher levels.
From this point of view it is clear that our argument against feedback in word recognition cannot necessarily be applied across the board to every relationship between different levels of language processing. The question of syntactic-semantic interaction has led to a different debate than the case of prelexical versus lexical processing; models both with and without feedback have again been proposed, but the role of feedback is not the same in all models. The precise function and the necessity of feedback can only be evaluated in the light of constraints specific to the type of processing involved.
We have argued that there are no good a priori reasons for favouring interactive models over autonomous models of spoken word recognition. Feedback in bias models like TRACE is not able to improve word recognition. Interaction of this type could improve phoneme recognition, but it does so at the cost of making phonemic decisions harder when the input is inconsistent with lexical knowledge, and at the cost of potential misperceptions (the perception of lexically consistent phonemes even when they did not occur in the speech input). Although feedback could potentially act to improve perceptual sensitivity, recent studies suggest that lexical context has a purely biasing effect on phoneme identification (Pitt 1995; Massaro & Oden 1995).
We have further argued that feedback is also not required in visual word recognition. Autonomous models of reading are to be preferred since there are no data which require there to be top-down interaction. It is clear that modular models are particularly well-suited to the constraints of word recognition. Because the full set of prelexical units (phonemes or letters) can be considered in parallel, feedback cannot improve performance at either the lexical or prelexical level. In sentential processing, however, resource limitations which prevent the parallel examination of all possible parses could at least in principle make the use of feedback beneficial. However, even here the extent of the interaction remains an empirical issue. Adopting Occam's razor we should still assume only the minimum degree of feedback required by the data. It is also noteworthy that although the constraints on the production of language might suggest a role for feedback loops in that process - for example as a control mechanism - it again appears that feedback is not required, and it is not incorporated into the latest model of the process (Levelt, Roelofs & Meyer 1999).
On these grounds alone, therefore, one should be tempted to conclude in favor of autonomous models. But such a conclusion cannot be adopted without examination of the available data, since it remains possible that there are data which can be accounted for by interactive but not by autonomous models. We therefore turn now to an examination of this evidence, again focussing on lexical involvement in phonemic decision making. Although it has proved difficult to resolve the debate between interactive and autonomous models in the visual case, new data on spoken-word recognition, some of which take advantage of phenomena specific to speech, have provided evidence which strongly favors autonomous theories over interactive ones. We begin by looking specifically at data which have challenged either TRACE or the Race model, or both.
Because the Race model and TRACE are both designed to account for the same general set of phenomena, few of the findings in the literature present an insurmountable problem for either model. However, there are a number of results showing variability in lexical effects which appear to be more consistent with the underlying principles of the Race model than with TRACE. Cutler et al. (1987) characterized the Race model as a multiple-outlet model. Responses can be made via either a lexical or prelexical outlet. TRACE, on the other hand, has only a single outlet. All phoneme identification responses must be made by reading phonemic information from the phoneme nodes in TRACE. One consequence of this difference is that, according to the Race model, it should be possible to shift attention between the two outlets. That is, lexical effects should not be mandatory. To the extent that attention can be focussed on the prelexical outlet, lexical effects should be minimized. Conversely, lexical effects should be at their greatest when attention is focussed on the lexical outlet.
This is exactly the pattern of results that has been observed in a number of studies. Cutler et al. (1987) showed that the lexical effects found in monitoring for initial phonemes in monosyllabic targets were dependent on the composition of filler items in the experiment. Lexical effects were only present when filler items varied in syllabic length. There were no lexical effects with monosyllabic fillers. Cutler et al. argued that the monotonous nature of the monosyllabic filler condition led subjects to focus their attention at the prelexical outlet, with the effect that any potential influence of lexical information would be attenuated. This shift in attention between outlets is a natural consequence of the Race model architecture. However, to account for the same effect in TRACE would require the model to be able to modulate the overall weighting of the word-phoneme feedback connections. (A similar suggestion has been made for varying the weight of letter-to-word inhibition in response to experimental conditions in visual word recognition; Rumelhart & McClelland 1982.) But if word-phoneme feedback connections were important for the proper functioning of the speech recognition system, it is not clear why it should ever be either possible, or desirable, to reduce their effectiveness.
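The multiple-outlet account of these attention shifts can be sketched as a simple race. This is a deliberately deterministic toy, not the Race model as implemented by Cutler and Norris (1979): the function name `race_rt`, the `attend_lexical` switch and the finishing times in milliseconds are assumptions chosen for the example.

```python
def race_rt(prelexical_ms, lexical_ms=None, attend_lexical=True):
    """Response time when a phoneme response can come from either outlet.

    lexical_ms is None for nonwords, which have no lexical entry to read
    phonemic information from. With attention on the prelexical outlet
    only, the lexical route is ignored.
    """
    if lexical_ms is None or not attend_lexical:
        return prelexical_ms
    return min(prelexical_ms, lexical_ms)

def lexical_effect(attend_lexical):
    """Nonword RT minus word RT for hypothetical finishing times."""
    word = race_rt(prelexical_ms=500, lexical_ms=450, attend_lexical=attend_lexical)
    nonword = race_rt(prelexical_ms=500, lexical_ms=None, attend_lexical=attend_lexical)
    return nonword - word

print(lexical_effect(attend_lexical=True))   # 50
print(lexical_effect(attend_lexical=False))  # 0
```

In this sketch the lexical effect can only be facilitatory and disappears when attention is directed to the prelexical outlet, with no change to the processes that deliver the evidence.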
Further evidence that lexical effects in phoneme monitoring are volatile and depend on having listeners focus their attention at the lexical level comes from a set of experiments by Eimas, Marcovitz Hornstein and Payton (1990) and Eimas and Nygaard (1992; see also Foss & Blank 1980; Foss & Gernsbacher 1983; Frauenfelder & Seguí 1989; Seguí & Frauenfelder 1986). Eimas et al. (1990) found that lexical effects on phoneme-monitoring targets in syllable-initial position in items in lists emerged only with the inclusion of a secondary task which oriented attention towards the lexical level. So, lexical effects emerged with a secondary task of either noun versus verb classification, or lexical decision, but not with a secondary length-judgement task. Eimas and Nygaard (1992) extended this work by showing that there were no lexical effects on target detection in sentences, even with secondary tasks. They suggested that when listening to sentences subjects could perform the secondary task by attending to a sentential (syntactic) level of representation. Attention would then be allocated to this level of processing, and phoneme monitoring would be based on prelexical codes. Their data are particularly puzzling from the interactive standpoint. If interaction is important in the normal process of sentence understanding, it is strange that this is exactly the situation where it is hardest to obtain evidence of lexical effects.
The idea that lexical effects have to be specially engineered also emerges from studies
of phonetic categorization. Burton, Baum and Blumstein (1989) found that lexical effects
were present only in the absence of complete phonetic cues. McQueen (1991) studied lexical
influences on categorization of word-final fricatives. At the end of words, top-down
effects in a model like TRACE should be at their maximum. Furthermore, the stimuli
included fricatives which were ambiguous between /s/ and /ʃ/. With the input in
this finely balanced state, these should have been the ideal conditions to observe the
lexical influences that are predicted by a model like TRACE. However, McQueen found that
lexical effects emerged only when the stimuli were low-pass filtered at 3 kHz. That is,
stimuli had to be not only phonetically ambiguous, but perceptually degraded too. A rather
weaker conclusion about the importance of degradation in obtaining lexical effects was
reached by Pitt and Samuel (1993) in their review of lexical effects in phonetic
categorization. Although they concluded that degradation was not actually a necessary
precondition for obtaining lexical effects, there seems to be little doubt that lexical
effects in categorization are enhanced by degradation. In both phonetic categorization and
phoneme monitoring therefore, lexical effects are not as ubiquitous as might be expected
from interactive models if such effects were due to a mechanism that could improve
recognition performance.
A further piece of data which appears problematic for TRACE comes from a phoneme monitoring experiment by Frauenfelder, Seguí and Dijkstra (1990). In a study conducted in French, they had subjects perform generalized phoneme monitoring on three different kinds of target. Target phonemes could appear in words after the uniqueness point (e.g. /l/ in vocabulaire), in nonwords derived from the word by changing the target phoneme (/t/ in vocabutaire) or in control nonwords (socabutaire). They argued that TRACE should predict that targets in the derived nonwords be identified more slowly than in control nonwords because the lexically expected phoneme should compete with the target due to top-down facilitation. However, according to the Race model, lexical effects on phoneme identification can only be facilitatory. As predicted by the Race model, there was indeed no difference between the nonword conditions, though both were slower than the word condition.
Wurm and Samuel (1997) replicated the Frauenfelder et al. (1990) findings but raised the possibility that inhibitory effects might be masked because the nonwords in which inhibition might be expected were easier to process than the control nonwords. They presented results from a dual task study which were consistent with their view that the experimental and control nonwords were not equally difficult. Nevertheless, there is still no direct evidence for inhibitory lexical effects in phoneme monitoring. We should also bear in mind that the claim that TRACE predicts inhibition from the lexicon is specific to the particular implementation of TRACE rather than true of interactive models in general (Peeters, Frauenfelder & Wittenburg 1989). We will return to this issue later when discussing simulations of these results. For the moment we will simply note that TRACE could be modified to incorporate at the phoneme level a priority rule similar to Carpenter and Grossberg's (1987) ``two-thirds rule''. In the context of a simple interactive activation model, this would mean that top-down activation would only have an effect when at least some bottom-up activation was present. That is, feedback from lexical to phonemic nodes would be contingent on there being at least some perceptual support for the phoneme. The input vocabutaire would then not activate /l/ at all, and /l/ would therefore not inhibit /t/.
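The priority rule described above amounts to gating top-down input by the presence of bottom-up input. The sketch below illustrates only that gating idea. It is not Carpenter and Grossberg's (1987) formulation and not a modification of the actual TRACE code; the function name and the activation values are hypothetical.

```python
def phoneme_input(bottom_up, top_down, gated=True):
    """Net input to a phoneme node.

    With gating on, lexical (top-down) support counts only if there is
    at least some perceptual (bottom-up) support for the phoneme.
    """
    if gated and bottom_up <= 0.0:
        return 0.0
    return bottom_up + top_down

# Input vocabutaire: /t/ is heard, /l/ is only lexically expected.
l_ungated = phoneme_input(bottom_up=0.0, top_down=0.5, gated=False)
l_gated = phoneme_input(bottom_up=0.0, top_down=0.5, gated=True)
t_gated = phoneme_input(bottom_up=0.8, top_down=0.0, gated=True)
print(l_ungated, l_gated, t_gated)  # 0.5 0.0 0.8
```

With the gate in place, the lexically expected /l/ receives no input and therefore has no activation with which to inhibit /t/.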
A strong apparent challenge to autonomous models comes from an ingenious study by Elman and McClelland (1988). As mentioned above, a common criticism of models with feedback is that they run the risk of misperceiving speech. That is, if top-down information can actually determine which lower-level representations are activated, the system may perceive events which, although consistent with top-down expectation, are not actually present in the real world. In the case of TRACE, top-down activation feeding back from lexical nodes to phoneme nodes leads to activation of the phoneme nodes which is indistinguishable from activation produced by bottom-up input from featural information. Elman and McClelland took advantage of this property of TRACE to devise a test that would distinguish between the predictions of interactive models like TRACE and of autonomous models like the Race model.
We have seen that lexical effects like the Ganong (1980) effect, in which an ambiguous
stimulus on a type-dype continuum is more likely to be classified in accord with
the word (type) than the nonword (dype), can be explained by both
interactive and autonomous models. However, according to TRACE, the lexical bias will
actually alter the activation of the component phonemes. An ambiguous phoneme /?/ midway
between /ʃ/ and /s/ will thus activate the /ʃ/ phoneme node in fooli? and the /s/ node
in christma?. Elman and McClelland (1988) harnessed the Ganong effect to a lower-level
effect of compensation for coarticulation (Mann & Repp 1981) according to which the
position of the boundary between /t/ and /k/ is closer to /k/ following /ʃ/ (i.e. there
are more /t/ responses) and closer to /t/ (more /k/ responses) following /s/.
If the lexical bias in phonetic categorization has its locus only at the output of
phonemic information from the lexicon, as suggested by the Race model, the ambiguous
phonemes in fooli? and christma? should behave in the same way at the
prelexical level. The ambiguous phonemes are identical and they should have identical
effects on a following phoneme midway between /t/ and /k/. However, if TRACE is correct,
the lexical contexts foolish and christmas will determine whether /ʃ/ or /s/ is
activated, which should, in turn, produce an effect of compensation for
coarticulation, just as if the listener had heard a real /ʃ/ or /s/. In line with
the predictions of the interactive model, Elman and McClelland found evidence of
compensation for coarticulation even with the ambiguous phoneme /?/.
One possible way in which proponents of autonomous models could avoid accepting these
data as evidence against autonomy is to suggest that the results are due entirely to
effects operating at the prelexical level. As an illustration of how this might be
possible, Norris (1993) presented a simulation of Elman and McClelland's result using a
simple recurrent network. In one of the simulations the network had no word nodes at all.
The network learned to use several phonemes of context in making decisions about phoneme
identity. A similar simulation has also been reported by Cairns, Shillcock, Chater and
Levy (1995). One might assume, as in TRACE simulations, that if there is a bias to
interpret /?/ as /ʃ/ in the context fooli this must be
because of top-down feedback from a node corresponding to foolish at the lexical
level. But in the Norris (1993) and Cairns et al. (1995) simulations, the phoneme nodes
themselves learned something about the statistical properties of the language, that is,
which contexts they are most likely to appear in. It is this within-level statistical
information that leads to apparent interactive effects in these simulations.
Cairns et al. (1995) showed on the basis of an analysis of a large corpus of spoken
English that after /ə/, /s/ is more likely than /ʃ/, and after /ɪ/, /ʃ/ is more likely
than /s/. All of Elman and McClelland's (1988) /s/-final words ended /əs/ and all of
their /ʃ/-final words ended /ɪʃ/. Their materials therefore contained
sequential probability biases that could in principle be learnt at the prelexical level.
Elman and McClelland's results thus do not distinguish between interactive and autonomous
models since they can be explained either by top-down processing or by a sequential
probability mechanism operating prelexically.
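A sequential probability bias of this kind is simply a conditional probability estimated from counts of adjacent segments. The sketch below shows the calculation on invented data. The toy sequences, the vowel labels `a` and `i`, and the use of `S` for the palato-alveolar fricative are placeholders chosen for the example; they are not the Cairns et al. (1995) corpus or its figures.

```python
from collections import Counter, defaultdict

def transitional_probabilities(sequences):
    """Estimate P(next segment | previous segment) from adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[prev][nxt] += 1
    return {
        prev: {nxt: count / sum(counts.values()) for nxt, count in counts.items()}
        for prev, counts in pair_counts.items()
    }

# Invented vowel-fricative sequences (hypothetical counts).
toy_corpus = [
    ["a", "s"], ["a", "s"], ["a", "s"], ["a", "S"],
    ["i", "S"], ["i", "S"], ["i", "s"],
]
tp = transitional_probabilities(toy_corpus)
print(tp["a"])  # "s" is the more probable continuation after "a"
print(tp["i"])  # "S" is the more probable continuation after "i"
```

Because the estimate uses only segment-to-segment statistics, a prelexical mechanism that had learned such probabilities could show a context-dependent bias without consulting any word-level representation.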
Pitt and McQueen (1998) have tested these two competing explanations. They used nonword
contexts ending with unambiguous or ambiguous fricatives. The contexts contained
transitional probability biases; in one nonword /s/ was more likely than /ʃ/, while in
the other /ʃ/ was more likely than /s/. These contexts were
followed by a word-initial /t/-/k/ continuum. Categorization of the ambiguous fricative
reflected the probability bias. There was also a shift in the identification function for
the following /t/-/k/ continuum suggesting that compensation for coarticulation was being
triggered by the probability bias. These results lend support to the view that Elman and
McClelland's (1988) results were due to transitional probability biases rather than to the
effects of specific words. The original results can therefore no longer be taken as
support for interactive models.
The transitional probability effect is consistent with both autonomous models (where
the probability bias is learnt prelexically) and interactive models (where the bias could
be due either to top-down connections from the lexicon or to a prelexical sensitivity to
sequential probabilities). The compensation for coarticulation data presented so far
therefore do not distinguish between TRACE and the Race model. But other conditions in
Pitt and McQueen (1998) produced data which challenge interactive but not autonomous
models. Two word contexts were used (juice and bush) where the
transitional probabilities of /s/ and /ʃ/ were matched. There was no shift in
the stop identification function following jui? and bu?, suggesting that
the compensation for coarticulation mechanism is immune to effects of specific lexical
knowledge. Crucially, however, there were lexical effects in the identification of the
ambiguous fricative (more /s/ responses to jui? than to bu?).
These data are problematic for TRACE, since the model predicts that if the lexicon is acting top-down to bias fricative identification, the changes in activation levels of phoneme nodes produced by feedback should also trigger the compensation for coarticulation process. TRACE is therefore unable to handle the dissociation in the word contexts between the lexical effect observed in fricative labelling and the absence of a lexical effect in stop labelling. Furthermore, if TRACE were to explain both lexical effects in words and sequential probability effects in nonwords as the consequences of top-down connections, the model would be unable to handle the dissociation between the compensation effect in the nonword contexts and the lack of one in the word contexts. This latter dissociation therefore suggests that sensitivity to sequential probabilities should be modelled at the prelexical level in TRACE. Consistent with this view is a recent finding of Vitevitch and Luce (1998). They observed, in an auditory naming task, different sequential probability effects in words and nonwords. They argued that the facilitatory effects of high-probability sequences observed in nonwords were due to prelexical processes, while the inhibitory effects of high-probability sequences observed in words were due to the effects of competition among lexical neighbors sharing those (high-probability) sequences. But even if the compensation effect in nonword contexts could thus be explained in TRACE by postulating processes sensitive to sequential probabilities at the prelexical level, TRACE would remain unable to explain the dissociation in the word contexts between the lexical effect in fricative identification and the absence of lexical involvement in stop identification.
Pitt and McQueen's compensation data are however not problematic for the Race model. If the compensation for coarticulation process is prelexical, and there is sensitivity at that level to sequential probabilities (as the results of Vitevitch and Luce 1998 also suggest), the Race model can explain the nonword context results. Also in the Race model, fricative decisions in the word contexts can be based on output from the lexicon, but in line with the data, the model predicts that lexical knowledge cannot influence the prelexical compensation process. Clearly, however, the Race model would require development in order to give a full account of these data (specifically it requires the inclusion of prelexical processes that are sensitive to phoneme probabilities and a prelexical compensation mechanism as in the Norris and the Cairns et al. simulations). Nevertheless, Pitt and McQueen's nonword context results clearly undermine what had seemed to be the strongest piece of evidence for interaction. Furthermore, the word context results undermine models with feedback. The study also serves as a cautionary reminder that low-level statistical properties of the language can give rise to effects which can easily masquerade as top-down influences.
Samuel (1997) has recently reported data which he claims argue strongly for
interaction. Using a phoneme restoration paradigm he presented listeners with words in
which a given phoneme (/b/ or /d/) had been replaced by noise, and showed that these words
produced an adaptation effect: There was a shift in the identification of stimuli on a
/bɪ/-/dɪ/ continuum relative to the pre-adaptation baseline. There was no such effect when the
phonemes were replaced by silence. Samuel (1997) argued that the noise-replaced phonemes
were being perceptually restored, and that these restored phonemes were producing
selective adaptation, just as if the actual phonemes had been presented. In support of his
claim that these adaptation effects have a perceptual locus, Samuel showed that the
adaptation effect observed with intact adaptors was not influenced by lexical factors.
However, the main problem in determining the implications of this study for the question
of interaction is that, in contrast to the compensation for coarticulation effect studied
by Elman and McClelland (1988), we do not know the locus of the adaptation effect in this
situation. Although it is clear that there are low-level components of selective
adaptation (e.g. Sawusch & Jusczyk 1981), recent results suggest that adaptation
operates at a number of different levels in the recognition system (Samuel & Kat
1996), including more abstract levels (labelled "categorical" by Samuel & Kat 1996).
It remains to be established what level(s) of processing are responsible for the
adaptation observed in Samuel's (1997) restoration study. If this adaptation effect is not
influencing prelexical processing it would not inform us about interaction. Indeed, the
model we will present below could account for these data by assuming that adaptation with
restored phonemes has its main influence on an output process, where categorical decisions
are made. Consistent with Samuel's experiments with intact adaptors, we would not expect
to see lexical effects where phoneme categorization had been determined by a clear
acoustic signal.
A further problem is that the pattern of adaptation produced by the restored phonemes differs somewhat from standard adaptation effects. Normal adaptation effects are usually found almost exclusively in the form of a boundary shift (e.g. Samuel 1986; Samuel 1997, Experiment 1). However, in the condition showing the greatest restored adaptation effect, the shift is practically as large at the continuum endpoints as at the boundary. The small shift that can be induced by a restored phoneme appears to take the form of an overall bias not to respond with the adapted phoneme.
Samuel's results contrast with those of Roberts and Summerfield (1981) and Saldaña and Rosenblum (1994), who used the McGurk effect (McGurk & MacDonald 1976) to investigate whether adaptation was driven by the acoustic form of the input or by its phonemic percept. The Saldaña and Rosenblum study took advantage of the fact that an auditory /ba/ presented with a video of a spoken /va/ is perceived as /va/. However, adaptation was determined by the auditory stimulus and not by the phonemic percept. Even though the combination of auditory /ba/ and visual /va/ was perceived as /va/ all of the time by 9 out of 10 of their subjects, the effect of adaptation in the Auditory + Visual case was almost identical (in fact, marginally bigger in the A+V case) to that with an auditory /ba/ alone. There was no trace of any top-down effect of the percept on the adaptation caused by the auditory stimulus. (Although these studies show that adaptation does not depend on the phonemic percept, see Cheesman and Greenwood 1995 for evidence that the locus of the effect can indeed be phonemic or phonetic rather than acoustic.) Although the absence of any effect of the percept might suggest that output or response processes cannot be adapted, it is also consistent with the view that the primary locus of adaptation is acoustic/phonetic and adaptation at an output level can only be observed in the absence of acoustic/phonetic adaptation.
Thus to draw any firm conclusions about interaction from Samuel's restoration study we would need to establish both that adaptation was a genuinely prelexical effect, and that the pattern of adaptation observed following restoration was identical to that produced by phonemic adaptation. Neither of these points has been properly established.
Overall, the Race model fares well in explaining why phoneme identification should be easier in words than nonwords. However, some recently reported studies also show lexical effects in the processing of nonwords, which present severe problems for the Race model.
Data from Newman, Sawusch and Luce (1997) from the phonetic categorization task may
present a challenge to the Race model. Their study employed a variant on the Ganong
effect. Instead of comparing word-nonword continua, they examined nonword-nonword
continua, where the nonwords at each continuum endpoint varied in their similarity to real
words. For example, the continuum gice-kice, where gice has more lexical
neighbors than kice, was compared with the continuum gipe-kipe, where the
opposite endpoint, kipe, had the higher neighborhood density. Newman et al. (1997)
found that there were more responses in the ambiguous region of the continuum consistent
with the endpoint nonword with a denser lexical neighborhood (i.e. more /g/
responses to gice-kice and more /k/ responses to gipe-kipe). According to
the Race model there should be no lexical involvement in these nonword decisions.
However, as Vitevitch and Luce (1998) point out, there is a high positive correlation between lexical neighborhood density and sequential probability: Common sequences of phonemes will tend to occur in many words, so nonwords with dense lexical neighborhoods will tend to contain high probability sequences. The results of both Pitt and McQueen (1998) and Vitevitch and Luce (1998) suggest that the prelexical level of processing is sensitive to sequential probabilities (see section 4.3). It is therefore possible that Newman et al.'s results may reflect this prelexical sensitivity. If so, they would not be inconsistent with the Race model. It remains to be established whether the apparent lexical involvement in nonword decisions observed by Newman et al. (1997) is due to the effects of lexical neighborhood density or to a prelexical sensitivity to sequential probability. Only in the former case would these results then pose a problem for the Race model. The results would, however, be compatible with TRACE whichever level of processing proves responsible for the effect.
Connine, Titone, Deelman and Blasko (1997) have shown that monitoring for phonemes occurring at the end of nonword targets is faster the more similar the nonwords are to real words. In their experiment, nonwords were derived from real words by altering the initial phonemes of those words by either one feature on average or by six features on average (creating, for example, gabinet and mabinet from cabinet). Monitoring latencies were faster in these derived nonwords than in control nonwords, and the single-feature-change nonwords led to faster responses than the multi-feature-change nonwords. Responses to targets in all nonwords were slower than those to targets in the real words from which they were derived. According to the Race model, the lexical route should not operate at all for targets in nonwords, so monitoring latencies should be unaffected by the similarity of the nonwords to real words. This means that the first-past-the-post Race model, in which responses must be determined by either the phonemic or the lexical route, can no longer be sustained. Interactive models like TRACE, however, predict increasing top-down involvement in nonwords, and hence faster monitoring responses, the more similar nonwords are to words.
Wurm and Samuel (1997) have also shown that phoneme monitoring is faster in nonwords which are more like real words than in nonwords which are less like real words. It is important, however, to point out that this effect is not the same as that found by Connine et al. (1997). In the latter study, the words and nonwords all shared the same target phoneme (e.g. the /t/ in cabinet, gabinet and mabinet). In Wurm and Samuel's (1997) study, however, as in the study by Frauenfelder et al. (1990) on which it was based, the nonwords and the words on which they were based were designed to differ on the crucial target phoneme, that is, the target in the nonwords did not occur in the base word (e.g. the /t/ in both vocabutary and socabutary mismatches with the /l/ in vocabulary). In these cases, therefore, the lexical information specifying an /l/ could not possibly make detection of the /t/ easier; it could only hinder detection of the /t/ (but, as discussed in section 4.2, both Frauenfelder et al. 1990 and Wurm & Samuel 1997 failed to find this inhibition). The facilitation which Wurm and Samuel (1997) found (e.g. faster /t/ responses in vocabutary than in socabutary) is thus different from Connine et al.'s (1997) finding, and is probably due, as argued by Wurm and Samuel (1997), to an attentional effect which makes more word-like strings easier to process.
A study by Marslen-Wilson and Warren (1994) is particularly important because it provides evidence against both TRACE and the Race model. This study, based on earlier work by Streeter and Nigro (1979) and Whalen (1984; 1991), examined subcategorical phonetic mismatch (Whalen 1984), and the differential effects of that mismatch in words and nonwords. Streeter and Nigro cross-spliced the initial CV from a word like faded with the final syllable of a word like fable. The cross-splice creates conflicting phonetic cues to the identity of the medial consonant /b/; the transition from the first vowel provides cues appropriate to /d/ rather than /b/. A parallel set of stimuli was constructed from nonwords. Interestingly, the phonetic mismatch slowed auditory lexical decisions to the nonword stimuli, but not to the word stimuli. A similar interaction between phonetic mismatch and lexical status was found by Whalen (1991). The design of Marslen-Wilson and Warren's study is rather more complicated, but it will be described in some detail since this will be essential for understanding the simulations presented later.
The critical stimuli used in their experiments were based on matched pairs of words and nonwords like job and smob. Three experimental versions of these stimuli were constructed from each word and nonword by cross-splicing different initial portions of words and nonwords (up to and including the vowels) onto the final consonants of each critical item. These initial portions could either be from another token of the same word/nonword, from another word (jog or smog), or from another nonword (jod or smod). The design of the materials is shown in Table 1, which is based on Table 1 from Marslen-Wilson and Warren (1994). Marslen-Wilson and Warren performed experiments on these materials using lexical decision, gating and phonetic categorization tasks. The important data come from the lexical decision and phonetic categorization experiments using materials where the critical final phonemes were voiced stops. In both of these tasks the effect of the cross-splice on nonwords was much greater when the spliced material came from a word (W2N1) than a nonword (N3N1) whereas the lexical status of the source of the cross-spliced material (W2W1 vs N3W1) had very little effect for words. Within cross-spliced nonwords therefore, there was an inhibitory lexical effect: performance was poorer when the cross-spliced material in the nonword came from a word than when it came from another nonword.
Table 1. Experimental Conditions in Marslen-Wilson and
Warren (1994) and McQueen, Norris and Cutler (in press).
Item type                      Notation   Example

Word (job)
  1. Word 1 + Word 1           W1W1       job + job
  2. Word 2 + Word 1           W2W1       jog + job
  3. Nonword 3 + Word 1        N3W1       jod + job

Nonword (smob)
  1. Nonword 1 + Nonword 1     N1N1       smob + smob
  2. Word 2 + Nonword 1        W2N1       smog + smob
  3. Nonword 3 + Nonword 1     N3N1       smod + smob

Note. Items were constructed by splicing the initial portion of the first item (up to and including the vowel) onto the final consonant of the second item.
The implication of this result for the Race model should be clear. Phonemic decisions about nonword input can only be driven by the prelexical route. They should therefore be unaffected by the lexical status of the items from which the cross-spliced material is derived. But these results are also problematic for TRACE. Marslen-Wilson and Warren (1994) simulated their experiments in TRACE. They showed that the TRACE simulations deviated from the data in a number of important respects. The primary problem that they found was that TRACE predicted a difference between the cross-spliced words (poorer performance on W2W1 than on N3W1) which was absent in the human data. TRACE also overestimated the size of the inhibitory lexical effect in the cross-spliced nonwords.
McQueen, Norris and Cutler (in press) reported four experiments using the same design as Marslen-Wilson and Warren (1994). Although these experiments were conducted in Dutch, the materials were modelled closely on those used by Marslen-Wilson and Warren. McQueen et al. found that the interaction between lexical status and the inhibitory effect of competitors could be altered by subtle variations in experimental procedure. When they used a lexical decision task to emphasize lexical processing, there was a clear and reliable mismatch effect which interacted with the lexical status of the cross-spliced portions in nonwords but not in words, just as in Marslen-Wilson and Warren (1994). However, in experiments using phoneme monitoring and phonetic categorization, respectively, McQueen et al. failed to replicate the mismatch effects. McQueen et al. noted that there were two differences between their categorization experiment and that of Marslen-Wilson and Warren. First, Marslen-Wilson and Warren had varied the assignment of responses to left and right hands from trial to trial, whereas McQueen et al. had kept the assignment constant throughout the experiment. A second difference was that McQueen et al. used only unvoiced stops (/p,t,k/) as final segments whereas Marslen-Wilson and Warren had used both voiced and unvoiced stops. Both of these differences would have made the task harder in Marslen-Wilson and Warren's study. McQueen et al. therefore ran a further experiment in which they incorporated a wider range of targets and varied the response hand assignment. Under these conditions, McQueen et al. were able to produce an inhibitory lexical effect in cross-spliced nonwords. As in Marslen-Wilson and Warren (1994), responses to targets in W2N1 nonwords were slower than responses to targets in N3N1 nonwords.
The inhibitory effect of lexical competitors on phonemic decisions in nonwords therefore follows a similar pattern to the facilitatory effects of lexical status seen in phoneme monitoring. Cutler et al. (1987), for example, showed that the size of the facilitatory lexical effect (faster responses to targets in words than in nonwords) can be modulated by task demands. McQueen et al. have shown that inhibitory lexical effects in cross-spliced nonwords can be modulated in a similar manner. As pointed out earlier, the variability of lexical involvement in phonemic decision-making is problematic for TRACE. For the Race model, the problem is that lexical effects appear at all in phoneme decisions to targets in nonwords, however variable those effects may be.
This review demonstrates that neither TRACE nor the Race model is now tenable. TRACE is challenged by the findings showing variability in lexical effects, by the lack of inhibitory effects in nonwords in Frauenfelder et al. (1990), by the latest data on compensation for coarticulation, and by the data on subcategorical mismatch (Marslen-Wilson & Warren 1994; McQueen et al. in press). Marslen-Wilson and Warren explicitly attempted to simulate their results in TRACE, without success. The Race model, similarly, is challenged by the demonstrations of lexical involvement in phonemic decisions on nonwords. Three recent studies (Connine et al. 1997; Marslen-Wilson & Warren 1994; McQueen et al. in press) all show lexical effects in decisions made to segments in nonwords, and a fourth (Newman et al. 1997) may do so as well. Such effects are incompatible with the Race model's architecture whereby the lexical route can only influence decisions to segments in words.
There would thus appear to be no available theory that can give a full account of the known empirical findings in phonemic decision making. In the following section we will show, however, that it is possible to account for the data in a model which reflects the current state of knowledge about prelexical processing in spoken-word recognition, and we will further demonstrate that our proposed new model successfully accounts for a wide range of results. Moreover, we will argue that this new model -- the Merge Model -- remains faithful to the basic principles of autonomy.
The models which we have contrasted above represent extreme positions with regard to the relationship between lexical and prelexical processing in phonemic processing. TRACE has an architecture in which the lexical level is directly linked via hardwired connections to the prelexical level, and responses must be susceptible to whatever lexical information is available. The Race model has an architecture in which responses via the lexical or the prelexical level are completely independent. Both architectures have now been found wanting.
What is required is, instead, a model in which lexical and prelexical information can jointly determine phonemic identification responses. Such a model must be able to allow for variability in the availability of lexical information, such that responses to a given phoneme in a given input may be susceptible or not to lexical influence, as a function of other factors. In other words, the model must not mandate lexical influence by fixed connections from the lexical to the prelexical level; but neither must it avoid all possibility of lexical and prelexical co-determination of responses by making lexical information only available via successful word recognition. The model must moreover capture this variability in just such a way as to be able to predict how strong lexical influences should be in different situations.
The required class of models consists of those in which the outputs of processes which are themselves fully autonomous can be integrated to determine a response. Such models have been proposed in several areas. A general model of perception incorporating such an approach, for instance, is the FLMP (Massaro 1987; 1997), in which multiple sources of information are simultaneously but independently evaluated, and continuous truth values are assigned to each source of information as a result of the evaluation process. Specifically with respect to lexical and prelexical processing, an example of an integration model is Norris' (1994a) model of the transformation of spelling to sound in the pronunciation of written words, or the Activation Verification Model of reading (Paap et al. 1982).
Applied to the issue of phonemic decision-making, the integration approach allows prelexical processing to proceed independently of lexical processing but allows the two processes to provide information which can be merged at the decision stage. In the Merge model, prelexical processing provides continuous information (in a strictly bottom-up fashion) to the lexical level, allowing activation of compatible lexical candidates. At the same time, this information is available for explicit phonemic decision-making. The decision stage, however, also continuously accepts input from the lexical level, and can merge the two sources of information. Specifically, activation from the nodes at both the phoneme level and the lexical level is fed into a set of phoneme-decision units responsible for deciding which phonemes are actually present in the input. These phoneme decision units are thus directly susceptible to facilitatory influences from the lexicon, and by virtue of competition between decision units, to inhibitory effects also.
In Merge there are no inhibitory connections between phoneme nodes at the prelexical level. The absence of inhibitory connections between phonemes at this level is essential in a bottom-up system. Inhibition at this level would have the effect of producing categorical decisions which would be difficult for other levels to overturn; information vital for the optimal selection of a lexical candidate could be lost. If a phoneme is genuinely ambiguous, that ambiguity should be preserved to ensure that the word that most closely matches the input can be selected at the lexical level. For example, if phoneme is pronounced as /?onim/ where /?/ is slightly nearer a /v/ than /f/, then inhibition between phonemes would leave the representation /vonim/ to be matched against the lexicon. This would be a much worse match to phoneme than a representation which preserved the input as giving partial support to both /v/ and /f/. There is, however, between-unit inhibition at the lexical level and in the decision units. The lexical-level inhibition is required to model spoken-word recognition correctly, and the decision-level inhibition is needed for the model to reach unambiguous phoneme decisions when the task demands them.
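The cost of an early categorical decision in the /?onim/ example can be illustrated with a minimal sketch. The support values, the mean-support match score, and the function names are assumptions chosen for illustration; they are not the matching procedure of Merge or Shortlist.

```python
def lexical_match(word, support_by_position):
    """Mean bottom-up support for a word's phonemes, position by position."""
    total = sum(
        support.get(phoneme, 0.0)
        for phoneme, support in zip(word, support_by_position)
    )
    return total / len(word)


def categorize(support_by_position):
    """Winner-take-all at each position, as strong prelexical inhibition would give."""
    return [{max(support, key=support.get): 1.0} for support in support_by_position]


# Input /?onim/: the first segment is slightly nearer /v/ than /f/ (assumed values).
graded_input = [
    {"v": 0.55, "f": 0.45},
    {"o": 1.0},
    {"n": 1.0},
    {"i": 1.0},
    {"m": 1.0},
]
phoneme_word = ["f", "o", "n", "i", "m"]

# Graded support keeps partial evidence for /f/ available to the lexicon.
graded_match = lexical_match(phoneme_word, graded_input)

# After a categorical decision the input is /vonim/ and the /f/ evidence is lost.
categorical_match = lexical_match(phoneme_word, categorize(graded_input))
```

Under these assumptions the word phoneme matches the graded input better than it matches the categorized input, which is the loss of information the text describes.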
This need to have between-unit inhibition at the decision level, but not at the level of perceptual processing itself, is in itself an important motivation for the Merge architecture. Perceptual processing and decision making have different requirements and therefore cannot be performed effectively by the same units. So even if the question of interaction were not an issue, any account of phonemic decision making should separate phonemic decision from phonemic processing. The structure of Merge thus seems essential for maximum efficiency.
The empirical demonstration of lexical effects on nonwords does not constitute a problem for the Merge model, since on this account nonwords activate lexical representations to the extent that they are similar to existing words; this additional activation can facilitate phoneme detection. In the Merge model it is not necessary to wait (as in the Race model) until one of the two routes has produced a clear answer, since the output of those routes is continuously combined. However it is also not necessary to compromise the integrity of prelexical processing (as in TRACE) by allowing it to be influenced by lexical information. Further, Merge is prevented from hallucinating by the incorporation of a bottom-up priority rule. This rule, following the suggestion of Carpenter and Grossberg (1987), prevents decision nodes from becoming active at all in the absence of bottom-up support, thus ensuring that phonemic decisions are never based on lexical information alone. The Merge model therefore allows responses to be sensitive to both prelexical and lexical information, but preserves the essential feature of autonomous models -- independence of prelexical processing from direct higher-level influence.
One might argue that the addition of phoneme decision nodes opens the Merge model to attack from Occam's razor. Why have separate representation of the same information at both the prelexical and the decision levels; is this not unnecessary multiplication of entities? The addition of phoneme decision nodes certainly makes Merge more complex than the Race model, but their addition is necessary in order to account for the data which the Race model cannot explain. As we have already argued, bottom-up flow of information from prelexical to lexical levels is logically required for the process of spoken word recognition; and further, the decision units, which should have between-unit inhibition, must be separate from the prelexical units, which should have no inhibition. Merge's decision nodes, and the connections from both prelexical and lexical levels to these nodes, are the simplest additions to the logically demanded structures which allow one to describe the available data adequately. The Merge architecture is thus bottom-up and also optimally efficient.
In order to study the behavior of the Merge model, we constructed a simple network
that could be readily manipulated and understood.
The Merge network is a simple competition-activation network with the same basic dynamics
as Shortlist (Norris 1994b). The network has no intrinsic noise as we are interested in
modelling RT in tasks that are performed with high levels of accuracy. We acknowledge that
a noise-free network is unlikely to be suitable for modelling choice behaviour with lower
levels of accuracy, but this is independent of the main architectural issues being dealt
with here (McClelland 1991). As Figure 1 shows, both the word and
phoneme nodes are connected by facilitatory links to the appropriate decision nodes. But
there is no feedback from the word nodes to the prelexical phoneme nodes. In the
simulations of subcategorical mismatch, the network was handcrafted with only 14 nodes:
six input phoneme nodes corresponding to /j/, /ɒ/, /g/, /b/, /v/ and /z/,
four phoneme decision nodes, and four possible word nodes: job, jog, jov and joz.
The latter two word nodes simply represent notional words ending in phonemes
unrelated to either /b/ or /g/.
The basic architecture is shown, together with the connectivity patterns for the node types used in the simulations. Activation spreads from the input nodes to the lexical nodes and to the phoneme decision nodes, and from the lexical nodes to the phoneme decision nodes; inhibitory competition operates at the lexical and phoneme decision levels. Excitatory connections, shown with bold lines and arrows, are unidirectional; inhibitory connections, shown with fine lines and closed circles, are bi-directional.
Figure 1: The Merge model.
The input in these simulations is job. The different experimental conditions
are created by varying the set of word nodes that are enabled. By enabling or disabling
the word node for job the same input can be made to represent either a word or a
nonword, and by altering whether jog is enabled the same stimulus can be given a
lexical competitor or not.
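The flow of activation just described can be sketched as a toy network. This is only an illustration under stated assumptions: the update rule, the parameter values (rate, lexical_weight, inhibition, cycles), the input values and the function name run_merge are all hypothetical and are not the parameters of the simulations reported here. Only support for the final consonant is varied; the first two phonemes are assumed to match every word node fully.

```python
def run_merge(final_support, enabled_words, cycles=10,
              rate=0.1, lexical_weight=0.5, inhibition=0.5):
    """Toy Merge-style network; all parameter values are illustrative assumptions.

    final_support: bottom-up support for each candidate final phoneme.
    enabled_words: the word nodes that exist in this condition.
    """
    finals = {"job": "b", "jog": "g", "jov": "v", "joz": "z"}
    words = {w: 0.0 for w in enabled_words}
    decisions = {p: 0.0 for p in final_support}

    def clip(x):
        return max(0.0, min(1.0, x))

    for _ in range(cycles):
        # Word nodes: bottom-up excitation and between-word inhibition only.
        new_words = {}
        for w in words:
            bottom_up = (2.0 + final_support[finals[w]]) / 3.0
            rivals = sum(a for other, a in words.items() if other != w)
            net = bottom_up - inhibition * rivals
            new_words[w] = clip(words[w] + rate * (net - words[w]))
        # Decision nodes: merge prelexical and lexical input, then compete.
        new_decisions = {}
        for p in decisions:
            if final_support[p] <= 0.0:
                # Bottom-up priority: no activation without bottom-up support.
                new_decisions[p] = 0.0
                continue
            lexical = sum(a for w, a in words.items() if finals[w] == p)
            rivals = sum(a for other, a in decisions.items() if other != p)
            net = final_support[p] + lexical_weight * lexical - inhibition * rivals
            new_decisions[p] = clip(decisions[p] + rate * (net - decisions[p]))
        words, decisions = new_words, new_decisions
    # final_support is never written to: nothing feeds back to the prelexical level.
    return words, decisions


# The same clean input treated as a word (job enabled) or as a nonword.
clean_job = {"b": 1.0, "g": 0.0, "v": 0.0, "z": 0.0}
_, as_word = run_merge(clean_job, {"job", "jog"})
_, as_nonword = run_merge(clean_job, {"jov", "joz"})

# A hypothetical cross-spliced input carrying some cues to /g/.
spliced = {"b": 0.7, "g": 0.3, "v": 0.0, "z": 0.0}
_, w2n1 = run_merge(spliced, {"jog", "joz"})
_, n3n1 = run_merge(spliced, {"jov", "joz"})
```

With these assumed values the /b/ decision node is more active when job is a word than when it is not, and less active in the mismatching nonword when jog is enabled than when it is not. Both differences arise at the decision nodes; the prelexical input is identical across conditions.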
We will present only the activation levels of the units in the network, rather than attempting to model explicitly the mapping of these activations onto response latencies. A problem that arises in simulating reaction-time data in models like Shortlist or TRACE is that activation levels can never be translated directly into latencies. The simplest solution is to assume that responses are made whenever activations pass a given threshold. In presenting their simulations, Marslen-Wilson and Warren (1994) plotted response probabilities derived from the Luce choice rule (Luce 1959) rather than the underlying activations. Response times can then be derived by thresholding these probabilities. But there are a number of problems associated with the use of the Luce choice probabilities in the decision mechanism of a model of continuous speech recognition. One simple problem is that the Luce rule introduces a second `competition' process into the recognition process in addition to the inhibitory competition already present. In the Luce calculations, other active candidates compete with the most active candidate and reduce its response probability. The extent of this competition is dependent on the exponent used in the Luce choice rule. As the exponent increases (this is equivalent to decreasing the amount of noise in the system), so accuracy is increased and the influence of competitors decreases. In Marslen-Wilson and Warren's (1994) TRACE simulations of lexical decision, the error rates for W1W1 stimuli, for example (in Figure 12, p. 669), are about ten times greater than in the human data. A more appropriate choice of exponent would have greatly reduced the competition effect introduced by the Luce rule.
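The dependence of competition on the exponent can be seen in a minimal sketch of the Luce choice rule. The exponentially scaled form used here, the activation values and the function name are assumptions for illustration; they are not taken from Marslen-Wilson and Warren's (1994) simulations.

```python
import math


def luce_choice(activations, exponent):
    """Response probabilities from activations via an exponentially scaled Luce rule."""
    strengths = {name: math.exp(exponent * a) for name, a in activations.items()}
    total = sum(strengths.values())
    return {name: s / total for name, s in strengths.items()}


# Hypothetical activations for a target and a single competitor.
activations = {"job": 0.6, "jog": 0.4}

low_exponent = luce_choice(activations, exponent=2.0)
high_exponent = luce_choice(activations, exponent=20.0)
```

With the larger exponent the more active candidate receives a response probability much closer to 1, so the competitor has less influence on the outcome.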
The tasks we are simulating are performed with a high level of accuracy that varies little between conditions. The crucial effects are all reflected in differences in latency rather than accuracy. A simple response threshold, therefore, provides the most straightforward means of deriving a response from variations in activation level, without the additional complexity and assumptions of the Luce rule. Note that although it is easy to make the long term average behavior of a connectionist network follow the Luce choice rule by adding noise and then selecting the node with the largest activation (Page in press; McClelland 1991), there is no such obvious way to derive response probabilities directly from network activations on a single trial. Probabilities can be calculated and used to determine latency (cf. Massaro & Cohen 1991) but this would add a very complicated additional mechanism to any connectionist model. Furthermore, as Shortlist simulations presented in Norris (1994b) show, activations for words in continuous speech often rise transiently to quite high levels before being suppressed by other candidates. In a complete account of the decision process, activation should therefore be integrated over time. However, for present purposes, the simple threshold can serve as a measure of recognition point, or of YES responses in lexical decision. Negative responses in lexical decision are slightly more problematic. Here we adopt the procedure proposed by Grainger and Jacobs (1996) for visual lexical decision. Grainger and Jacobs suggest that NO responses are made after some deadline has elapsed, but that the deadline is extended in proportion to the overall level of lexical activity. Ideally we would have a quantitative implementation of the decision process so that we could fit the complete model directly to the data. However, the decision component of the model would, in itself, require several extra parameters. 
We therefore present only the activations from the network and show that, given the assumption that decisions are made when activations pass a threshold (or, in the case of NO responses, when a deadline is reached), the patterns are qualitatively consistent with the experimental data.
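The two decision assumptions just described can be sketched as follows. This is only an illustration of the logic, not a fitted decision component: the function names, the base deadline, and the gain parameter are hypothetical, and the proportional form of the deadline extension is one simple reading of the Grainger and Jacobs (1996) proposal.

```python
def yes_response_slice(activation_by_slice, threshold):
    """Return the first time slice (counting from 1) at which activation reaches threshold."""
    for slice_number, activation in enumerate(activation_by_slice, start=1):
        if activation >= threshold:
            return slice_number
    return None  # threshold never reached


def no_response_deadline(base_deadline, total_lexical_activity, gain):
    """Deadline for a NO response, extended in proportion to overall lexical activity.

    base_deadline and gain are hypothetical free parameters.
    """
    return base_deadline + gain * total_lexical_activity


# Hypothetical activation trace for a word node, with a YES threshold of 0.2.
print(yes_response_slice([0.05, 0.12, 0.21, 0.24], threshold=0.2))

# More lexical activity yields a later deadline and hence a slower NO response.
print(no_response_deadline(10.0, 0.05, gain=5.0))
print(no_response_deadline(10.0, 0.30, gain=5.0))
```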
We present detailed simulation results from the theoretically most critical sets of data described above: the subcategorical mismatch findings of Marslen-Wilson and Warren (1994) and McQueen et al. (in press), and the phoneme monitoring results of Connine et al. (1997), together with those of Frauenfelder et al. (1990). These studies provide highly constraining data against which to evaluate Merge. The subcategorical mismatch data have already led us to reject the Race model and, according to Marslen-Wilson and Warren's simulations, have eliminated TRACE too. It is thus crucial to establish whether Merge can simulate the effects of mismatch, lexical status, and their interaction observed in those studies, and also the dependency of the inhibitory lexical effect on task demands. Likewise, the data of Connine et al. (1997) are important to simulate since they also led us to reject the Race model. This simulation will allow us to establish whether Merge can provide a bottom-up account for graded lexical effects in nonwords. The results of Frauenfelder et al. (1990) are similarly theoretically important because they offer a crucial challenge to TRACE and allow comparison of facilitatory and inhibitory effects in phoneme monitoring.
As stated, the input was always job. In these simulations, only two of the word nodes were ever enabled in a given condition. In the W1W1 and the W2W1 conditions the nodes jog and job were enabled. In the N3W1 condition the nodes jov and job were enabled. For both N1N1 and N3N1, jov and joz were enabled, and for W2N1 jog and joz were enabled. The nodes jov and joz acted to balance the overall potential for lexical competition in the different simulations. They reflect the fact that, in addition to the matched experimental words, the language will usually contain other words beginning with the same two phonemes.
Input to the network consisted of four vectors representing the total input for each phoneme node for each time slice. Under normal, unspliced conditions, the input to each phoneme built up over three time slices (0.25, 0.5, 1.0). It then remained at its peak value through the remaining time slices. The first phoneme began at time slice 1, the second at time slice 4, and the third at time slice 7. This form of input is analogous to that used in Shortlist, where each phoneme produces the same bottom-up activation regardless of its position in the input sequence. Experiments were also carried out in which activation reached a peak and then decayed symmetrically. However, although this kind of input required different parameters, it made little difference to the qualitative nature of the simulations.
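The unspliced input scheme can be written out as a short sketch. The ramp values and onset slices are those given above; the function name, the phoneme labels, and the total number of time slices are assumptions made only for illustration.

```python
RAMP = (0.25, 0.5, 1.0)  # build-up over three time slices, then held at peak


def phoneme_input(onset_slice, n_slices, ramp=RAMP):
    """Input to one phoneme node for time slices 1..n_slices (unspliced case)."""
    values = []
    for t in range(1, n_slices + 1):
        if t < onset_slice:
            values.append(0.0)  # phoneme has not started yet
        else:
            step = t - onset_slice
            values.append(ramp[step] if step < len(ramp) else ramp[-1])
    return values


# Onsets at slices 1, 4 and 7; the labels are illustrative stand-ins for
# the three phonemes of job.
onsets = {"j": 1, "o": 4, "b": 7}
n_slices = 12  # assumed length of the input, not stated in the text

inputs = {phoneme: phoneme_input(onset, n_slices) for phoneme, onset in onsets.items()}
for phoneme, vector in inputs.items():
    print(phoneme, vector)
```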
For the cross-splice conditions we assumed that the total support from the competing /b/ and /g/ phonemes remained 1.0. At slice 7 the input for /g/ was 0.15, where it stayed for the remainder of the input. The input for /b/ had 0.15 subtracted from all slices from 7 onwards. So, according to this scheme, a /b/ in a cross-spliced condition reached a final activation value of only 0.85 instead of 1.0 while there was an input of 0.15 to the competing /g/ phoneme from the cross-splice onwards. The main aspect of the input representations that alters the outcome of the simulations is the magnitude of the support for the competing phoneme in the cross-splice condition. If the cross-splice support for the /g/ is weighted more heavily relative to the /b/ (e.g. 0.25 vs 0.75), the simulations will be more likely to show equivalent effects of the splice for words and nonwords across the board. With too small a cross-splice, the effects will disappear altogether. However, there is a large range of parameters in between where the general behavior of the model remains similar.
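The cross-splice manipulation of the final phoneme can be sketched in the same way. The values 0.15 and slice 7 come from the description above; the function name and the number of time slices are assumptions, and the code follows the literal statement that 0.15 is subtracted from the /b/ input on every slice from 7 onwards.

```python
def cross_spliced_inputs(n_slices, splice_slice=7, competitor_support=0.15):
    """Return (b_input, g_input) for time slices 1..n_slices in a cross-spliced item."""
    ramp = (0.25, 0.5, 1.0)
    b_input, g_input = [], []
    for t in range(1, n_slices + 1):
        if t < splice_slice:
            b_input.append(0.0)
            g_input.append(0.0)
        else:
            step = t - splice_slice
            unspliced = ramp[step] if step < len(ramp) else ramp[-1]
            # Support removed from /b/ is given to the competing /g/.
            b_input.append(unspliced - competitor_support)
            g_input.append(competitor_support)
    return b_input, g_input


b_input, g_input = cross_spliced_inputs(n_slices=12)  # 12 slices is an assumed length
print([round(v, 2) for v in b_input])
print([round(v, 2) for v in g_input])
```

Raising competitor_support (for example to 0.25) corresponds to the heavier weighting of the competing phoneme discussed above.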
As noted above, any effective decision mechanism operating on the lexical output of Shortlist would need to perform some integration over time. However, although we did not introduce an integrating decision mechanism at the lexical level in Merge, we did find it necessary to add some integration, or momentum, at the phoneme decision nodes. The decision nodes were run in the Shortlist-style reset-mode (Norris, McQueen & Cutler 1995). That is, the activation levels for these nodes were reset to zero at the start of each new time slice. On its own this generally leads to a rapid and almost complete recovery from the effect of the cross-splice. In order to make the decision process sensitive to the activation history, some proportion of the final level of activation on each slice was fed in along with the input to the next slice. This `momentum' term controlled the extent to which the decision nodes integrated their input over time. In the simulations both the word and decision levels cycled through 15 iterations of the network for each time-slice of input. However, because the input phoneme units had no between-phoneme inhibition this level did not cycle. Phoneme activations for each slice were calculated by multiplying the previous activation by the phoneme level decay and then adding in the new input.
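The bookkeeping described in this paragraph can be sketched as follows. The phoneme-level update follows the description directly (previous activation multiplied by the decay parameter, plus the new input). For the decision nodes, only the reset and momentum logic is taken from the text; the within-slice update rule placeholder_settle is a stand-in and is not the Merge activation equation, and all parameter values are hypothetical.

```python
def phoneme_activations(inputs, decay):
    """Input phoneme level: previous activation times decay, plus new input."""
    activation = 0.0
    trace = []
    for new_input in inputs:
        activation = activation * decay + new_input
        trace.append(activation)
    return trace


def placeholder_settle(activation, net_input, rate=0.2):
    """Stand-in within-slice update (NOT the Merge equations): move toward net input."""
    return activation + rate * (net_input - activation)


def decision_node_trace(inputs, momentum, settle, cycles=15):
    """Decision node run in reset mode, with a momentum term carried across slices."""
    previous_final = 0.0
    trace = []
    for new_input in inputs:
        activation = 0.0  # reset to zero at the start of each time slice
        net_input = new_input + momentum * previous_final  # activation history fed in
        for _ in range(cycles):  # 15 iterations per time slice
            activation = settle(activation, net_input)
        previous_final = activation
        trace.append(activation)
    return trace


# Hypothetical input that is present on one slice and absent on the next.
print(decision_node_trace([1.0, 0.0], momentum=0.0, settle=placeholder_settle))
print(decision_node_trace([1.0, 0.0], momentum=0.5, settle=placeholder_settle))
```

With momentum set to zero the node forgets the earlier input entirely on the second slice, which is the rapid recovery noted above; a nonzero momentum term keeps part of the earlier activation in play.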
Table 2. The time (time-slice numbers estimated by interpolation from Figures 2 and 3) at which nodes attained the criterial threshold of 0.2 (lexical node for YES lexical decisions) or 0.4 (phonemic decision node for phonemic decisions) in each condition in the Merge model simulations of the subcategorical mismatch data. The RT data (mean RT in ms), from both Marslen-Wilson and Warren (1994, MWW, Expts. 1 and 3, voiced stops) and McQueen, Norris and Cutler (in press, MNC, Expts. 3 and 4), are also shown.
Lexical Decision
Condition          Lexical node            MWW        MNC
                   threshold reached at    Expt. 1    Expt. 3
Word      W1W1     7.7                     487        340
          W2W1     9.7                     609        478
          N3W1     9.4                     610        470
Phonetic Decision
Condition          Phoneme decision node   MWW        MNC
                   threshold reached at    Expt. 1    Expt. 3
Word      W1W1     8.4                     497        668
          W2W1     10.4                    610        804
          N3W1     10.4                    588        802
Nonword   N1N1     8.8                     521        706
          W2N1     11.8                    654        821
          N3N1     10.7                    590        794
Note. No data are given for the NO lexical decisions to nonwords, since these decisions are based not on activation thresholds, but on a response deadline being reached (see text for details).
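Fractional time-slice values of the kind reported in Table 2 can be obtained by interpolating between the two slices that straddle the threshold. The sketch below uses linear interpolation, which is an assumption, since the exact interpolation procedure is not specified; the function name and the activation trace are hypothetical.

```python
def crossing_time(activations, threshold):
    """Interpolated time slice at which activation first reaches threshold.

    activations[i] is the activation at time slice i + 1.
    Returns None if the threshold is never reached.
    """
    for i, activation in enumerate(activations):
        if activation >= threshold:
            if i == 0:
                return 1.0  # already at threshold on the first slice
            previous = activations[i - 1]
            fraction = (threshold - previous) / (activation - previous)
            return i + fraction  # slice i is the last slice below threshold
    return None


# Hypothetical trace: below 0.2 at slice 2, above it at slice 3.
print(crossing_time([0.0, 0.1, 0.3], threshold=0.2))
```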
Figure 2. Simulation of lexical decisions in the subcategorical mismatch experiments. In all cases, the labels refer to the conditions used in those experiments, as shown in Table 1. Figure 2a shows the activation levels for lexical nodes given un-spliced job as input. ``W1W1'' shows the activation of the job-node when it was switched on as a possible word, and ``W1W1 comp'' shows the activation of the node for the lexical competitor jog which was also switched on in this simulation. ``N1N1'' shows the activation of the jov-node when neither job nor jog were words. Figure 2b shows the activation levels for lexical nodes given cross-spliced job as input (i.e. with information in the vowel consistent with a following /g/). ``W2W1'' and ``W2W1 comp'' show the activation levels of the job- and jog-nodes, respectively, when both were switched on as words. ``N3W1'' shows the activation of the job-node when job was a word and jog was a nonword. ``N3W1 comp'' thus shows the activation of the other activated word in this condition, that of jov. ``W2N1'' shows the activation of the jog-node when jog was switched on as a word, but job was not. Finally, ``N3N1'' shows the activation of the jov-node when neither job nor jog were words.
Figure 3. Simulation of phonemic decisions in the subcategorical mismatch experiments. In all cases, the labels refer to the conditions used in those experiments, as shown in Table 1. Figure 3a shows the activation levels for the /b/ and /g/ phoneme decision nodes given un-spliced job as input. ``W1W1'' shows the activation of the /b/-node when job was switched on as a possible word in the lexicon, and ``W1W1 comp'' shows the activation of the /g/-node, corresponding to the lexical competitor jog, which was also switched on in this simulation. ``N1N1'' and ``N1N1 comp'' show the activations of the /b/- and /g/-nodes, respectively, when neither job nor jog were words. Figure 3b shows the activation levels for /b/ and /g/ phoneme decision nodes given cross-spliced job as input (i.e. with information in the vowel consistent with a following /g/). ``W2W1'' and ``W2W1 comp'' show the activations of the /b/- and /g/-nodes, respectively, when both job and jog were switched on as words. ``N3W1'' shows the activation of the /b/-node when job was a word and jog was a nonword. ``N3W1 comp'' shows the activation of the /g/-node in this condition. ``W2N1'' shows the activation of the /b/-node when jog was switched on as a word, but job was not, while ``W2N1 comp'' shows the activation of the /g/-node in this condition. Finally, ``N3N1'' and ``N3N1 comp'' show the activation levels of the /b/- and /g/-nodes when neither job nor jog were words.
The set of parameters which produced the simulations shown in Figures 2 and 3 is listed in the Appendix. The most difficult aspect of the simulations was to find parameters that would give a good account of the phonetic categorization data. The basic pattern for the lexical decision data was robust and could be reproduced by a wide range of parameters. Correct adjustment of the momentum term in the decision units proved to be critical for these simulations. Note that the simulations never produced the opposite pattern from that observed in the data, but often the phonetic categorization responses to N3N1 and W2N1 did not differ significantly (as in McQueen et al.'s experimental data).
Figure 2 provides simulations of the lexical decision data. Figure 2a shows the activation functions for lexical nodes given unspliced job as input; Figure 2b shows lexical activation given cross-spliced job as input (i.e. a token containing information in the vowel specifying a /g/ instead of a /b/). In unspliced W1W1, job is a word in the Merge lexicon, and its activation rises quickly to an asymptote near 0.25. If we assume a response threshold of 0.2, lexical decisions should be faster in this condition than with the cross-spliced words W2W1 and N3W1. This reflects the basic mismatch effect observed for words in the human data, as shown in Table 2. With the same response threshold, as also shown in Table 2, there will be almost no difference in the response times to words in the two cross-spliced conditions, again as in the human data. In the nonword conditions, where job is not enabled as a word in the lexicon, there is effectively no activation of any lexical nodes for both unspliced N1N1 and cross-spliced N3N1; the model thus captures the fact that there was little effect of mismatching information in the nonword data when the cross-splice involved another nonword. In the W2N1 condition, however, the activation of W2 (jog) remains high throughout the target. According to the Grainger and Jacobs (1996) decision rule, the increased activation in the W2N1 case will delay the deadline and lead to slower responding, exactly as seen in the data.
Figure 3 shows the activation functions for the phoneme decision nodes, in simulation of the phonetic categorization data. Figure 3a shows that there is in the unspliced items only a relatively weak lexical effect. Activation of /b/ rises somewhat more rapidly and attains a higher asymptote when job is a word in the Merge lexicon (W1W1), than when it is not a word in the lexicon (N1N1). As shown in Table 2, this small facilitative lexical effect is in line with the experimental data; the effect was significant in Marslen-Wilson and Warren (1994) but not in McQueen et al. (in press). In the cross-spliced phonetic categorization simulations (Figure 3b and Table 2), a threshold set at 0.4 would result in almost no differences between W2W1, N3W1, and N3N1 response times. The activation functions for these conditions are almost identical at this point. But the activation of /b/ reaches 0.4 in all three of these conditions later than in the unspliced conditions (Figure 3a and Table 2); this is the basic mismatch effect observed in phonetic categorization in both words and nonwords. The model therefore correctly predicts no difference between the two types of cross-spliced word; it also correctly predicts an inhibitory lexical effect in the cross-spliced nonwords. The activation of /b/ in W2N1 grows more slowly than the others; this is because of the activation of the /g/ node (W2N1 comp), which receives support from the lexical node for W2 (jog; its activation is plotted in Figure 2b). Thus, only the nonwords show an effect of lexical competition. The model therefore gives an accurate characterization of the competition effects in both the lexical decision and the phonetic categorization tasks, and provides an account of why competition effects are only observed in the case of nonwords.
Given the architecture of the network, any factor that either reduces lexical activation, or the strength of the connections from the lexical to the decision nodes, will clearly reduce the size of the lexical effects. Merge thus copes naturally with data showing that lexical effects vary with changes in the task, both those on subcategorical mismatch (McQueen et al. in press) and the other effects of variability reviewed in section 4.1. When task demands discourage the use of lexical knowledge, decisions will be made on the basis of the prelexical route and lexical activation will simply not contribute to decision node activation.
Although the network might be thought of as permanently connected (as in TRACE), we prefer to view Merge as having the same architecture as Shortlist (Norris 1994b), in which the lexical network is created dynamically as required. This means that the word nodes cannot be permanently connected to the decision nodes. Instead, the connections from the lexical nodes to the phoneme decision nodes must be built on the fly, when the listener is required to make phonemic decisions. In the Merge model, therefore, the demands of the experimental situation determine how the listener chooses to perform the task. If task demands encourage the use of lexical knowledge, connections will be built from both prelexical and lexical levels to the decision nodes. But if the use of lexical knowledge is discouraged, only connections from the prelexical level will be constructed. Task demands could similarly result in decision nodes only being employed when they correspond to possible experimental responses. In a standard phoneme monitoring experiment, with only a single target, there might only be a single decision node. If so, there could never be an inhibitory effect in phoneme monitoring because there could never be competing responses.
Connine et al. (1997) showed that phoneme monitoring responses to final targets in nonwords which were more like words (e.g. /t/ in gabinet, which differs from cabinet only in the voicing feature of the initial stop) were faster than those to targets in nonwords which were less like words (e.g. mabinet, with a larger featural mismatch in initial position), which in turn were faster than those to targets in control nonwords (not close to any real word, e.g. shuffinet). This graded lexical involvement in phoneme monitoring in nonwords cannot be explained by the Race model. We have already seen that Merge can simulate a simple lexical advantage; can it also simulate graded effects?
As most of Connine et al.'s stimuli were several phonemes long we added two more phonemes to the network so that we could simulate processing words and nonwords that were 5 phonemes in length. To simulate Connine et al.'s word condition (cabinet) the lexicon contained a single 5-phoneme word and that word was presented as input. For the multi-feature-change nonword condition (mabinet), the input simply had a different initial phoneme that did not activate the initial phoneme of the word at all. That is, there was no perceptual similarity between the initial phoneme of the word and the multi-feature-change nonword. For the control nonword (shuffinet) the input was actually identical to that used in the word condition, but there were no words in the lexicon. Figure 4 shows the results of this simulation, which uses exactly the same parameters as the previous simulation.
Figure 4. Simulation of Connine et al. (1997). The activation of the phoneme decision node for /b/ is shown in three conditions corresponding to the original study, ``Word'', ``Multi-feature-change Nonword'', and ``Control Nonword''. In all three conditions /b/ was the final phoneme of the input (i.e. the target phoneme).
It can be seen that activation of the final target phoneme rises more slowly in the multi-feature-change nonword than in the word, but most slowly of all in the control nonword. Note that we simulated only the multi-feature-change nonwords, as the positioning of the single-feature-change nonwords (between words and multi-feature-change nonwords) depends almost entirely on how similar we make the input representations of the initial phonemes and the initial phonemes of the words. Note also that the exact amount of lexical benefit in these experiments should further depend on the pattern of lexical competitors. To the extent that nonwords elicit more competitors not sharing the target phoneme, facilitation should decrease.
Because lexical activation is a function of the goodness of match between lexical representations and the input, and because lexical activation is fed continuously to the phoneme decision nodes, Merge can thus explain Connine et al.'s (1997) data. Nonwords which are more like words tend to activate lexical representations more than nonwords which are less like words, and this increases word-node to decision-node excitation, so that phoneme decisions to targets in more word-like nonwords will tend to be faster than those to targets in less word-like nonwords.
We also tested Merge's ability to account for the phoneme monitoring data from Frauenfelder et al. (1990). These findings allow us to examine Merge's ability to simulate both facilitatory lexical effects in phoneme monitoring, and the absence of inhibitory effects when materials contain no subcategorical mismatches. Even though Merge combines lexical and prelexical information, the use of the bottom-up priority rule means that Merge correctly accounts for the results of the Frauenfelder et al. study. Frauenfelder et al. found no difference in phoneme monitor