Below is the unedited penultimate draft of:
Levelt, W.J.M., Roelofs, A., & Meyer, A.S. (19XX). A theory of lexical access in speech production. Behavioral and Brain Sciences, XX (X): XXX-XXX.
The final published draft of the target article, commentaries and Author's Response is currently available only in paper.
For information about subscribing or purchasing offprints of the published version, with commentaries and author's response, write to: journals_subscriptions@cup.org (North America) or journals_marketing@cup.cam.ac.uk (All other countries).

A THEORY OF LEXICAL ACCESS IN SPEECH PRODUCTION

Willem J.M. Levelt, Ardi Roelofs & Antje S. Meyer
Max Planck Institute for Psycholinguistics
P.O. Box 310
6500 AH Nijmegen
The Netherlands
pim@mpi.nl

KEYWORDS

speaking, lexical access, conceptual preparation, lexical selection, morphological encoding, phonological encoding, syllabification, articulation, self-monitoring, lemma, morpheme, phoneme, speech error, magnetic encephalography, readiness potential, brain imaging.

ABSTRACT

Preparing words in speech production is, normally, a fast and accurate process. We generate them two or three per second in fluent conversation, and overtly naming a clear picture of an object can easily be initiated within 600 ms after picture onset. The underlying process, however, is exceedingly complex. The theory reviewed in this target article analyzes this process as staged and feedforward. After a first stage of conceptual preparation, word generation proceeds through lexical selection, morphological and phonological encoding, phonetic encoding and articulation itself. In addition, the speaker exerts some degree of output control, by monitoring of self-produced internal and overt speech. The core of the theory, ranging from lexical selection to the initiation of phonetic encoding, is captured in a computational model, called WEAVER++. Both the theory and the computational model have been developed in interaction with reaction time experiments, particularly in picture naming or related word production paradigms, with the aim of accounting for the real-time processing in normal word production. A comprehensive review of theory, model and experiments is presented. The model can handle some of the main observations in the domain of speech errors (the major empirical domain for most other theories of lexical access), and the theory also opens new ways of approaching the cerebral organization of speech production by way of high-resolution temporal imaging.

1. AN ONTOGENETIC INTRODUCTION

Infants (from Latin infans - speechless) are human beings who cannot speak. It took most of us the whole first year of our life to overcome this infancy and to produce our first few meaningful words. But we haven't been idle as infants. We worked, rather independently, on two basic ingredients of word production. On the one hand, we established our primary notions of agency, interactancy, the temporal and causal structure of events, object permanence and location. This provided us with a matrix for the creation of our first lexical concepts, concepts flagged by way of a verbal label. Initially, these word labels were exclusively auditory patterns, picked up from the environment. On the other hand, we created a repertoire of babbles, a set of syllabic articulatory gestures. These motor patterns normally spring up around the seventh month. The child carefully attends to their acoustic manifestations, leading to elaborate exercises in the repetition and concatenation of these syllabic patterns. In addition, these audio-motor patterns start resonating with real speech input, becoming more and more tuned to the mother tongue (De Boysson-Bardies & Vihman, 1991; Elbers, 1982). These exercises provided us with a proto-syllabary, a core repository of speech motor patterns, which were, however, completely meaningless.

Real word production begins when the child starts connecting some particular babble (or a modification thereof) to some particular lexical concept. The privileged babble auditorily resembles the word label that the child has acquired perceptually. Hence, word production emerges from a coupling of two initially independent systems, a conceptual system and an articulatory motor system.

This duality is never lost in the further maturation of our word production system. Between the ages of 1;6 and 2;6 the explosive growth of the lexicon soon overtaxes the proto-syllabary. It is increasingly hard to keep all the relevant whole-word gestures apart. The child conquers this strain on the system by dismantling the word gestures through a process of phonemization; words become generatively represented as concatenations of phonological segments (Elbers & Wijnen, 1992; C. Levelt, 1994). As a consequence, phonetic encoding of words becomes supported by a system of phonological encoding. Adults produce words by spelling them out as a pattern of phonemes and as a metrical pattern. This more abstract representation in turn guides phonetic encoding, the creation of the appropriate articulatory gestures.

The other, conceptual root system becomes overtaxed as well. When the child begins to create multi-word sentences, word order is entirely dictated by semantics, i.e. by the prevailing relations between the relevant lexical concepts. One popular choice is "agent first", another one is "location last". But by the age of 2;6 this simple system starts foundering when increasingly complicated semantic structures present themselves for expression. Clearly driven by a genetic endowment, children restructure their system of lexical concepts by a process of syntactization. Lexical concepts acquire syntactic category and subcategorization features, verbs acquire specifications of how their semantic arguments (such as agent or recipient) are to be mapped onto syntactic relations (such as subject or object), nouns may acquire properties for the regulation of syntactic agreement, such as gender, etc. More technically speaking, the child develops a system of lemmas[1], packages of syntactic information, one for each lexical concept. At the same time, the child quickly acquires a closed class vocabulary, a relatively small set of frequently used function words. These words mostly fulfill syntactic functions; they have elaborate lemmas but lean lexical concepts. This system of lemmas is largely up and running by the age of four. From then on, producing a word always involves the selection of the appropriate lemma.

The original two-pronged system thus develops into a four-tiered processing device. In producing a content word, we as adult speakers first go from a lexical concept to its lemma. After retrieval of the lemma, we turn to the word's phonological code and use it to compute a phonetic-articulatory gesture. The major rift in the adult system still reflects the original duplicity in ontogenesis. It is between the lemma and the word form, i.e., between the word's syntax and its phonology, as is apparent from a range of phenomena, such as the tip-of-the-tongue state (Levelt, 1993).

2. SCOPE OF THE THEORY

In the following, we will first outline this word producing system as we conceive it. We will then turn in more detail to the four levels of processing involved in the theory: the activation of lexical concepts, the selection of lemmas, the morphological and phonological encoding of a word in its prosodic context and, finally, the word's phonetic encoding. In its present state, the theory doesn't cover the word's articulation. Its domain extends no further than the initiation of articulation. Although we have recently been extending the theory to cover aspects of lexical access in various syntactic contexts (Meyer, 1996), the present paper will be limited to the production of isolated prosodic words[4].

Every informed reader will immediately see that the theory is heavily indebted to the pioneers of word production research, among them Vicky Fromkin, Merrill Garrett, Stephanie Shattuck-Hufnagel and Gary Dell (see Levelt, 1989, for a comprehensive and therefore more balanced review of modern contributions to the theory of lexical access). It is probably in only one major respect that our approach is different from the classical studies. Rather than basing our theory on the evidence from speech errors, spontaneous or induced, we have almost exclusively developed and tested our notions by means of reaction time research. We felt this to be a necessary addition to existing methodology for a number of reasons. Models of lexical access have always been conceived as process models of normal speech production. Their ultimate test, we argued in Levelt et al. (1991b) and Meyer (1992), cannot lie in how they account for infrequent derailments of the process, but rather in how they deal with the normal process itself. Reaction time studies, of object naming in particular, can bring us much closer to this ideal. First, object naming is a normal, everyday activity indeed, and roughly one fourth of an adult's lexicon consists of names for objects. We admittedly start tampering with the natural process in the laboratory, but that hardly ever results in substantial derailments, such as naming errors or tip-of-the-tongue states. Second, reaction time measurement is still an ideal procedure for analyzing the time course of a mental process (with evoked potential methodology as a serious competitor). It invites the development of real-time process models, which not only predict the ultimate outcome of the process, but also account for a reaction time as the resultant of critical component processes.

Reaction time (RT) studies of word production have been around since the seminal studies of Oldfield and Wingfield (1965) and Wingfield (1968; see Glaser, 1992, for a review), and RT methodology is now widely used in studies of lexical access. Still, the theory to be presented here is rather unique in that its empirical scope is in the temporal domain. This has required a rather different type of modeling than is customary in the domain of error-based theories. It would be a misunderstanding though, to consider our theory as neutral with respect to speech errors. Not only has our theory construction always taken inspiration from speech error analyses, but ultimately, the theory should be able to account for error patterns as well as for production latencies. First efforts in that direction will be discussed in Section 10 of this paper.

Finally, we do not claim completeness for the theory. It is tentative in many respects and in need of further development. We have, for example, a much better understanding of access to open class words than of access to closed class words. However, we do believe that the theory is productive in that it generates new, non-trivial, but testable predictions. In the following we will indicate such possible extensions when appropriate.

3. THE THEORY IN OUTLINE

3.1 Processing Stages

The flow diagram in Figure 1 presents the theory in outline. The production of words is conceived as a staged process, leading from conceptual preparation to the initiation of articulation. Each stage produces its own characteristic output representation. They are, respectively, lexical concepts, lemmas, morphemes, phonological words and, finally, phonetic gestural scores (which are executed during articulation). In the following it will be a recurring issue whether these stages overlap in time or are strictly sequential, but here we will restrict ourselves to a summary description of what each of these processing stages is supposed to achieve.

3.1.1 Conceptual preparation

All open class words and most closed class words are meaningful. The intentional[2] production of a meaningful word always involves the activation of its lexical concept. The process leading up to the activation of a lexical concept is called "conceptual preparation". But there are many roads to Rome. In everyday language use, a lexical concept is often activated as part of a larger message that captures the speaker's communicative intention (Levelt, 1989). If a speaker intends to refer to a female horse, he may effectively do so by producing the word "mare", which involves the activation of the lexical concept MARE(X). But if the intended referent is a female elephant, the English speaker will resort to a phrase, such as "female elephant", because there is no unitary lexical concept available for the expression of that notion. A major issue, therefore, is how the speaker gets from the notion/information to be expressed to a message that consists of lexical concepts (here "message" is the technical term for the conceptual structure that is ultimately going to be formulated). This is called the verbalization problem and there is no simple one-to-one mapping of notions-to-be-expressed onto messages (Bierwisch & Schreuder, 1992). But even if a single lexical concept is formulated, as is usually the case in object naming, this indeterminacy still holds, because there are multiple ways to refer to the same object. In picture naming, the same object may be called "animal", "horse", "mare", or what have you, dependent on the set of alternatives and on the task. This is called perspective taking. There is no simple, hard-wired connection between percepts and lexical concepts. That transition is always mediated by pragmatic, context-dependent considerations. Our work on perspective taking has, till now, been limited to the lexical expression of spatial notions (Levelt, 1996).

Apart from these distal, pragmatic causes of lexical concept activation, our theory recognizes more proximal, semantic causes of activation. This part of the theory has been modeled by way of a conceptual network (Roelofs, 1992a, 1992b), to which we will return in Sections 3 and 4.1. The top layer of Figure 2 represents a fragment of this network. It depicts a concept node, ESCORT (X, Y), which stands for the meaning of the verb escort. It links up to other concept nodes, such as ACCOMPANY (X,Y), and the links are labeled to express the character of the connection (in this case IS-TO, because to ESCORT (X,Y) is to ACCOMPANY (X,Y)). In this network concepts will spread their activation via such links to semantically related concepts. This mechanism is at the core of our theory of lexical selection, as developed in Roelofs (1992a). A basic trait of this theory is its non-decompositional character. Lexical concepts are not represented by sets of semantic features, because that creates a host of counter-intuitive problems for a theory of word production. One is what Levelt (1989) has called the hyperonym problem. When a word's semantic features are active, then, by definition, the feature sets for all of its hyperonyms or superordinates are active (they are subsets). Still, there is not the slightest evidence that speakers tend to produce hyperonyms of intended target words. Another problem is the non-existence of a semantic complexity effect. It is not the case that words with more complex feature sets are harder to access in production than words with simpler feature sets (Levelt et al., 1978). These and similar problems vanish when lexical concepts are represented as undivided wholes.
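
The spreading of activation along labeled links can be illustrated with a minimal sketch. The two concept nodes and the IS-TO label are taken from the description above; the spreading rate and the one-step update rule are illustrative assumptions, not parameters of the computational model.

```python
# Toy conceptual network: whole (non-decomposed) concept nodes joined by labeled links.
# SPREAD_RATE is an illustrative assumption, not a value from the model.
SPREAD_RATE = 0.5

links = {
    "ESCORT(X,Y)": [("IS-TO", "ACCOMPANY(X,Y)")],
    "ACCOMPANY(X,Y)": [],
}

def spread(activation, links, rate=SPREAD_RATE):
    """One time step: every node sends a fraction of its activation along its links."""
    new = dict(activation)
    for source, outgoing in links.items():
        for _label, target in outgoing:
            new[target] = new.get(target, 0.0) + rate * activation.get(source, 0.0)
    return new

activation = {"ESCORT(X,Y)": 1.0, "ACCOMPANY(X,Y)": 0.0}
after_one_step = spread(activation, links)
```

Note that each concept is a single node: nothing in the sketch activates a hyperonym merely because a feature set is shared, only because an explicit link exists.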

The conceptual network's state of activation is also measurably sensitive to the speaker's auditory or visual word input (Levelt & Kelter, 1982). This is, clearly, another source of lexical concept activation. This possibility has been exploited in many of our experiments, in which a visual or auditory distractor word is presented while the subject is naming a picture.

Finally, Dennett (1991) suggested a pandemonium-like spontaneous activation of words in the speaker's mind. Although we haven't modeled this, there are three ways to implement such a mechanism. The first one would be to add spontaneous, statistical activation to lexical concepts in the network. The second one would be to do the same at the level of lemmas, whose activation can be spread back to the conceptual level (see below). And the third one would be to implement spontaneous activation of word forms; their resulting morpho-phonological encoding would then feed back as internal speech (see Figure 1) and activate the corresponding lexical concepts.

3.1.2 Lexical selection

Lexical selection is retrieving a word, or more specifically a lemma from the mental lexicon, given a lexical concept to be expressed. In normal speech, we retrieve some two or three words per second from a lexicon that contains tens of thousands of items. This high-speed process is surprisingly robust; errors of lexical selection occur in the 1 per thousand range. Roelofs (1992a) modeled this process by attaching a layer of lemma nodes to the conceptual network, one lemma node for each lexical concept. An active lexical concept spreads some of its activation to "its" lemma node and lemma selection is a statistical mechanism, which favors the selection of the highest activated lemma. Although this is the major selection mechanism, the theory does allow for the selection of function words on purely syntactic grounds (as in "John said that ...", where the selection of that is not conceptually, but syntactically driven). Upon selecting a lemma, its syntax becomes available for further grammatical encoding, i.e., creating the appropriate syntactic environment for the word. For instance, retrieving the lemma escort will make available that this is a transitive verb (node Vt(x,y) in Figure 2) with two argument positions (x and y), corresponding to the semantic arguments X and Y, etc.[3]

Many lemmas have so-called "diacritic parameters" that have to be set. For instance, in English, verb lemmas have features for number, person, tense and mood (see Figure 2). It is obligatory for further encoding that these features are valued. The lemma escort, for instance, will be phonologically realized as escort, escorts, escorted, escorting, dependent on the values of its diacritic features. The values of these features will in part derive from the conceptual representation. For example, tense being an obligatory feature in English, the speaker will always have to check the relevant temporal properties of the state of affairs being expressed. Notice that this need not have any communicative function. Still this extra bit of thinking has to be done in preparation of any tensed expression. Slobin (1987) usefully called this "thinking for speaking". For another part, these diacritic feature values will be set during grammatical encoding. A verb's number feature, for instance, is set by agreement, in dependence on the sentence subject's number feature. Here we must refrain from discussing these mechanisms of grammatical encoding (but see Levelt, 1989, Bock & Miller, 1991, and Bock & Levelt, 1994, for details).

3.1.3 Morpho-phonological encoding and syllabification

After having selected the syntactic word or lemma, the speaker is about to cross the rift mentioned above, going from the conceptual/syntactic domain to the phonological/articulatory domain. The task is now to prepare the appropriate articulatory gestures for the selected word in its prosodic context, and the first step here is to retrieve the word's phonological shape from the mental lexicon. Crossing the rift is not an entirely trivial matter. The tip-of-the-tongue phenomenon is precisely the momentary inability to retrieve the word form, given a selected lemma. Levelt (1989) predicted that in a tip-of-the-tongue state the word's syntactic features should be available in spite of the blockage, because they are lemma properties. In particular, a Dutch or an Italian speaker should know the grammatical gender of the target word. This has recently been experimentally demonstrated by Vigliocco et al. (1997) for Italian speakers. Similarly, certain types of anomia involve the same inability to cross this chasm. Badecker et al. (1995) showed this to be the case for an Italian anomic patient, who could hardly name any picture, but always knew the target word's grammatical gender. But even if word form access is unhampered, it is a lot harder for infrequent words than for frequent words; the difference in naming latency easily amounts to 50-100 milliseconds. Jescheniak and Levelt (1994) showed that word form access is the major, and probably unique locus of the word frequency effect (discovered by Oldfield and Wingfield, 1965).

According to the theory, accessing the word form means activation of three kinds of information, the word's morphological make-up, its metrical shape and its segmental make-up. For example, if the lemma is escort, diacritically marked for progressive tense, the first step is to access the two morphemes <escort> and <ing> (see Figure 2). Then, the metrical and segmental properties of these morphemes will be "spelled out". For escort, the metrical information involves that the morpheme is iambic, i.e., that it is disyllabic and stress-final, and that it can be a phonological word[4] itself. For <ing> the spelled out metrical information is that it is a monosyllabic, unstressed morpheme, which cannot be an independent phonological word (i.e., it must become attached to a phonological head, which in this case will be escort). The segmental spell-out for <escort> will be /ə/[5], /s/, /k/, /ɔ/, /r/, /t/, and for <ing> it will be /ɪ/, /ŋ/ (see Figure 2). Notice that there are no syllables at this level. The syllabification of the phonological word escort is e-scort, but this is not stored in the mental lexicon. In the theory, syllabification is a late process, because it often depends on the word's phonological environment. In escorting, for instance, the syllabification is different: e-scor-ting, where the syllable ting straddles the two morphemes escort and ing. One might want to argue that the whole word form escorting is stored, including its syllabification. However, syllabification can also transcend lexical word boundaries. In the sentence He'll escort us, the syllabification will usually be e-scor-tus. It is highly unlikely that this cliticized form is stored in the mental lexicon. An essential part of the theory, then, is its account of the syllabification process. We have modeled this process by assuming that a morpheme's segments or phonemes become simultaneously available, but with labeled links indicating their correct ordering (see Figure 2). The word's metrical template may stay as it is, or be modified in the context. In the generation of escorting (or escort us, for that matter), the "spelled out" metrical templates for <escort>, σσ', and for <ing> (or <us>), σ, will merge to form the trisyllabic template σσ'σ. The spelled-out segments are successively inserted into the current metrical template, forming phonological syllables "on the fly": e-scor-ting (or e-scor-tus). This process follows quite universal rules of syllabification (such as maximization of onset and sonority gradation - see below) as well as language-specific rules. There can be no doubt that these rules are there to create maximally pronounceable syllables. The domain of syllabification is called the "phonological" or "prosodic word"[4]. Escort, escorting, escortus can be phonological words, i.e. domains of syllabification. Some of the phonological syllables in which escort, in different contexts, can participate are represented in Figure 2. If the current phonological word is escorting, the relevant phonological syllables, e, scor, and ting, with word accent on scor, will activate the phonetic syllable scores [ə], [skɔr], and [tɪŋ].
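
The context dependence of syllabification can be made concrete with a small sketch. It implements onset maximization only; the metrical template, sonority gradation and language-specific rules are left out. The segment spellings are rough ASCII stand-ins for the phonemic segments, and the vowel set and the list of legal onsets are assumptions chosen to cover just these examples.

```python
# Rough ASCII stand-ins for segments; VOWELS and LEGAL_ONSETS are illustrative assumptions.
VOWELS = {"e", "o", "i", "u"}
LEGAL_ONSETS = {(), ("s",), ("k",), ("t",), ("r",), ("s", "k")}

def syllabify(segments):
    """Group an ordered list of segments into syllables, maximizing each onset."""
    nuclei = [i for i, seg in enumerate(segments) if seg in VOWELS]
    syllables = []
    start = 0
    for n, nucleus in enumerate(nuclei):
        if n + 1 < len(nuclei):
            next_nucleus = nuclei[n + 1]
            # Move the boundary rightwards until the consonants left for the
            # next syllable form a legal onset (the empty onset is always legal).
            boundary = nucleus + 1
            while tuple(segments[boundary:next_nucleus]) not in LEGAL_ONSETS:
                boundary += 1
            end = boundary
        else:
            end = len(segments)  # the last syllable takes all remaining segments
        syllables.append("".join(segments[start:end]))
        start = end
    return syllables

escort = syllabify(["e", "s", "k", "o", "r", "t"])                 # ['e', 'skort']
escorting = syllabify(["e", "s", "k", "o", "r", "t", "i", "ng"])   # ['e', 'skor', 'ting']
escort_us = syllabify(["e", "s", "k", "o", "r", "t", "u", "s"])    # ['e', 'skor', 'tus']
```

The same stored segments for escort end up in different syllables depending on what follows, which is why a fixed syllabification need not be stored with the word form.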

3.1.4 Phonetic encoding

The theory has only a partial account of phonetic encoding. The theoretical aim is to explain how a phonological word's gestural score is computed. It is a specification of the articulatory task that will produce the word, in the sense of Browman and Goldstein (1992)[6]. This is a, still rather abstract, representation of the articulatory gestures to be performed at different articulatory tiers, a glottal, a nasal, and an oral tier. One task, for instance, on the oral tier would be to close the lips (as should be done in a word like apple). The gestural score is abstract in that the way in which a task is performed is highly context dependent. Closing the lips after [æ], for instance, is a quite different gesture than closing the lips after rounded [u].

Our partial account involves the notion of a syllabary. We assume that a speaker has access to a repository of gestural scores for the frequently used syllables of the language. Many, though by no means all of the co-articulatory properties of a word are syllable-internal. There is probably more gestural dependency within a word's syllables than between its syllables (Browman & Goldstein, 1988; Byrd, 1995, 1996). More importantly, as we will argue, speakers of English or Dutch - languages with huge numbers of syllables - do most of their talking with no more than a few hundred syllables. Hence, it would be functionally advantageous for a speaker to have direct access to these frequently used and probably internally coherent syllabic scores. In the theory they are highly overlearned gestural patterns, which need not be re-computed time and again. Rather, they are ready-made in the speaker's syllabary. In our computational model, these syllabic scores are activated by the segments of the phonological syllables. For instance, when the active /t/ is the onset of the phonological syllable /tɪŋ/, it will activate all syllables in the syllabary that contain [t], and similarly for the other segments of /tɪŋ/. A statistical procedure will now favor the selection of the gestural score [tɪŋ] among all active gestural scores (cf. Section 6.3), whereas selection failures are prevented by the model's binding-by-checking mechanism (Section 3.2.3). As phonological syllables are successively composed (as discussed in the previous section), the corresponding gestural scores are successively retrieved. According to the present, partial, theory, the phonological word's articulation can be initiated as soon as all of its syllabic scores have been retrieved.
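
A toy version of syllabary access may help to fix ideas. Each segment of the phonological syllable adds activation to every stored syllable score that contains it, and the most active score is then picked. The four-entry syllabary, the unit activation increments and the deterministic choice of the maximum (in place of the statistical procedure) are all simplifying assumptions; segment spellings are again ASCII stand-ins.

```python
# Hypothetical miniature syllabary: score name -> the segments it is linked to.
SYLLABARY = {
    "ting": ("t", "i", "ng"),
    "tin": ("t", "i", "n"),
    "king": ("k", "i", "ng"),
    "skor": ("s", "k", "o", "r"),
}

def activate(phonological_syllable, syllabary=SYLLABARY):
    """Each segment sends one unit of activation to every score containing it."""
    activation = {name: 0 for name in syllabary}
    for segment in phonological_syllable:
        for name, parts in syllabary.items():
            if segment in parts:
                activation[name] += 1
    return activation

def select(phonological_syllable, syllabary=SYLLABARY):
    """Deterministic stand-in for the statistical selection: take the most active score."""
    activation = activate(phonological_syllable, syllabary)
    return max(activation, key=activation.get)

levels = activate(("t", "i", "ng"))   # ting: 3, tin: 2, king: 2, skor: 0
chosen = select(("t", "i", "ng"))     # 'ting'
```

Partially matching scores such as tin and king do become active, which is what makes a separate check on the selected score useful.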

This, obviously, cannot be the full story. First, the speaker can compose entirely new syllables (for instance in reading aloud a new word or non-word). It should be acknowledged, though, that it is a very rare occasion indeed that an adult speaker of English produces a new syllable. Second, there may be more phonetic interaction between adjacent syllables within a word than between the same adjacent syllables that cross a word boundary. Explaining this would either require larger, word-size stored gestural scores, or an additional mechanism of phonetic composition (or both).

3.1.5 Articulation

The phonological word's gestural score is, finally, executed by the articulatory system. The functioning of this system is beyond our present theory. The articulatory system is, of course, not just the muscular machinery that controls lungs, larynx and vocal tract. It is as much a computational neural system that controls the execution of abstract gestural scores by this highly complex motor system (see Levelt, 1989, for a review of motor control theories of speech production and Jeannerod, 1994, for a neural control theory of motor action).

3.1.6 Self-monitoring

The person we listen to most is ourself. We can and do monitor our overt speech output. Just as we can detect trouble in our interlocutor's speech, we can discover errors, dysfluencies or other problems of delivery in our own overt speech. This, obviously, involves our normal perceptual system (see Figure 1). So far, this ability is irrelevant for our present purposes. Our theory extends to the initiation of articulation, not beyond. But this is not the whole story. It is apparent from spontaneous self-repairs that we can also monitor our "internal speech" (Levelt, 1983), i.e., we can monitor some internal representation as it is produced during speech encoding. This may have some relevance for the latency of spoken word production, because the process of self-monitoring may affect encoding duration. In particular, such self-monitoring processes may be more intense in experiments where auditory distractors are presented to the subject. More important, though, is the possibility to exploit this internal self-monitoring ability to trace the process of phonological encoding itself. A crucial issue here is the nature of "internal speech". What kind of representation or code is it that we have access to when we monitor our "internal speech"? Levelt (1989) proposed that it is a phonetic representation, the output of phonetic encoding. Wheeldon and Levelt (1995), however, obtained experimental evidence for the speaker's ability to also monitor a slightly more abstract, phonological representation (in accordance with an earlier suggestion by Jackendoff, 1987). If this is correct, it gives us an additional means of studying the speaker's syllabification process (see Section 9). But it also forces us to modify the original theory of self-monitoring, which involved phonetic representations and overt speech.

3.2 General Design Properties

3.2.1 Network structure

As is already apparent from Figure 2, the theory is modeled in terms of an essentially feedforward activation-spreading network. In particular, Roelofs (1992a, 1993, 1994, 1996a, 1996b, in press-a) instantiated the basic assumptions of the theory in a computational model that covers the stages from lexical selection to syllabary access. The word-form encoding part of this computational model is called WEAVER (Word-form Encoding by Activation and VERification, see Roelofs 1996a, 1996b, in press-a) whereas the full model, including lemma selection is now called WEAVER++.

WEAVER++ integrates a spreading-activation based network with a parallel object-oriented production system, in the tradition of Collins and Loftus (1975). The structure of lexical entries in WEAVER++ was already illustrated in Figure 2 for the word "escort". There are four strata of nodes in the network. The first one is a conceptual stratum, which contains concept nodes and labeled conceptual links. A subset of these concepts are lexical concepts; they have links to lemma nodes in the next stratum. Each lexical concept, for example ESCORT(X,Y), is represented by an independent node. The links specify conceptual relations, for example between a concept and its superordinates, such as IS-TO-ACCOMPANY(X,Y). A word's meaning or, more precisely, sense is represented by the total of the lexical concept's labeled links to other concept nodes. Although the modeling of the conceptual stratum is highly specific to this model, no deep ontological claims about "network semantics" are intended. We only need a mechanism that ultimately provides us with a set of active, non-decomposed lexical concepts.

The second stratum contains lemma nodes, such as escort, syntactic property nodes, such as Vt(x,y), and labeled links between them. Each word in the mental lexicon, simple or complex, content word or function word, is represented by a lemma node. The word's syntax is represented by the labeled links of its lemma to the syntax nodes. Lemma nodes have diacritics, which are slots for the specification of free parameters, such as person, number, mood or tense, that are valued during the process of grammatical encoding. More generally, the lemma stratum is linked to a set of procedures for grammatical encoding (not to be discussed here).

After a lemma's selection, its activation spreads to the third stratum, the word form stratum. The word-form stratum contains morpheme nodes and segment nodes. Each morpheme node is linked to the relevant segment nodes. Notice that links to segments are numbered (see Figure 2). The segments linked to escort are also involved in the spell-out of other word forms, for instance Cortes, but then the links are numbered differently. The links between segments and syllable program nodes specify possible syllabifications. A morpheme node can also be specified for its prosody, the stress pattern across syllables. Related to this morpheme/segment stratum is a set of procedures that generate a phonological word's syllabification, given the syntactic/phonological context. There is no fixed syllabification for a word, as was discussed above. Figure 2 represents one possible syllabification of escort, but we could have chosen another; /skɔrt/, for instance would have been a syllable in the citation form of escort. The bottom nodes in this stratum represent the syllabary addresses. Each node corresponds to the gestural score of one particular syllable. For escorting these are the phonetic syllables [ə], [skɔr] and [tɪŋ].

What is a "lexical entry" in this network structure? Keeping as close as possible to the definition in Levelt (1989, p. 182), a lexical entry is an item in the mental lexicon, consisting of a lemma, its lexical concept (if any), and its morphemes (one or more) with their segmental and metrical properties.

3.2.2 Competition but no inhibition.

There are no inhibitory links in the network, either within or between strata. That doesn't mean that node selection is not subject to competition within a stratum. At the lemma and syllable levels the state of activation of non-target nodes does affect the latency of target node selection, following a simple mathematical rule (see Appendix).

3.2.3 Binding.

Any theory of lexical access has to solve a binding problem. If the speaker is producing the sentence Pages escort kings, at some time the lemmas page and king will be selected. How to prevent the speaker from erroneously producing Kings escort pages? The selection mechanism should, in some way, bind a selected lemma to the appropriate concept. Similarly, at some later stage, the segments of the word forms <king> and <page> are spelled out. How to prevent the speaker from erroneously producing Pings escort cages? The system must keep track of /p/ belonging to pages and /k/ belonging to kings. In most existing models of word access (in particular Dell, 1988; Dell et al., 1993) the binding problem is solved by timing. The activation/deactivation properties of the lexical network guarantee that, usually, the "intended" element is the most activated one at the crucial moment. Exceptions precisely explain the occasional speech errors. Our solution (Roelofs, 1992a, 1993, 1996b, in press-a) is a different one. It follows Bobrow and Winograd's (1977) "procedural attachment to nodes". Each node has a procedure attached to it that checks whether the node, when active, links up to the appropriate active node one level up. This mechanism will, for instance, discover that the activated syllable nodes [p__z] and [k__] do not correspond to the word form nodes <king> and <page>, and hence should not be selected[7]. For example, in the phonological encoding of king, the /k/ but not the /p/ will be selected and syllabified, because /k/ is linked to <king> in the network and /p/ is not. And in phonetic encoding, [k__z] will be selected, because the links in the network between [k__z] and its segments correspond with the syllable positions assigned to these segments during phonological encoding. For instance, /k/ will be syllabified as onset, which corresponds to the link between /k/ and [k__z] in the network. We will call this "binding-by-checking" as opposed to "binding-by-timing".

A major reason for implementing binding-by-checking is the recurrent finding that, during picture naming, distractor stimuli hardly ever induce systematic speech errors. When the speaker names the picture of a king, and simultaneously hears the distractor word page, he or she will neither produce the semantic error page, nor the phonological error ping, although both the lemma page and the phoneme /p/ are strongly activated by the distractor. This fact is more easily handled by binding-by-checking than through binding-by-timing. A perfect binding-by-checking mechanism will, of course, prevent any speech error. A systematic account of speech errors will require our theory to allow for lapses of binding, as in Shattuck-Hufnagel's (1979) "check off" approach.
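The contrast between binding-by-timing and binding-by-checking can be illustrated with a minimal Python sketch. It is not the WEAVER++ implementation: the simplified segment labels, the activation values, and the function names are hypothetical, chosen only to mimic the king/page example in which the distractor has strongly activated /p/.

```python
# Illustrative sketch, not WEAVER++ itself. All labels and numbers are assumed.

# Word-form nodes and the segments they link to (simplified ASCII labels).
FORM_LINKS = {
    "<king>": ["/k/", "/i/", "/ng/"],
    "<page>": ["/p/", "/ei/", "/dj/"],
}

# Hypothetical activation levels: the distractor "page" has boosted /p/
# above the target's own onset /k/.
ACTIVATION = {"/k/": 0.6, "/p/": 0.9}


def select_by_timing(candidates, activation):
    """Pick whichever candidate is most active at the moment of selection."""
    return max(candidates, key=lambda seg: activation[seg])


def select_by_checking(candidates, activation, selected_form, form_links):
    """Pick the most active candidate that is linked to the selected word form.

    The link check stands in for the procedure attached to each node, which
    verifies that the node connects to the appropriate active node one level up.
    """
    licensed = [seg for seg in candidates if seg in form_links[selected_form]]
    if not licensed:
        return None
    return max(licensed, key=lambda seg: activation[seg])


onset_candidates = ["/k/", "/p/"]
timing_choice = select_by_timing(onset_candidates, ACTIVATION)
checking_choice = select_by_checking(
    onset_candidates, ACTIVATION, "<king>", FORM_LINKS
)
print(timing_choice, checking_choice)
```

In this toy case, selection on activation alone would pick the onset of ping, whereas the link check keeps /k/ bound to <king>. A lapse of binding would correspond to the check being skipped or failing.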

3.2.4 Relations to the perceptual network

Though distractor stimuli don't induce speech errors, they are highly effective in modulating the speech production process. In fact, since the work by Schriefers et al. (1990), picture-word interference has been one of our main experimental methods. The effectiveness of word primes implies the existence of activation relations between perceptual and production networks for speech. These relations have traditionally been an important issue in speech and language processing (cf., Liberman, 1996): Are words produced and perceived by the same or by different mechanisms; and if the mechanisms are different, how are they related? We will not take a position, except that the feedforward assumption for our form stratum implies that form perception and production cannot be achieved by the same network, because this would require both forward and backward links in the network. An account of the theoretical and empirical motivation of the distinction between form networks for perception and production can be found elsewhere (Roelofs et al., 1996a). Interestingly, proponents of backward links in the form stratum for production (Dell et al., in press) have also argued for the position that the networks are (at least in part) different. Apart from adopting this latter position, we have only made some technical, though realistic, assumptions about the way in which distractor stimuli affect our production network (Roelofs et al., 1996). They are as follows:

Assumption 1 is that a distractor word, whether spoken or written, affects the corresponding morpheme node in the production network. This assumption finds support in evidence from the word perception literature. Spoken word recognition obviously involves phonological activation (McQueen et al., 1995). That visual word processing occurs along both visual and phonological pathways has time and again been argued (e.g. Coltheart et al., 1993; Seidenberg & McClelland, 1989). It is irrelevant here whether the one mediates the other. What matters is that there is phonological activation in visual word recognition. This phonological activation, we assume, directly affects the state of activation of phonologically related morpheme units in the form stratum of the production network.

Assumption 2 is that active phonological segments in the perceptual network can also directly affect the corresponding segment nodes in the production lexicon. This assumption is needed to account for phonological priming effects by non-word distractors (Roelofs, submitted-a).

Assumption 3 is that a spoken or written distractor word can affect corresponding nodes at the lemma level. Because recognizing a word, whether spoken or written, involves accessing its syntactic potential, i.e., the perceptual equivalent of the lemma, we assume activation of the corresponding lemma-level node. In fact, we will shortcut this issue here by assuming that all production lemmas are perceptual lemmas; the perceptual and production networks coincide from the lemma level upwards. But the lemma level is not affected by active units in the form stratum of the production network, whether or not their activation derives from input from the perceptual network; there is no feedback there.

A corollary of these assumptions is that one should expect cohort-like effects in picture-distractor interference. They are of different kinds. First, it follows from assumption 3 that there can be semantic cohort effects of the following type. When the word "accompany" is the distractor, it will activate the joint perception/production lemma accompany (see Figure 2). This lemma will spread activation to the corresponding lexical concept node ACCOMPANY(X,Y) (as it always does in perception). In turn, the concept node will co-activate semantically related concept nodes, such as the ones for ESCORT(X,Y) and SAFEGUARD(X,Y). Second, there is the possibility of phonological cohort effects, both at the form level and at the lemma level. When the target word is "escort" there will be relative facilitation by presenting "escape" as a distractor. This comes about as follows. In the perceptual network "escape" initially activates a phonological cohort that includes the word form and lemma of "escort" (for evidence concerning form activation, see Brown, 1990, and for lemma activation, see Zwitserlood, 1989). According to assumption 1, this will activate the word form node <escort> in the production network. Although there is the possibility that non-word distractors follow the same route (e.g., the distractor "esc" will produce the same initial cohort as "escape"), assumption 2 is needed to account for the facilitating effects of spoken distractors that correspond to a word-final stretch of the target word. Meyer and Schriefers (1991), for instance, obtained facilitation of naming words like "hammer" by presenting a distractor like "summer", which has the same word-final syllable. For all we know, this distractor hardly activates "hammer" in its perceptual cohort. But it will speed up the segmental spellout of all words containing "mer" in the production network. 
Meyer and Schriefers (1991; see also Roelofs, submitted-a, for related evidence) obtained the same facilitation effect when only the final syllable (i.e., "mer") was used as a distractor.

3.2.5 Ockham's razor

Both the design of our theory and the computational modelling have been guided by Ockham's methodological principle. The game has always been to work from a minimal set of assumptions. Processing stages are strictly serial: there is neither parallel processing nor feedback between lexical selection and form encoding (with the one, still restricted, exception of self-monitoring); there is no free cascading of activation through the lexical network; there are no inhibitory connections in the network; WEAVER++'s few parameters were fixed on the basis of initial data sets and then kept constant throughout all further work (as will be discussed in Sections 5.2 and 6.4). This minimalism did not emanate from an a-priori conviction that our theory is right. It is, rather, entirely methodological. We wanted theory and model to be maximally vulnerable. For a theory to be empirically productive, it should forbid certain empirical outcomes to arise. In fact, a rich and sophisticated empirical search has been arising from our theory's ban on activation spreading from an active but non-selected lemma (see Section 6.1.1) as well as from its ban on feedback from word form encoding to lexical selection (see Section 6.1.2), to give just two examples. On the other hand, we have been careful not to claim superiority for our serial stage reaction time model as compared to alternative architectures of word production on the basis of good old additive factors logic (Sternberg, 1969). Additivity does not uniquely support serial stage models; non-serial explanations of additive effects are sometimes possible (McClelland, 1979; Roberts & Sternberg, 1993). Rather, we had to deal with the opposite problem. 
How can apparently interactive effects, such as the semantic/phonological interaction in picture/word interference experiments (Section 5.2.3) or the statistical overrepresentation of mixed semantic/phonological errors (Section 6.1.2), still be handled in a serial stage model, without recourse to the extra assumption of a feedback mechanism?

4. CONCEPTUAL PREPARATION

4.1 Lexical Concepts as Output

Whatever the speaker tends to express, it should ultimately be cast in terms of lexical concepts, i.e., concepts for which there exist words in the target language. In this sense, lexical concepts form the terminal vocabulary of the speaker's message construction. That terminal vocabulary is, to some extent, language specific (Slobin, 1987; Levelt, 1989). By life-long experience, speakers usually know what concepts are lexically expressible in their language. Our theory of lexical access is not well-developed for this initial stage of conceptual preparation (but see Section 4.2). In particular, the computational model does not cover this stage. But in order to handle the subsequent stage of lexical selection, particular assumptions have to be made about the output of conceptual preparation. Why have we opted for lexical concepts as the terminal vocabulary of conceptual preparation?

It is a classical and controversial issue whether the terminal conceptual vocabulary is a set of lexical concepts or rather the set of primitive conceptual features that make up these lexical concepts. We assume that message elements make explicit the intended lexical concepts (cf., Fodor et al., 1980) rather than the primitive conceptual features that make up these concepts, as is traditionally assumed (e.g., Bierwisch & Schreuder, 1992; Goldman, 1975; Miller & Johnson-Laird, 1976; Morton, 1969). That is, we assume that there is an independent message element that says, for example, ESCORT(X,Y) instead of several elements that say something like IS-TO-ACCOMPANY(X,Y) and IS-TO-SAFEGUARD(X,Y) and so forth. The representation ESCORT(X,Y) gives access to conceptual features in memory such as IS-TO-ACCOMPANY(X,Y) but does not contain them as proper parts (Roelofs, 1997-a). Van Gelder (1990) referred to such representations as "functionally decomposed". Such memory codes, i.e., codes standing for more complex entities in memory, are traditionally called "chunks" (Miller, 1956).

There are good theoretical and empirical arguments for this assumption of chunked retrieval in our theory, which have been reviewed extensively elsewhere (Roelofs, 1992b, 1993, 1996a, and especially 1997a). In general, how information is represented greatly influences how easy it is to use it (cf., Marr, 1982). Any representation makes some information explicit at the expense of information that is left in the background. Chunked retrieval implies a message that indicates which lexical concepts need to be expressed, while leaving their featural composition in memory. Such a message provides the information needed for syntactic encoding, and reduces the computational burden for both the message encoding process and the process of lexical access. Mapping thoughts onto chunked lexical concept representations in message encoding guarantees that the message is ultimately expressible in the target language, and mapping these representations onto lemmas prevents the hyperonym problem (Section 6.3.1) from arising (see Roelofs, 1996a, 1997-a).

4.2 Perspective Taking

Any state of affairs can be expressed in many different ways. Take the scene at the top of Figure 3. Two possible descriptions, among many more, are: I see a chair with a ball to the left of it, and I see a chair with a ball to the right of it. Hence one can use the converse terms left and right here to refer to the same spatial relation. How come? It all depends on the perspective taken. The expression left of arises when the speaker resorts to "deictic" perspective in mapping the spatial scene onto a conceptual representation, deictic perspective being a three-term relation between the speaker as origin, the relatum (chair) and the referent (ball). But right of results when the speaker interprets the scene from an "intrinsic perspective", a two-term relation where the relatum (chair) is the origin and the referent (ball) relates to the intrinsic right side of the relatum. Depending on the perspective taken, the lexical concept RIGHT or LEFT gets activated (see Figure 3). Both lead to veridical descriptions. Hence, there is no hard-wired relation between the state of affairs and the appropriate lexical concept. Rather, the choice of perspective is free. Various aspects of the scene and the communicative situation make the speaker opt for one perspective or another (see Levelt, 1989, 1996, for reviews and experimental data).
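The difference between the two perspectives can be made concrete with a small geometric sketch in Python. The coordinates and the assumption that the chair faces the speaker are ours, introduced only for illustration; they are not read off Figure 3.

```python
# Illustrative sketch; the coordinates and the chair's orientation are assumed.

def side(facing, offset):
    """Classify `offset` as 'left' or 'right' of an origin looking along `facing`.

    Uses the sign of the 2-D cross product facing x offset: a positive value
    means the offset lies counterclockwise of the facing direction, i.e. left.
    An offset exactly in line with `facing` is arbitrarily reported as 'right'.
    """
    cross = facing[0] * offset[1] - facing[1] * offset[0]
    return "left" if cross > 0 else "right"


speaker = (0.0, 0.0)
chair = (0.0, 2.0)          # relatum
ball = (-1.0, 2.0)          # referent
chair_facing = (0.0, -1.0)  # assumed: the chair faces the speaker

# Displacement of the referent from the relatum.
offset = (ball[0] - chair[0], ball[1] - chair[1])

# Deictic: three terms (speaker, relatum, referent); the speaker's line of
# sight toward the relatum defines left and right.
line_of_sight = (chair[0] - speaker[0], chair[1] - speaker[1])
deictic = side(line_of_sight, offset)

# Intrinsic: two terms (relatum, referent); the relatum's own front defines
# left and right.
intrinsic = side(chair_facing, offset)

print(deictic, intrinsic)
```

The same ball position comes out as left under the deictic computation and as right under the intrinsic one, which is the sense in which both descriptions are veridical.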

Perspective taking is not just a peculiar aspect of spatial description. Rather, it is a general property of all referring. It is even an essential component in tasks as simple as picture naming. Should the object be referred to as an animal, a horse, or a mare? All can be veridical, but it depends on context which perspective is the most appropriate one. It is a convenient illusion in the picture naming literature that an object has a fixed name. But there is no such thing. Usually, there is only the tacit agreement to use basic level terms (Rosch et al., 1976). Whatever the intricacies of conceptual preparation, the relevant output driving the subsequent steps in lexical access is the active lexical concept.

5. LEXICAL SELECTION

5.1 Algorithm for Lemma Retrieval

The activation of a lexical concept is the proximal cause of lexical selection. How is a content word, or rather lemma (cf., Section 3.1.2), selected from the mental lexicon, given an active lexical concept? A basic claim of our theory is that lemmas are retrieved in a conceptually non-decomposed way. For example, the noun escort is retrieved on the basis of the abstract representation or chunk ESCORT(X,Y) instead of features such as IS-TO-ACCOMPANY(X,Y) and IS-TO-SAFEGUARD(X,Y). Retrieval starts by enhancing the level of activation of the node of the target lexical concept. Activation then spreads through the network, each node sending a proportion of its activation to its direct neighbors. The most highly activated lemma node is selected when verification allows. For example, in verbalizing "escort", the activation level of the lexical concept node ESCORT(X,Y) is enhanced. Activation spreads through the conceptual network and down to the lemma stratum. As a consequence, the lemma nodes escort and accompany will be activated. The escort node will be the most highly activated node, because it receives a full proportion of ESCORT(X,Y)'s activation, whereas accompany and other lemma nodes only receive a proportion of a proportion. Upon verification of the link between the lemma node of escort and ESCORT(X,Y), this lemma node will be selected. The selection of function words also involves lemma selection; each function word has its own lemma, i.e., its own syntactic specification. Various routes of lemma activation are open here. Many function words are selected in just the way described for selecting escort because they can be used to express semantic content. That is often the case for the use of prepositions, such as up or in. But the same prepositions can also function as part of particle verbs (as in look up, or believe in). Here they have no obvious semantic content. Section 5.3 will discuss how such particles are accessed in the theory.
The lemmas of still other function words are activated as part of a syntactic procedure, for instance that in the earlier example "John said that ...". Here we will not discuss this "indirect election" of lemmas (but see Levelt, 1989).
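The retrieval steps described above (enhance the target concept, spread activation, select the most active lemma subject to verification) can be sketched as follows in Python. This is a toy illustration under stated assumptions, not the WEAVER++ code: the four-node network, the one-off boost of the target concept, and the rate and decay values are hypothetical.

```python
# Illustrative sketch, not WEAVER++ itself. Rates, decay and links are assumed.

# Directed links: between the two lexical concepts, and from each concept
# down to its lemma.
LINKS = [
    ("ESCORT(X,Y)", "ACCOMPANY(X,Y)"),
    ("ACCOMPANY(X,Y)", "ESCORT(X,Y)"),
    ("ESCORT(X,Y)", "escort"),
    ("ACCOMPANY(X,Y)", "accompany"),
]
LEMMAS = ["escort", "accompany"]


def spread(activation, links, rate, decay):
    """One time step: every node keeps (1 - decay) of its own activation and
    receives `rate` times the activation of each node that links to it."""
    updated = {}
    for node, level in activation.items():
        incoming = sum(activation[src] for src, dst in links if dst == node)
        updated[node] = (1.0 - decay) * level + rate * incoming
    return updated


def select_lemma(activation, lemmas, links, target_concept):
    """Return the most active lemma, but only if its link to the target
    concept can be verified; otherwise return None."""
    best = max(lemmas, key=lambda lemma: activation[lemma])
    return best if (target_concept, best) in links else None


# Enhance the target lexical concept, then let activation spread.
activation = {
    "ESCORT(X,Y)": 1.0,
    "ACCOMPANY(X,Y)": 0.0,
    "escort": 0.0,
    "accompany": 0.0,
}
for _ in range(2):
    activation = spread(activation, LINKS, rate=0.25, decay=0.25)

selected = select_lemma(activation, LEMMAS, LINKS, "ESCORT(X,Y)")
print(activation["escort"], activation["accompany"], selected)
```

After two steps the lemma escort has received a proportion of the target concept's activation directly, whereas accompany has only received a proportion of a proportion, so escort is the most active lemma and passes verification.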

The equations that formalize WEAVER++ are given in Roelofs (1992a, 1992b, 1993, 1994, 1996b, in press-a). The appendix of the current paper gives an overview. There are simple equations for the activation dynamics and the instantaneous selection probability of a lemma node, that is, the hazard rate of the lemma retrieval process. The basic idea is that, for any smallest time interval, given that the selection conditions are satisfied, the selection probability of a lemma node equals the ratio of its activation to that of all the other lemma nodes (the "Luce ratio"). Given the selection ratio, the expectation of the retrieval time can be computed.
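A minimal Python sketch of the selection ratio and the resulting expected retrieval time is given here for illustration. It rests on assumptions that are ours rather than taken from the appendix: the ratio is computed with the target included in the denominator (so that it stays between 0 and 1), the hazard is held constant over time steps, and the activation values are hypothetical.

```python
# Illustrative sketch, not the WEAVER++ equations. All numbers are assumed.

def luce_ratio(target, activation):
    """Activation of the target lemma divided by the summed activation of all
    lemma nodes in the competition (target included, by assumption)."""
    return activation[target] / sum(activation.values())


def expected_selection_step(hazards):
    """Expected time step of selection for a discrete-time process.

    hazards[s-1] is the probability of selection at step s given no selection
    before s. The expectation is sum_s s * h(s) * prod_{j<s} (1 - h(j)),
    truncated at len(hazards).
    """
    survive = 1.0   # probability that no selection has occurred yet
    expected = 0.0
    for step, hazard in enumerate(hazards, start=1):
        expected += step * hazard * survive
        survive *= 1.0 - hazard
    return expected


STEPS = 200  # truncation horizon; the remaining probability mass is negligible

# A weakly active competitor versus a strongly active one.
ratio_weak = luce_ratio("chair", {"chair": 2.0, "fish": 0.5})
ratio_strong = luce_ratio("chair", {"chair": 2.0, "bed": 1.0})

steps_weak = expected_selection_step([ratio_weak] * STEPS)
steps_strong = expected_selection_step([ratio_strong] * STEPS)
print(steps_weak, steps_strong)
```

A more strongly activated competitor lowers the ratio and therefore lengthens the expected number of time steps before selection. Multiplying that number by the duration of one time step would give an expected retrieval time in milliseconds. In the model itself activation levels change from step to step, so the list of hazards would not be constant.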

5.2 Empirical RT Support

5.2.1 SOA curves of semantic effects

The retrieval algorithm explains, among other things, the classical curves of the semantic effect of picture and word distractors in picture naming, picture categorizing, and word categorizing. The basic experimental situation for picture naming is as follows. Participants have to name pictured objects while trying to ignore written distractor words superimposed on the pictures or spoken distractor words. For example, they have to say "chair" to a pictured chair and ignore the distractor word "bed" (semantically related to target word "chair") or "fish" (semantically unrelated). In the experiment, one can vary the delay between picture onset and distractor onset, the so-called "stimulus onset asynchrony" (SOA). The distractor onset can, typically, be at 400, 300, 200, 100 ms before picture onset (negative SOAs), simultaneously with, or at 100, 200, 300, 400 ms after picture onset (positive SOAs). The classical finding is shown in panel A of Figure 4. It is the SOA curve obtained by Glaser and Düngelhoff (1984), where the distractors were visually presented words. It shows a semantic effect (i.e., the difference between the naming latencies with semantically related and unrelated distractors) for different SOAs. Thus, a positive difference indicates a semantic inhibition effect. Semantic inhibition is obtained at SOA -100, 0, and 100 ms.

Before discussing these and the other data in Figure 4, we must present some necessary details about how WEAVER++ was fit to these data. The computer simulations of lemma retrieval in picture naming, picture categorizing, and word categorizing experiments were run with both small and larger lexical networks. The small network (see Figure 5) included the nodes that were minimally needed to simulate the conditions in the experiments. To examine whether the size of the network influenced the outcomes, the simulations were run using larger networks of either 25 or 50 words that contained the small network as a proper part. The small and larger networks produced equivalent outcomes.

All simulations were run using a single set of seven parameters whose values were held constant across simulations: (1) a real-time value in milliseconds for the smallest time interval (time step) in the model, (2) values for the general spreading rate at the conceptual stratum and (3) at the lemma stratum, (4) the decay rate, (5) the strength of the distractor input to the network, (6) the time interval during which this input was provided, and (7) a selection threshold. The parameter values were obtained by optimizing the goodness of fit between the model and a restricted number of data sets from the literature; other known data sets were subsequently used to test the model with these parameter values.

The data sets used to obtain the parameter values concerned the classical SOA curves of the inhibition and facilitation effects of distractors in picture naming, picture categorizing and word categorizing; they are all from Glaser and Düngelhoff (1984). Panels A, B and C of Figure 4 present these data sets (in total 27 data points) and the fit of the model. In estimating the 7 parameters from these 27 data points, parameters (1) to (5) were constrained to be constant across tasks, while parameters (6) and (7) were allowed to differ between tasks to account for task changes (i.e., picture naming, picture categorizing, word categorizing). Thus, WEAVER++ has significantly fewer degrees of freedom than the data contain. A goodness of fit statistic adjusted for the number of estimated parameter values showed that the model fit the data. (The adjustment "punished" the model for the estimated parameters.)

After fitting the model to the data of Glaser and Düngelhoff, the model was tested on other data sets in the literature and in new experiments specifically designed to test nontrivial predictions of the model. The parameter values of the model in these tests were identical to those in the fit of Glaser and Düngelhoff's data. Panels D, E and F of Figure 4 present some of these new data sets together with the predictions of the model. Note that WEAVER++ is not too powerful to be falsified by the data. In the graphs presented in Figure 4, there are 36 data points in total, 27 of which were simultaneously fit by WEAVER++ with only seven parameters; for the remainder no further fitting was done, except that parameter (7) was fine-tuned between experiments. So, there are substantially more empirical data points than there are parameters in the model. The fit of the model to the data is not trivial.

We will now discuss the findings in each of the panels of Figure 4 and indicate how WEAVER++ accounts for the data. As in any modeling enterprise, a distinction can be made between empirical phenomena that were specifically built into the model and phenomena that the model predicts but that had not been previously explored. For example, the effects of distractors are inhibitory in picture naming (Panel A of Figure 4) but they are facilitatory in picture and word categorizing (Panels B and C). This phenomenon was built into the model by restricting the response competition to permitted response words, which yields inhibition in naming but facilitation in categorizing, as we will explain below. Adopting this restriction led to predictions that had not been tested before. These predictions were tested in new experiments, the results of some of which are shown in Panels D to F of Figure 4. How does WEAVER++ explain the picture naming findings in panel A? We will illustrate the explanation using the miniature network given in Figure 5 (larger networks yield the same outcomes). The figure illustrates the conceptual stratum and the lemma stratum of two semantic fields, furniture and animals. Thus, there are lexical concept nodes and lemma nodes. It is assumed here that, in this task, presenting the picture activates the corresponding basic level concept (but see Section 4.2 above). Following the assumptions in Section 3.2.4, we suppose that distractor words have direct access to the lemma stratum. Now assume "chair" is the target. All distractors are names of other pictures in the experiment. In case of a pictured chair and distractor "bed", activation from the picture and the distractor word will converge on the lemma of the distractor "bed", due to the connections at the conceptual stratum. In case of the unrelated distractor "fish" there will be no such convergence.
Although the distractor "bed" will also activate the target lemma chair (via the concept nodes BED(X) and CHAIR(X)), the pictured chair will prime the distractor lemma bed more than the distractor word "bed" will prime the target lemma chair, due to network distances: three links versus four links (pictured chair → CHAIR(X) → BED(X) → bed versus word "bed" → bed → BED(X) → CHAIR(X) → chair). Consequently, it will take longer before the activation of chair exceeds that of bed than that of fish. Therefore, bed will be a stronger competitor than fish, which results in the semantic inhibition effect.
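The network-distance argument can be checked with a few lines of Python arithmetic. The sketch makes simplifying assumptions of our own: both inputs have equal strength, every link passes on the same hypothetical proportion of activation (the model itself uses separate spreading rates for the conceptual and lemma strata), and decay and time dynamics are ignored.

```python
# Illustrative arithmetic only; the single rate and unit input are assumed.

def priming(path, rate):
    """Activation reaching the last node of `path` when each link passes on
    a proportion `rate` of a unit input at the first node."""
    return rate ** (len(path) - 1)


RATE = 0.5  # hypothetical proportion passed along each link

picture_to_bed = ["pictured chair", "CHAIR(X)", "BED(X)", "bed"]
word_to_chair = ['word "bed"', "bed", "BED(X)", "CHAIR(X)", "chair"]

links_picture = len(picture_to_bed) - 1
links_word = len(word_to_chair) - 1

bed_from_picture = priming(picture_to_bed, RATE)
chair_from_word = priming(word_to_chair, RATE)
print(links_picture, links_word, bed_from_picture, chair_from_word)
```

For any rate between 0 and 1 the three-link path delivers more activation than the four-link path, which is why the picture primes the competitor bed more than the distractor primes the target chair.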

Let us now consider the panel B results. It is postulated in WEAVER++ that written distractors are only competitors when they are permitted responses in an experiment (i.e., when they are part of the response set). In case of picture or word categorization, furniture and animal instead of chair, bed, or fish are the targets. Now the model predicts a semantic facilitation effect. For example, the distractor "bed" will prime the target furniture, but will not be a competitor itself because it is not a permitted response in the experiment. By contrast, "fish" on a pictured chair will prime animal, which is a competitor of the target furniture. Thus, semantic facilitation is predicted, and this is also what is empirically obtained. Panel B of Figure 4 gives the results for picture categorizing (for example, when participants have to say "furniture" to the pictured bed and ignore the distractor word). Again, the semantic effect is plotted against SOA. A negative difference indicates a semantic facilitation effect. The data are again from Glaser and Düngelhoff (1984). WEAVER++ fits the data well.

Following the same reasoning, the same prediction holds for word categorizing, for example, when participants have to say "furniture" when they see the printed word "bed" but have to ignore the picture behind it. Panel C of Figure 4 gives the results for word categorizing. Again, WEAVER++ fits the data.

Still another variant is picture naming with hyperonym, cohyponym, and hyponym distractors superimposed. As long as these distractors are not part of the response set, they should facilitate naming relative to unrelated distractors. For example, in naming a pictured chair (the only picture of a piece of furniture in the experiment), the distractor words "furniture" (hyperonym), "bed" (cohyponym), or "throne" (hyponym) are superimposed. Semantic facilitation was indeed obtained in such an experiment (Roelofs, 1992a, 1992b). Panel D of Figure 4 plots the semantic facilitation against SOA. The semantic effect was the same for hyperonym, cohyponym and hyponym distractors. The curves represent means across these types of word. The findings concerning the facilitation effect of hyponym distractors exclude one particular solution to the hyp(er)onymy problem in lemma retrieval. Bierwisch and Schreuder (1992) have proposed that the convergence problem is solved by inhibitory links between hyponyms and hyperonyms in a logogen type system. However, this predicts semantic inhibition from hyponym distractors, but facilitation is what you obtain.

The WEAVER++ model is not restricted to the retrieval of noun lemmas. Thus, the same effects should be obtained in naming actions using verbs. For example, ask participants to say "drink" to the picture of a drinking person (notice the experimental induction of perspective taking) and to ignore the distractor words "eat" or "laugh" (names of other actions in the experiment). Indeed, again semantic inhibition is obtained in that experiment, as shown in Panel F of Figure 4 (Roelofs, 1993). Also facilitation is again predicted for hyponym distractors that are not permitted responses in the experiment. For instance, the participants have to say "drink" to a drinking person and ignore "booze" or "whimper" (not permitted responses in the experiment) as distractors. Semantic facilitation is indeed obtained in this paradigm, as shown in Panel E of Figure 4 (Roelofs, 1993).

In summary, the predicted semantic effects have been obtained for nouns, verbs, and adjectives (e.g., color, which is the classical Stroop effect), not only in producing single words (e.g., Glaser & Glaser, 1989; Roelofs, 1992a, 1992b, 1993), but also for lexical access in producing phrases, as has been shown by Schriefers (1993). To study semantic (and phonological) priming in sentence production, Meyer (1996) used auditory primes and found semantic inhibition, although the distractors were not in the response set. In an as yet unpublished study, Roelofs obtained semantic facilitation from written distractor words, but semantic inhibition when the same distractor words were presented auditorily. Why it is, time and again, hard to obtain semantic facilitation from auditory distractors is still unexplained.

5.2.2 Semantic versus conceptual interference

One could ask whether the semantic effects reported in the previous section could not be explained by access to the conceptual stratum. In other words, are they properties of lexical access proper? They are; the semantic effects are only obtained when the task involves producing a verbal response. In a control experiment carried out by Schriefers et al. (1990), participants had to categorize pictures as "old" or "new" by pressing one of two buttons, i.e., they were not naming the pictures. In a preview phase of the experiment, the participants had seen half of the pictures. Spoken distractor words were presented during the old/new categorization task. In contrast with the corresponding naming task, no semantic inhibition effect was obtained. This suggests that the semantic interference effect is due to lexical access rather than to accessing conceptual memory. Of course, these findings do not exclude interference effects at the conceptual level. Schriefers (1990) asked participants to refer to pairs of objects by saying whether an object marked by a cross was bigger or smaller than the other, i.e., the subject produced the verbal response "bigger" or "smaller". But there was an additional variable in the experiment: Both objects could be relatively large, or both could be relatively small. Hence, not only relative size, but also absolute size was varied. In this relation naming task, a congruency effect was obtained. Participants were faster in saying "smaller" when the absolute size of the objects was small than when it was big, and vice versa. In contrast to the semantic effect of distractors in picture naming, this congruency effect was a concept level effect. The congruency effect remained when the participants had to press one button when the marked object was taller and another button when it was shorter.

5.2.3 Interaction between semantic and orthographic factors

Starreveld and La Heij (1995; see also Starreveld and La Heij, 1996) observed that the semantic inhibition effect in picture naming is reduced when there is an orthographic relationship between target and distractor. For example, in naming a picture of a cat, the semantic inhibition was less for distractor "calf" compared to "cap" (orthographically related to "cat") than for distractor "horse" compared to "house". According to Starreveld and La Heij, this interaction suggests that there is feedback from the word form level to the lemma level, i.e., from word forms <calf> and <cap> to lemma cat, contrary to our claim that the word form network contains forward links only. However, as we have argued elsewhere (Roelofs et al. 1996), Starreveld and La Heij overlooked that, in our theory, printed words activate their lemma nodes and word form nodes in parallel (cf., Section 3.2.4). Thus, printed words may affect lemma retrieval directly, and there is no need for backward links from word form nodes to lemmas in the network. Computer simulations showed that WEAVER++ predicts that in naming a pictured cat, the semantic inhibition will be less for distractor "calf" compared to "cap" than for distractor "horse" compared to "house", as empirically observed.

5.3 Accessing Morphologically Complex Words

There are different routes for a speaker to generate morphologically complex words, depending on the nature of the word. We distinguish four cases, depicted in Figure 6.

5.3.1 The degenerate case

Some words may linguistically count as morphologically complex, but are not psychologically so. An example is replicate, which historically has a morpheme boundary between re and plicate. That this is no longer the case is apparent from the word's syllabification: rep-li-cate (which even violates maximization of onset). Normally, the head morpheme of a prefixed word will behave as a phonological word (4) itself, hence syllabification will respect its integrity. This is not the case in replicate, where p syllabifies with the prefix (note that it still is the case in re-ply, which has the same latinate origin, re-plicare). Such words are monomorphemic for all processing means and purposes (Figure 6a).

5.3.2 The single-lemma-multiple-morpheme case

This is the case depicted in Figure 6b and in Figure 2. The word escorting is generated from a single lemma escort that is marked for +progressive. It is only at the word form level that two nodes are involved, one for <escort> and the other one for <ing>. Regular inflections are probably all of this type. But irregular verb inflections are not, usually. The lemma go+past will activate the one morpheme <went>. Although inflections for number will usually go with the regular verb inflections, there are probably exceptions here - see Section 5.3.5. The case is more complicated for complex derivational morphology. Most of the frequently used compounds are of the type discussed here. For example, blackboard, sunshine, hotdog, and offset are most likely single lemma items, though thirty-nine and complex numbers in general (cf., Miller, 1991) may not be. Words with bound derivational morphemes form a special case. These morphemes typically change the word's syntactic category. But syntactic category is a lemma-level property. The simplest story, therefore, is to consider them to be single-lemma cases, carrying the appropriate syntactic category. This won't work though for more productive derivation, to which we will shortly return.

5.3.3 The single-concept-multiple-lemma case

The situation in Figure 6c is best exemplified by the case of particle verbs. A verb such as "look up" is represented by two lemma nodes in our theory and computational model (Roelofs, in press-b). Particle verbs are not words but minimal verb projections (Booij, 1995). Given that the semantic interpretation of particle verbs is often not simply a combination of the meanings of the particle and the base (hence they don't stem from multiple concepts), the verb-particle combinations have to be listed in the mental lexicon. In producing a verb-particle construction, the lexical concept selects for a pair of lemma nodes from memory and makes them available for syntactic encoding processes. Some experimental evidence on the encoding of particle verbs will be presented in Section 6.4.4.

A very substantial category of this type is formed by idioms. The production of "kick the bucket" probably derives from activating a single, whole lexical concept, which in turn selects for multiple lemmas (cf., Everaert, van der Linden, & Schreuder, 1995).

5.3.4 The multiple-concept case

This case, represented in Figure 6d, includes all derivational new-formations. Clearest here are newly formed compounds, the most obvious case being complex numbers. At the conceptual level the number 1007 is probably a complex conceptualization, with the lexical concepts 1000 and 7 as terminal elements. These, in turn, select for the lemmas thousand and seven, respectively. The same process is probably involved in generating other new compounds, for example when a creative speaker produced the word sitcom for the first time. There are still other derivational new-formations, those with bound morphology, that seem to fit this category. Take very low-frequency X-ful words, such as bucketful. Here, the speaker may never have heard or used the word before, hence doesn't yet have a lemma for it. There are probably two active lexical concepts involved here, BUCKET and something like FULL, each selecting for its own lemma. Semantics is clearly compositional in such cases. Productive derivational uses of this type require the bound morpheme at the lemma level to determine the word's syntactic category during the generation process.

Do these four cases exhaust all possibilities in the generation of complex morphology? It doesn't seem so, as will appear in the following section.

5.3.5 Singular- and plural-dominant nouns

In an as yet unpublished study, Baayen, Levelt and Haveman asked subjects to name pictures containing one or two identical objects, and to use singular or plural, respectively. The depicted objects were of two kinds. The first type, so-called singular dominants, were objects whose name was substantially more frequent in the singular than in the plural form. An example is "nose", where nose is more frequent than noses. For the second type, the so-called plural dominants, the situation was reversed, the plural being more frequent than the singular. An example is "eye", with eyes more frequent than eye. The upper panel of Figure 7 presents the naming latencies for relatively high-frequency singular and plural dominant words.

These results display two properties, one of them remarkable. The first one is a small but significantly longer latency for plurals than for singulars. That was expected, because of greater morphological complexity. The remarkable finding is that both the plural dominant singulars (such as eye) and the plural dominant plurals (such as eyes) were significantly slower than their singular dominant colleagues, although the stem frequency was controlled to be the same for the plural and the singular dominants. Also, there was no interaction. This indicates, first, that there was no surface frequency effect - the relatively high-frequency plural dominant plurals had the longest naming latencies. Since the surface frequency effect originates at the word form level, as will be shortly discussed in Section 6.1.3, a word's singular and plural are likely to access the same morpheme node at the word form level. More enigmatic is why plural dominants are so slow. A possible explanation is depicted in Figure 7, panels B and C. The "normal" case is singular dominants. In generating the plural of "nose", the speaker first activates the lexical concepts NOSE and something like MULTIPLE. Together, they select for the one lemma nose, with diacritic feature "pl". The lemma with its plural feature then activates the two morpheme nodes <nose> and <-_z>, following the single-lemma-multiple-morpheme case of Section 5.3.2. But the case may be quite different for plural dominants, such as "eye". Here there are probably two different lexical concepts involved in the singular and the plural. The word "eyes" is not just the plural of "eye"; there is also some kind of meaning difference: "eyes" has the stronger connotation of "gaze". And similar shades of meaning variation exist between "ears" and "ear", "parents" and "parent", etc. This is depicted in Panel C of Figure 7. Accessing the plural word "eyes" begins by accessing the specific lexical concept EYES. This selects for its own lemma eyes (with a diacritic plural feature). This in turn activates morphemes <eye> and <z> at the word form level. Singular "eye" is similarly generated from the specific lexical concept EYE. It selects for its own (singular) lemma eye. From here activation converges on the morpheme <eye> at the word form level.

How do the diagrams in Panels B and C account for the experimental findings? For both the singular and the plural dominants, singular and plural converge on the same morpheme at the word form level. This explains the lack of a surface frequency effect. That the plural dominants are relatively slow, for both the singular and the plural, follows from the main lemma selection rule, discussed in Section 5.1. The semantically highly related lexical concepts EYE and EYES will always be co-activated, whichever is the target. As a consequence, both lemmas eye and eyes will receive activation, whichever is the target. The lexical selection rule then predicts relatively long selection latencies for both the singular and the plural lemma (following Luce's rule), because of competition between active lemmas. This is not the case for selecting nose - there is no competitor there.
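The competition argument can be made concrete with a small numerical sketch. It assumes that, at each time step, the probability of selecting the target lemma equals its Luce ratio (its activation divided by the summed activation of all active lemmas), and that activations stay constant over time, which is a simplification. The activation values and function names are hypothetical illustrations, not parameters or code of WEAVER++.

```python
def selection_probability(target_activation, competitor_activations):
    """Luce ratio: the target's share of the total lemma activation."""
    total = target_activation + sum(competitor_activations)
    return target_activation / total


def expected_steps(p):
    """Mean number of time steps until selection when the per-step
    selection probability p is constant (geometric distribution)."""
    return 1.0 / p


# Hypothetical activation levels (assumptions, chosen for illustration only).
TARGET = 1.0        # activation of whichever lemma is the target
CO_ACTIVATED = 0.8  # activation of the co-activated sibling lemma

# Singular dominant "nose"/"noses": one lemma, no competitor.
p_nose = selection_probability(TARGET, [])

# Plural dominant: lemmas eye and eyes compete, whichever is the target.
p_eye = selection_probability(TARGET, [CO_ACTIVATED])   # target eye
p_eyes = selection_probability(TARGET, [CO_ACTIVATED])  # target eyes

for name, p in [("nose", p_nose), ("eye", p_eye), ("eyes", p_eyes)]:
    print(name, round(p, 3), round(expected_steps(p), 2))
```

Under these assumptions the expected selection time is longer for both eye and eyes than for nose, and equally so for the two, which corresponds to a main effect of dominance without an interaction.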

In conclusion, the generation of complex morphology may involve various levels of processing, dependent on the case at hand. It will always be an empirical issue to determine what route is followed by the speaker in any concrete instance.

5.4 Accessing Lexical Syntax and the Indispensability of the Lemma Level

A core feature of the theory is that lexical selection is conceived of as selecting the syntactic word. What the speaker selects from the mental lexicon is an item that is just sufficiently specified to function in the developing syntax. To generate fluent speech incrementally, the first bit of lexical information needed is the word's syntax. Accessing word form information is less urgent in the process (cf., Levelt, 1989). But what evidence do we have that lemma and word form access are really distinct operations?

5.4.1 Tip-of-the-tongue states

Recent evidence supporting the distinction between a lemma and form level of access comes from the tip-of-the-tongue phenomenon. As mentioned above (Section 3.1.3), Italian speakers in tip-of-the-tongue states most of the time know the grammatical gender of the word, a crucial syntactic property in the generation of utterances (Vigliocco et al., 1997). However, they know the form of the word only partially or not at all. The same has been shown for an Italian anomic patient (Badecker et al., 1995), confirming earlier evidence for French anomic patients (Henaff-Gonon et al., 1989). This shows that lemma access can succeed where form access fails.

5.4.2 Agreement in producing phrases

A further argument for the existence of a distinct syntax accessing operation proceeds from gender priming studies. Schriefers (1993) asked Dutch participants to describe coloured pictured objects using phrases. For example, they had to say de groene tafel ("the green table") or groene tafel ("green table"). In Dutch, the grammatical gender of the noun (non-neuter for tafel - "table") determines which definite article should be chosen (de for non-neuter and het for neuter) and also the inflection on the adjective (groene or groen - "green"). On the pictured objects, written distractor words were superimposed that were either gender congruent or incongruent with the target. For example, the distractor muis - "mouse" takes the same non-neuter gender as the target tafel - "table", whereas distractor hemd - "shirt" takes neuter gender. Schriefers obtained a gender congruency effect, as predicted by WEAVER++. Shorter production latencies were obtained when the distractor noun had the same gender as the target noun compared to a distractor with a different gender (see also Van Berkum, 1996, in press). According to WEAVER++, this gender congruency effect should only be obtained when agreement has to be computed, that is, when the gender node has to be selected in order to choose the appropriate definite article or the gender marking on the adjective, but not when participants have to produce bare nouns, that is, in "pure" object naming. WEAVER++ makes a distinction between activation of the lexical network and the actual selection of nodes. All noun lemma nodes point to one of the grammatical gender nodes (two in Dutch), but there are no backward pointers (see Figure 1). Thus, boosting the level of activation of the gender node by a gender-congruent distractor will not affect the level of activation of the target lemma node and therefore will not influence the selection of the lemma node. Consequently, priming a gender node will only affect lexical access when the gender node itself has to be selected. This is the case when the gender node is needed for computing agreement between adjective and noun. Thus, the gender congruency effect should only be obtained in producing gender-marked utterances, not in producing bare nouns. This corresponds to what is empirically observed (Jescheniak, 1994).

5.4.3 A short-lived frequency effect in accessing gender

A further argument for an independent lemma representation derives from experiments by Jescheniak and Levelt (1994; Jescheniak, 1994). They demonstrated that when lemma information such as grammatical gender is accessed, an idiosyncratic frequency effect is obtained. Dutch participants had to decide on the gender of a picture's name (e.g., they had to decide that the grammatical gender of tafel - "table" is non-neuter), which was done faster for high-frequency words than for low-frequency ones. The effect quickly disappeared over repetitions, contrary to a "robust" frequency effect obtained in naming the pictures (to be discussed in Section 6.1.3 below). In spite of substantial experimental effort (van Berkum, 1996, in press), the source of this short-lived frequency effect has not been discovered. What matters here, however, is that gender and form properties of the word bear markedly different relations to word frequency.

5.4.4 Lateralized readiness potentials

Exciting new evidence for the lemma/word form distinction in lexical access stems from a series of experiments by van Turennout et al. (1997, in preparation). The authors measured event related potentials in a situation where the participants named pictures. On the critical trials, a gender/segment classification task was to be performed before naming, which made it possible to measure lateralized readiness potentials (LRPs, cf., Coles et al., 1988; Coles, 1989). This classification task consisted of a conjunction of a pushbutton response with the left or right hand and a go/no-go decision. In one condition, the decision whether to give a left or right hand response was determined by the grammatical gender of the picture name (e.g., respond with the left hand if the gender is non-neuter and with the right hand if it is neuter). The decision whether or not to carry out the response was determined by the first segment of the picture name (e.g., respond if the first segment is /b/, otherwise do not respond). So, if the picture was one of a bear (Dutch "beer" with non-neuter gender) the participants responded with their left hand; if the picture was one of a wheel (Dutch "wiel" with neuter gender) they did not respond. The measured LRPs show whether the participants prepared for pushing the correct button not only on the go-trials but also on the nogo-trials. For example, the LRPs show whether there is response preparation for a picture whose name does not start with the critical phoneme. When gender determined the response hand and the segment determined whether to respond, the LRP showed preparation for the response hand on both the go- and the nogo-trials. However, in a condition where the situation was reversed, that is, where the first segment determined the response hand and the gender determined whether to respond or not, the LRP showed preparation for the response hand on the go-trials but not on the nogo-trials.

These findings show that in accessing lexical properties in production, you can access a lemma property, gender, and halt there before beginning to prepare a response to a word form property of the word. But the reverse is not possible. In this task you will have accessed gender before you access a form property of the word. Again these findings support the notion that a word's lexical syntax and its phonology are distinct representations that can be accessed in this temporal order only. In other experiments, the authors showed that onsets of LRP preparation effects in monitoring word onset and word offset consonants (e.g., /b/ versus /r/ in target bear) differed by 80 ms on average. This gives an indication of the speed of phonological encoding, to which we will return in Section 9.
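The logic of this inference can be laid out as a small sketch. It assumes strictly ordered availability of gender and first-segment information; the millisecond values and the function name are hypothetical and serve only to fix the order, not to estimate real latencies.

```python
# Hypothetical moments (ms after picture onset) at which each property
# becomes available; only their order matters (assumption: gender first).
AVAILABLE_AT_MS = {"gender": 300, "first_segment": 400}


def hand_preparation_on_nogo(hand_cue, go_cue, available_at=AVAILABLE_AT_MS):
    """Return True if response-hand preparation (an LRP) can begin on
    no-go trials, i.e., if the property that selects the hand is
    available before the property that cancels the response."""
    return available_at[hand_cue] < available_at[go_cue]


# Condition 1: gender selects the hand, first segment decides go/no-go.
print(hand_preparation_on_nogo("gender", "first_segment"))  # True

# Condition 2 (reversed): first segment selects the hand, gender decides.
print(hand_preparation_on_nogo("first_segment", "gender"))  # False
```

The observed asymmetry between the two conditions is what the sketch yields when gender precedes the first segment. Swapping the two hypothetical times would reverse the predictions, contrary to the data.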

5.4.5 Evidence from speech errors

The findings discussed so far in this section support the notion that accessing lexical syntax is a distinct operation in word access. A lemma level of word encoding explains semantic interference effects in the picture-word interference paradigm, findings on tip-of-the-tongue states, gender congruency effects in computing agreement, specific frequency effects in accessing gender information, and event related potentials in accessing lexical properties of picture names.

Although our theory has (mostly) been built upon such latency data, this section would not be complete without referring to the classical empirical support for a distinction between lemma retrieval and word-form encoding coming from speech errors. A lemma level of encoding explains the different distribution of word and segment exchanges. Word exchanges, such as the exchange of roof and list in we completely forgot to add the list to the roof (from Garrett, 1980), typically concern elements from different phrases and of the same syntactic category (here: noun). By contrast, segment exchanges, such as rack pat for pack rat (from Garrett, 1988), typically concern elements from the same phrase and do not respect syntactic category. This finding is readily explained by assuming lemma retrieval during syntactic encoding and segment retrieval during subsequent word-form encoding.

Speech errors also provide support for a morphological level of form encoding that is distinct from a lemma level with morphosyntactic parameters. Some morphemic errors appear to concern the lemma level, whereas others involve the form level (e.g., Dell, 1986; Garrett, 1975, 1980, 1988). For example, in how many pies does it take to make an apple? (from Garrett, 1988), the interacting stems belong to the same syntactic category (i.e., noun) and come from distinct phrases. Note that the plurality of apple is stranded, that is, it is realized on pie. Thus, the number parameter is set after the exchange. The distributional properties of these morpheme exchanges are similar to those of whole-word exchanges. This suggests that these morpheme errors and whole-word errors occur at the same level of processing, namely when lemmas in a developing syntactic structure trade places. By contrast, the exchanging morphemes in an error such as slicely thinned (from Stemberger, 1985b) belong to different syntactic categories (adjective and verb) and come from the same phrase, which is also characteristic of segment exchanges. This suggests that this second type of morpheme error and segment errors occur at the same level of processing, namely the level at which morphemes and segments are retrieved and the morpho-phonological form of the utterance is constructed. The errors occur when morphemes in a developing morpho-phonological structure trade places.

The sophisticated statistical analysis of lexical speech errors by Dell and colleagues (Dell, 1986; 1988) has theoretically always involved a level of lemma access, distinct from a level of form access. Recently, Dell et al. (in press) performed an extensive picture naming study on 23 aphasic patients and 60 matched normal controls, analyzing the spontaneous lexical errors produced in this task. For both normals and patients a perfect fit was obtained with a two-level spreading activation model, i.e., one that distinguishes a level of lemma access. Although the model differs from WEAVER++ in other respects, there is no disagreement about the indispensability of a lemma stratum in the theory.

6. MORPHOLOGICAL AND PHONOLOGICAL ENCODING

After having selected the appropriate lemma, the speaker is in the starting position to encode the word as a motor action. Here the functional perspective is quite different from the earlier move towards lexical selection. In lexical selection the job is to select the one appropriate word from among tens of thousands of lexical alternatives. But in preparing an articulatory action, lexical alternatives are irrelevant; there is only one pertinent word form to be encoded. What counts is context. The task is to realize the word in its prosodic environment. The dual function here is for the prosody to be expressive of the constituency in which the word partakes and to optimize pronounceability. One aspect of expressing constituency is marking the word as a lexical head in its phrase. This is done through phonological phrase construction, which will not be discussed here (but see Levelt, 1989). An aspect of optimizing pronounceability is syllabification in context. This is, in particular, achieved through phonological word formation, as we introduced in Section 3.1.3. Phonological word formation is a central part of the present theory, to which we will shortly return. But the first move in morpho-phonological encoding is to access the word's phonological specification in the mental lexicon.

6.1 Accessing Word Forms

6.1.1 The accessing mechanism

Given the function of word form encoding, it would appear counterproductive to activate the word forms of all active lemmas that are not selected[8]. After all, their activation can only interfere with the morpho-phonological encoding of the target or, alternatively, there should be special, built-in mechanisms to prevent this - a curiously baroque design. In Levelt et al. (1991a) we therefore proposed the following principle:

Only selected lemmas will become phonologically activated.

Whatever the face value of this principle, it is obviously an empirical issue. Levelt et al. (1991a) put it to test in a picture naming experiment. Subjects were asked to name a series of pictures. On about one third of the trials an auditory probe was presented 73 ms after picture onset. The probe could be a spoken word or a non-word, and the subject had to make a lexical decision on the probe stimulus by pushing one of two buttons; the reaction time was measured. In the critical trials, the probe was a word and it could be an identical, a semantic, a phonological or an unrelated probe. For example, if the picture was one of a sheep, the identical probe was the word sheep and the semantic probe was goat. The critical probe was the phonological one. In a preceding experiment, we had shown that, under the same experimental conditions, a phonological probe related to the target, such as sheet in the example, showed a strong latency effect in lexical decision, testifying to the phonological activation of the target word, the picture name sheep. But in this experiment we wanted to test whether a semantic alternative, such as goat, showed any phonological activation. Hence, we now used a phonological probe related to that semantic alternative. In the example that would be the word goal, which is phonologically related to goat. The unrelated probe, finally, had no semantic or phonological relation to the target or its semantic alternatives. Figure 8 shows the main findings of this experiment.

Both the identical and semantic probes are significantly slower in lexical decision than the unrelated probes. But the phonological probe, related to the (active) semantic alternative, shows not the slightest effect. This is in full agreement with the above activation principle. A non-selected semantic alternative stays phonologically inert. This case exemplifies the Ockham's razor approach discussed in Section 3.2.5. The theory forbids something to happen, and that is put to the test. A positive outcome of this experiment would have falsified the theory.
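The contrast between the activation principle and a cascading alternative can be stated as a minimal sketch. The lemma activation values, the spreading rate, and the function name are hypothetical; the sketch shows only which word forms receive any activation under each assumption, not how lexical decision latencies arise.

```python
def word_form_activation(lemma_activation, selected, cascading, rate=0.5):
    """Activation passed from lemmas to their word forms.

    cascading=False: only the selected lemma activates its word form
                     (the principle stated above).
    cascading=True:  every active lemma passes activation on,
                     whether selected or not.
    """
    forms = {}
    for lemma, activation in lemma_activation.items():
        if cascading or lemma == selected:
            forms[lemma] = rate * activation
        else:
            forms[lemma] = 0.0
    return forms


# Picture of a sheep: the target lemma is selected, goat is co-activated.
lemmas = {"sheep": 1.0, "goat": 0.6}  # hypothetical activation levels

discrete = word_form_activation(lemmas, selected="sheep", cascading=False)
cascade = word_form_activation(lemmas, selected="sheep", cascading=True)

# A probe such as goal can only be affected if the form of goat is active.
print("discrete:", discrete["goat"])
print("cascade: ", cascade["goat"])
```

With the principle in force, the word form of goat stays at zero, so a probe related to it (goal) should show no effect. A cascading architecture predicts some activation unless further mechanisms suppress it.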

There have been two kinds of reaction to the principle and to our empirical evidence in its support. The first one was computational, the second one experimental. The computational reaction, by Harley (1993), addressed the issue whether this null-result could be compatible with a connectionist architecture in which activation cascades, independent of lexical selection. We had, on various grounds, argued against such an architecture. The only serious argument in favor of interactive activation models had been their ability to account for a range of speech error phenomena, in particular the alleged statistical overrepresentation of so-called mixed errors, i.e., errors that are both semantically and phonologically related to the target (e.g., a speaker happens to say rat instead of cat). In fact, Dell's (1986) original model was, in part, designed to explain precisely this fact in a simple and elegant way. Hence, we concluded our paper with the remark that, maybe, it is possible to choose some connectionist model's parameters in such a way that it can both be reconciled with our negative findings and still account for the crucial speech error evidence. Harley (1993) took up that challenge and showed that his connectionist model (which differs rather substantially from Dell's, in particular in that it has inhibitory connections both within and between levels) can be parameterized in such a way as to produce our null-effect and still account - in principle - for the crucial mixed errors. That is an existence proof, and we accept it. But it doesn't convince us that it is the way to go theoretically. The model precisely has the baroque properties mentioned above. It first activates the word forms of all semantic alternatives and then actively suppresses this activation by mutual inhibition. Again, the only serious reason for such a design is the explanation of speech error statistics and we will return to that argument below.

The experimental reaction has been a head-on attack on the principle, i.e., to show that active semantic alternatives are phonologically activated. In a remarkable paper, Peterson and Savoy (in press) demonstrated this to be the case for a particular class of semantic alternatives, namely (near-)synonyms. Peterson and Savoy's method was similar to ours in 1991, but they replaced lexical decision by word naming. Subjects were asked to name a series of pictures. But in half the cases they had to perform a secondary task. In these cases, a printed word appeared in the picture shortly after picture onset (at different stimulus onset asynchronies, SOAs) and the secondary task was to name that printed word. That distractor word could be semantically or phonologically related to the target picture name or phonologically related to a semantic alternative. And there were controls, distractors that were neither semantically nor phonologically related to target or alternative. In a first set of experiments, Peterson and Savoy used synonyms as semantic alternatives. For instance, the subject would see a picture of a couch. Most subjects call this a couch, but a minority calls it a sofa. Hence, there is a dominant and a subordinate term for the same object. That was true for all 20 critical pictures in the experiment. On average, the dominant term was used 84% of the time. Would the subordinate term (sofa in the example) become phonologically active at all, maybe as active as the dominant term? In order to test this, Peterson and Savoy used distractors that were phonologically related to the subordinate term (e.g. soda for sofa) and compared their behavior to distractors related to the target (e.g. count for couch). The results were unequivocal. For SOAs ranging from 100 to 400 ms, the naming latencies for the two kinds of distractor were equally, and substantially, primed. Only at SOA = 600 ms did the subordinate's phonological priming disappear. This clearly violates the principle: Both synonyms are phonologically active, not just the preferred one (i.e., the one that the subject was probably preparing), and initially they are equally active.

In a second set of experiments, Peterson and Savoy tested the phonological activation of non-synonymous semantic alternatives, such as bed for couch (here the phonological distractor would be bet). This, then, was a straight replication of our experiment. And so were the results. There was not the slightest phonological activation of these semantic alternatives, just as we had found. Peterson and Savoy's conclusion was that there was only multiple phonological activation of actual picture names. Still, as Peterson and Savoy argue, that finding alone is problematic for the above principle and supportive of cascading models.

Recently, Jescheniak and Schriefers (submitted) independently tested the same idea in a picture-word interference task. When the subject was naming a picture (for instance of a couch) and received a phonological distractor word related to a synonym (for instance soda), there was measurable interference with naming. The naming latency was longer in this case than when the distractor was unrelated to the target or its synonym (for instance figure). This supports Peterson and Savoy's findings.

What are we to make of this? Clearly, our theory has to be modified, but how? There are several ways to go. One is to give up the principle entirely. But that would be an overreaction, given the fact that multiple phonological activation has only been shown to exist for synonyms. Any other semantic alternative that is demonstrably semantically active has now been repeatedly shown to be phonologically entirely inert. One can argue that it is phonologically active nevertheless, as both Harley and Peterson & Savoy do, but just unmeasurably so. Our preference is a different tack. In his account of word blends, Roelofs (1992a) suggested that "they might occur when two lemma nodes are activated to an equal level, and both get selected ... the selection criterion in spontaneous speech (i.e., select the highest activated lemma node of the appropriate syntactic category) is satisfied simultaneously by two lemma nodes... This would explain why these blends mostly involve near-synonyms...". The same notion can be applied to the findings under discussion. In the case of near-synonyms it will often be the case that both lemmas are activated to a virtually equal level. Especially under time pressure, the indecision will be solved by selecting both lemmas[9]. Following the above principle, this will then lead to activation of both word forms. If both lemmas are indeed about equally active (i.e., have about the same word frequency, as was indeed the case for Peterson and Savoy's materials), one would expect that, upon their joint selection, both word forms will be equally activated as well. And this is exactly what Peterson and Savoy showed to be the case for their stimuli. Initially, for SOAs of 50 to 400 ms, the dominant and subordinate word forms were equally active indeed. Only by SOA = 600 ms did the dominant word form take over[10].
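The proposed modification can be sketched as a selection criterion that is satisfied by every lemma whose activation is virtually equal to the maximum. The tolerance value, the activation levels, and the function name are hypothetical assumptions made for illustration; the theory itself does not specify how close "virtually equal" is.

```python
def select_lemmas(activation, tolerance=0.05):
    """Select the most active lemma, together with any lemma whose
    activation lies within `tolerance` of that maximum (multiple selection)."""
    top = max(activation.values())
    return sorted(lemma for lemma, a in activation.items() if top - a <= tolerance)


# Near-synonyms: both lemmas are activated to a virtually equal level.
synonyms = {"couch": 1.00, "sofa": 0.97}

# Non-synonymous semantic alternative: clearly less active than the target.
alternative = {"couch": 1.00, "bed": 0.50}

print(select_lemmas(synonyms))     # both lemmas satisfy the criterion
print(select_lemmas(alternative))  # only the target is selected
```

Combined with the principle that only selected lemmas become phonologically activated, this yields two active word forms for couch and sofa, but a phonologically inert bed.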

Is multiple selection necessarily restricted to near-synonyms? There is no good reason to suppose it is. Peterson and Savoy talk about multiple activation of "actual picture names". We rather propose the notion "appropriate picture names". As we discussed in Section 4.2, what is appropriate depends on the communicative context. There is no hard-wired connection between percepts and lexical concepts. It may, under certain circumstances, be equally appropriate to call an object either flower or rose. In that case, the two lemmas will compete for selection although they are not synonyms, and multiple selection may occur.

A final recent argument for activation spreading from non-selected lemmas stems from a study by Cutting & Ferreira (submitted). In their experiment subjects named pictures of objects whose names were homophones, such as a (toy) ball. When an auditory distractor was presented with a semantic relation to the other meaning of the homophone, such as "dance" in the example, picture naming was facilitated. The authors' interpretation is that the distractor ("dance") activates the alternative (social event) ball lemma in the production network. This lemma, in turn, spreads activation to the shared word form <ball> and hence facilitates naming of the "ball" picture. In other words, not only the selected ball1 lemma, but also the non-selected ball2 sends activation to the shared <ball> word form node. These nice findings, however, do not exclude another possible explanation. The distractor "dance" will semantically and phonologically co-activate its associate "ball" in the perceptual network. Given assumption 1 in Section 3.2.4, this will directly activate the word form node in the production lexicon.

6.1.2 Do selected word forms feed back to the lemma level?

Preserving the accessing principle makes it theoretically impossible to adopt Dell's (1986, 1988) approach to the explanation of the often observed statistical overrepresentation of mixed errors (such as saying rat when the target is cat). That there is such a statistical overrepresentation is a well-assured fact since the recent paper by Martin et al. (1996). In that study 60 healthy controls and 29 aphasic speakers named a set of 175 pictures. Crucial here are the data for the former group. The authors carefully analyzed their occasional naming errors and found that when a semantic error was made there was an above-chance probability that the first or second phoneme of the error was shared with the target. This above-chance result could not be attributed to phonological similarities among semantically related words. In this study the old, often hotly debated factors such as perceiver bias, experimental induction or set effects couldn't have produced the result. Clearly, the phenomenon is real and robust (see also Rossi & Defare, 1995).

The crucial mechanism that Dell (1986, 1988), Martin et al. (1996) and Dell et al. (in press) proposed for the statistical overrepresentation of mixed errors is feedback from the word form nodes to the lemma nodes. For instance, when the lemma cat is active, the morpheme <cat> and its segments /k/, /æ/ and /t/ become active. The latter two segments feed part of their activation back to the lemma rat, which may already be active because of its semantic relation to cat. This increases the probability of selecting rat instead of the target cat. For a word such as dog, there is no such phonological facilitation of a semantic substitution error, because the segments of cat will not feed back to the lemma of dog. Also, the effect will be stronger for rat than for a semantically neutral phonologically related word, such as mat, which is totally inactive to start with. This mechanism is ruled out by our activation principle, because form activation follows selection, hence feedback cannot affect the selection process. We will not rehearse the elaborate discussions that this important issue has raised (Levelt et al., 1991a, b; Dell & O'Seaghdha, 1991, 1992; Harley, 1993). Only two points are relevant here. The first one is that, till now, there is no reaction time evidence for this proposed feedback mechanism. The second one is that there are alternative explanations possible for the statistical effects, in particular the case of mixed errors. Some of those were discussed in Levelt et al. (1991a). They were, essentially, self-monitoring explanations going back to the experiments by Baars et al. (1975), which showed that speakers can prevent the overt production of internally prepared indecent words, nonwords, or other output that violates general or task-specific criteria (more on this in Section 10).
But in addition, it turns out that in WEAVER++, slightly modified to produce errors, mixed errors become overrepresented as well (see Section 10) and this doesn't require feedback. Hence, although the mixed error case has now been empirically established beyond reasonable doubt, it cannot be a decisive argument for the existence of feedback from the form to the lemma level.
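For concreteness, the feedback account that is rejected here can be sketched as a toy computation. The sketch illustrates the mechanism proposed by Dell and colleagues, not WEAVER++; the segment inventory, the semantic activation values and the `feedback_weight` are assumptions chosen purely for illustration.

```python
# Hypothetical segment inventory for four lemmas (broad transcription).
SEGMENTS = {
    "cat": ["k", "æ", "t"],
    "rat": ["r", "æ", "t"],
    "mat": ["m", "æ", "t"],
    "dog": ["d", "ɔ", "g"],
}


def competitor_activation(target, semantic, feedback_weight=0.1):
    """Activation of non-target lemmas under form-to-lemma feedback.

    semantic: dict of hypothetical activation that competitors receive from
              their semantic relation to the target (absent means 0.0).
    """
    target_segments = set(SEGMENTS[target])
    result = {}
    for lemma, segs in SEGMENTS.items():
        if lemma == target:
            continue
        # Each segment shared with the target feeds activation back.
        shared = sum(1 for s in segs if s in target_segments)
        result[lemma] = semantic.get(lemma, 0.0) + feedback_weight * shared
    return result
```

With `competitor_activation("cat", {"rat": 0.5, "dog": 0.5})`, the mixed competitor rat ends up more active than the purely semantic competitor dog, which in turn is more active than the purely phonological competitor mat. This is the ordering the feedback account needs in order to predict an overrepresentation of mixed errors.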

6.1.3 The word frequency effect

One of the most robust findings in picture naming is the word frequency effect, discovered by Oldfield and Wingfield (1965). Producing an infrequent name (such as broom) is substantially slower than producing a frequent name (such as boat). From an extensive series of experiments (Jescheniak & Levelt, 1994) it appeared that the effect arises at the level of accessing word forms. Demonstrating this required exclusion of all other levels of processing in the theory (see Figure 1). This was relatively easy for pre- and post-lexical levels of processing, but harder for the two major levels of lexical access, lemma selection and word form access. The pre-lexical level was excluded by using Wingfield's (1968) procedure. If the frequency effect arises in accessing the lexical concept, given the picture, it should also arise in a recognition task in which the subject is given a lexical concept (for instance "boat") and has to verify the upcoming picture. There was no frequency effect, either in the "yes" or in the "no" responses. This does not mean, of course, that infrequent objects are as easy to recognize as frequent objects, but only that for our pictures, where this was apparently well-controlled, there is still a full-fledged word frequency effect[11]. Hence, that must arise at a different level. Similarly, a late level of phonetic-articulatory preparation could be excluded. The word frequency effect always disappeared in delayed naming tasks.

The main argument for attributing the word frequency effect to word form access rather than to lemma selection stemmed from an experiment in which subjects produced homophones. Homophones are different words that are pronounced the same. Take more and moor. In our theory they differ at the lexical concept level and at the lemma level, but they share their word form (though maybe not in all dialects of English). In network representation:

    MORE        MOOR        conceptual level
      |           |
    more        moor        lemma level
       \         /
         <m_r>              word form level

The adjective more is a high-frequency word, whereas the noun moor is low-frequency. The crucial question now is whether low-frequency moor will behave like other, non-homophonous, low-frequency words (such as marsh), or rather like other, non-homophonous high-frequency words (such as much). If word frequency is coded at the lemma level, the low-frequency homophone moor should be as hard to access as the equally low-frequency non-homophone marsh. If, however, the word frequency effect is due to accessing the word form, one should, paradoxically, predict that a low-frequency homophone such as moor will be accessed just as fast as its high-frequency twin more, because they share the word form. Jescheniak and Levelt (1994) tested these alternatives in an experiment where subjects produced low-frequency homophones (such as moor), as well as frequency-matched low-frequency non-homophones (such as marsh). In addition, there were high-frequency non-homophones, matched to the homophone twin (such as much, which is frequency-matched to more). How can one have a subject produce a low-frequency homophone? This was done by means of a translation task. The Dutch subjects, with good mastery of English, were presented with the English translation equivalent of the Dutch low-frequency homophone. As soon as the word appeared on the screen, they were to produce the Dutch translation and the reaction time was measured. The same was done for the high- and low-frequency non-homophonous controls. In this task, reaction times are also affected by the speed of recognizing the English word. This recognition speed was independently measured in an animateness decision task. All experimental items were inanimate terms, but an equal number of fillers were animate words. The same subjects performed the push-button animateness decision task on the English words one week after the main experiment.
Our eventual data were the difference scores, naming latency (for the Dutch response word) minus semantic decision latency (for the English stimulus word). A summary of the findings is presented in Figure 9.

We obtained the paradoxical result. The low-frequency homophones (such as moor) were statistically as fast as the high-frequency controls (such as much) and substantially faster than the low-frequency controls (such as marsh). This shows that a low-frequency homophone inherits the fast access speed of its high-frequency partner. In other words, the frequency effect arises in accessing the word form, rather than the lemma.
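The logic of the two competing predictions can be made explicit in a small sketch. The frequency counts below are invented for illustration, and the assumption that a shared word form pools the usage of all lemmas mapping onto it is one simple way of expressing "inheritance"; neither is taken from the original study.

```python
# Hypothetical occurrences per million; real values would come from a corpus.
LEMMA_FREQUENCY = {"more": 2000, "much": 2000, "moor": 5, "marsh": 5}

# Homophones share one word form node, as in the network diagram above.
WORD_FORM = {"more": "<m_r>", "moor": "<m_r>", "much": "<much>", "marsh": "<marsh>"}


def effective_frequency(lemma, locus):
    """Frequency that governs access speed if the effect is coded at `locus`."""
    if locus == "lemma":
        return LEMMA_FREQUENCY[lemma]
    if locus == "form":
        # Assumption: a form's frequency pools all lemmas that map onto it.
        shared = WORD_FORM[lemma]
        return sum(f for l, f in LEMMA_FREQUENCY.items() if WORD_FORM[l] == shared)
    raise ValueError(f"unknown locus: {locus}")


def difference_score(naming_latency_ms, decision_latency_ms):
    # Dependent measure: naming latency minus semantic decision latency.
    return naming_latency_ms - decision_latency_ms
```

Under a lemma locus, moor and marsh have the same effective frequency and should be equally slow. Under a form locus, moor patterns with much rather than with marsh, which is the paradoxical pattern reported in the text.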

A related homophone effect has been obtained with speech errors. Earlier studies of sound-error corpora had already suggested that slips of the tongue occur more often on low-frequency words than on high-frequency ones (e.g., Stemberger & MacWhinney, 1986). That is, segments of frequent words tend not to be misordered. Dell (1990) showed experimentally that low-frequency homophones adopt the relative invulnerability to errors of their high-frequency counterparts, completely in line with the above findings. Also in line with these results are Nickels' (1995) data from aphasic speakers. She observed an effect of frequency on phonological errors (i.e., errors in word-form encoding) but no effect of frequency on semantic errors (i.e., errors in conceptually driven lemma retrieval). These findings suggest that the locus of the effect of frequency on speech errors is the form level.

There are at least two ways of modeling the effect, and we have no special preference. Jescheniak and Levelt (1994) proposed to interpret it as the word form's activation threshold, low for high-frequency words and high for low-frequency words. Roelofs (in press-a) implemented the effect by varying the items' verification times as a function of frequency. Remember that, in the model, each selection must be licensed; this can take a varying amount of verification time.

Estimates of word frequency tend to correlate with estimates of age of acquisition of the words (e.g., Carroll & White, 1973; Morrison et al., 1992; Snodgrass & Yuditsky, 1996). While some researchers found an effect of word frequency on the speed of object naming over and above the effect of age of acquisition, others have argued that it is age of acquisition alone that affects object naming time. In most studies, participants were asked to estimate at what age they first learned the word. It is not unlikely, however, that word frequency "contaminates" such judgments. When more objective measures of age of acquisition are used, however, it still is a major determinant of naming latencies. Still, some studies do find an independent contribution of word frequency (see, for instance, Brysbaert, 1996). Probably, both factors contribute to naming latency. Morrison et al. (1992) compared object naming and categorization times and argued that the effect of age of acquisition arises during the retrieval of the phonological forms of the object names. This is, of course, exactly what we claim to be the case for word frequency. Pending more definite results, we will assume that both age of acquisition and word frequency affect picture naming latencies and that they affect the same processing step, i.e., accessing the word form. Hence, in our theory they can be modelled in exactly the same way, either as activation thresholds or as verification times (see above). Because the independent variable in our experiments has always been CELEX word frequency [12], we will keep indicating the resulting effect by "word frequency effect". We do acknowledge, however, that the experimental effect is probably, in part, an age of acquisition effect.

The effect is quite robust, in that it is preserved over repeated namings of the same pictures. Jescheniak and Levelt (1994) showed this to be the case for three consecutive repetitions of the same pictures. In a recent study (Levelt et al., submitted), we tested the effect over 12 repetitions. The items tested were the 21 high-frequency and 21 low-frequency words from the original experiment that were monosyllabic. Figure 10 presents the results. The subjects had inspected the pictures and their names before the naming experiment began. The 31 ms word frequency effect was preserved over the full range of 12 repetitions.

6.2 Creating Phonological Words

The main task across the rift in our system is to generate the selected word's articulatory gestures in its phonological/phonetic context. This contextual aspect of word form encoding has long been ignored in production studies, which led to a curious functional paradox.

6.2.1 A functional paradox

All classical theories of phonological encoding have, in some way or another, adopted the notion that there are frames and fillers (Fromkin, 1971; Garrett, 1975; Shattuck-Hufnagel, 1979; Dell, 1986, 1988). The frames are metrical units, such as word or syllable frames. The fillers are phonemes or clusters of phonemes that are inserted into these frames during phonological encoding. There are not only good linguistic reasons for such a distinction between structure and content, but speech error evidence seems to support the notion that constituency is usually respected in such errors. In "mell wade" (for well made) two word/syllable onsets are exchanged, in "bud beggs" (for bed bugs) two syllable nuclei are exchanged and in "god to seen" (for gone to seed) two codas are exchanged (from Boomer & Laver, 1968). This type of evidence has led to the conclusion that word forms are not retrieved from the mental lexicon as unanalyzed wholes, but rather as sublexical and subsyllabic units, which are to be positioned in structures (such as word and syllable skeletons) that are independently available (Meyer, 1997, calls this the "Standard Model" in her review of the speech error evidence). Apparently, when accessing a word's form, the speaker retrieves both structural and segmental information. Subsequently, the segments are inserted in, or attached to, the structural frame, which produces their correct serial ordering and constituent structure, somewhat like this:

           word form             memory code
           /       \
(1)  segments     frame          retrieved from memory
           \       /
           word form             encoded word

Shattuck-Hufnagel, who was the first to propose a frame-filling processing mechanism (the "scan copier") that could account for much of the speech error evidence, right away noticed the paradox in her 1979 paper: "perhaps its [the scan copier's] most puzzling aspect is the question of why a mechanism is proposed for the one-at-a-time serial ordering of phonemes when their order is already specified in the lexicon" (p. 338). Or, to put the paradox in more general terms: What could be the function of a mechanism that independently retrieves a word's metrical skeleton and its phonological segments from lexical memory and subsequently reunifies them during phonological encoding? It can hardly be to create the appropriate speech errors.

The paradox vanishes when the contextual aspect of phonological encoding is taken seriously. Speakers do not generate lexical words, but phonological words. And it is the phonological word, not the lexical word, that is the domain of syllabification (Nespor & Vogel, 1986). For example, in Peter doesn't understand it the syllabification of the phrase understand it does not respect lexical boundaries, i.e., it is not un-der-stand-it. Rather, it becomes un-der-stan-dit, where the last syllable, dit, straddles the lexical word boundary between understand and it. In other words, the segments are not inserted in a lexical word frame, as (1) suggests, but in a larger phonological word frame. And what will become a phonological word frame is context-dependent. The same lexical word understand will be syllabified as un-der-stand in the utterance Peter doesn't understand. Small, unstressed function words, such as it, her, him, on, are pro- or encliticized to adjacent content words if syntax allows. Similarly, the addition of inflections or derivations creates phonological word frames that exceed stored lexical frames. In understanding, lexical word-final d syllabifies with the inflection: un-der-stan-ding; the phonological word (ω) exceeds the lexical word. One could argue (as is done in Levelt, 1989) that in such a case the whole inflected form is stored as a lexical word. But this is quite probably not the case for a derivation such as understander, which the speaker will unhesitatingly syllabify as un-der-stan-der.

Given these and similar phonological facts, the functional significance of independently retrieving a lexical word's segmental and metrical information becomes apparent. The metrical information is retrieved for the construction of phonological word frames in context. This often involves combining the metrics of two or more lexical words, or of a lexical word and an inflectional or derivational affix. Spelled-out segments are not inserted in retrieved lexical word frames, but in computed phonological word frames (but see Section 6.2.4 for further qualifications). Hence, diagram (1) should be replaced by (2):

     word/morpheme form       word/morpheme form       memory code
         /      \                 /      \
(2)  segments   frame          frame   segments        retrieved from memory
         \         \            /         /
          \       phon. word frame       /             computed ω frame
           \             |              /
          syllabified phonological word                encoded phonol. word

In fact, the process can involve any number of stored lexical forms.

Although replacing (1) by (2) removes the functional paradox, it doesn't yet answer the question why speakers do not simply concatenate fully syllabified lexical forms, i.e., say things such as un-der-stand-it or e-scort-us. This would have the advantage for the listener that each morpheme boundary will surface as a syllable boundary. But speakers have different priorities. They are in the business of generating high-speed syllabic gestures. As we suggested in Section 3.1.2, late, context-dependent syllabification contributes to the creation of maximally pronounceable syllables. In particular, there is a universal preference for allocating consonants to syllable onset positions, to build onset clusters that increase in sonority, and to produce codas of decreasing sonority (see especially Vennemann, 1988).
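The difference between concatenating stored syllabifications and syllabifying the phonological word as a whole can be illustrated with a minimal onset-maximizing syllabifier. This is a sketch under strong simplifying assumptions, not the syllabification procedure of WEAVER++: letters stand in for phonological segments, the vowel set is orthographic, and `LEGAL_CLUSTER_ONSETS` is a deliberately tiny, hypothetical inventory of permissible onset clusters.

```python
VOWELS = set("aeiou")
# Hypothetical, deliberately tiny inventory of legal multi-consonant onsets.
LEGAL_CLUSTER_ONSETS = {"st", "str", "sc", "pl", "pr", "tr"}


def is_legal_onset(cluster):
    # Empty onsets and single consonants are always allowed in this sketch.
    return len(cluster) <= 1 or cluster in LEGAL_CLUSTER_ONSETS


def syllabify(segments):
    """Onset-maximizing syllabification of one phonological word."""
    nuclei = [i for i, s in enumerate(segments) if s in VOWELS]
    if not nuclei:
        return [segments]
    boundaries = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = segments[prev + 1:nxt]
        # Give the next syllable the longest legal onset; the rest is coda.
        split = next(k for k in range(len(cluster) + 1)
                     if is_legal_onset(cluster[k:]))
        boundaries.append(prev + 1 + split)
    boundaries.append(len(segments))
    return [segments[a:b] for a, b in zip(boundaries, boundaries[1:])]


def phonological_word(*lexical_words):
    # Cliticization: segments of adjacent lexical words form one domain.
    return "-".join(syllabify("".join(lexical_words)))
```

Here `phonological_word("understand")` yields `un-der-stand`, whereas `phonological_word("understand", "it")` yields `un-der-stan-dit` and `phonological_word("escort", "us")` yields `e-scor-tus`: the word-final consonant is allocated to the onset of the following syllable once the clitic joins the domain.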

So far our treatment of phonological word formation has followed the standard theory, except that the domain of encoding is not the lexical word or morpheme but the phonological word, ω. The fact that this domain differs from the lexical domain in the standard theory resolves the paradox that always clung to it. But now we have to become more specific on segments, metrical frames and the process of their association. It will become apparent that our theory of phonological encoding differs in two further important aspects from the standard theory. The first difference concerns the nature of the metrical frames and the second one concerns the lexical specification of these frames. In particular we will argue that, different from the standard theory, metrical frames do not specify syllable-internal structure and that there are no lexically specified metrical frames for words adhering to the default metrics of the language - at least for stress assigning languages such as Dutch and English. In the following we will first discuss the nature of the ingredients of phonological encoding, segments and frames, and then turn to the association process itself.

6.2.2 The segments

Our theory follows the standard model in that the stored word forms are decomposed into abstract phoneme-sized units. This assumption is based on the finding that segments are the most common error units in sound errors; 60 to 90% of all sound errors are single-segment errors (see, for instance, Berg, 1988; Boomer & Laver, 1968; Fromkin, 1971; Nooteboom, 1969; Shattuck-Hufnagel, 1983; Shattuck-Hufnagel & Klatt, 1979). This does not deny the fact that other types of error units are also observed. There are, on the one hand, consonant clusters that move as units in errors; about 10 to 30% of sound errors are of this sort. They almost always involve word onset clusters. Berg (1989) showed that such moving clusters tend to be phonologically coherent, in particular with respect to sonority. Hence it may be necessary to allow for unitary spell-out of coherent word onset clusters, as proposed by Dell (1986) and Levelt (1989). There is, on the other hand, evidence for the involvement of sub-segmental phonological features in speech errors (Fromkin, 1971) as in a slip like glear plue sky. They are relatively rare, accounting for less than 5% of the sound form errors. But there is a much larger class of errors in which target and error differ in just one feature (e.g. Baris instead of Paris). Are they segment or feature errors? Shattuck-Hufnagel and Klatt (1979) and Shattuck-Hufnagel (1983) have argued that they should be considered as segment errors (but see Browman & Goldstein, 1990; Meyer, 1997). Is there any further reason to suppose that there is feature specification in the phonological spell-out of segments? Yes there is. First, there is the robust finding that targets and errors tend to share most of their features (Nooteboom, 1969; Fromkin, 1971; García-Albea et al., 1989; Garrett, 1975). 
Second, Stemberger (1983, 1991a, b), Stemberger and Stoel-Gammon (1991) and also Berg (1991) have provided evidence for the notion that spelled-out segments are specified for some features but unspecified for others. Another way of putting this is that the segments figuring in phonological encoding are abstract. Stemberger et al.'s analyses show that asymmetries in segment interactions can be explained by reference to feature (under)specification. In particular, segments that are, on independent linguistic grounds, specified for a particular feature tend to replace segments that are unspecified for that feature. This is true even though the feature-unspecified segment is usually the more frequent one in the language. Stemberger views this as an "addition bias" in phonological encoding. We sympathize with Stemberger's notion that phonological encoding proceeds from spelling out rather abstract, not fully specified segments to a further, context-dependent filling in of features (cf., Meyer, 1997), though we have not yet modeled it in any detail. This means at the same time that we don't agree with Mowrey and MacKay's (1990) conclusion that there are no discrete underlying segments in phonological encoding, but only motor programs to be executed. If two such programs are active at the same time, all kinds of interaction can occur between them. Mowrey and MacKay's EMG data indeed suggested that these are not whole-unit all-or-none effects. But as the authors noted themselves, such data are still compatible with the Standard Model. Nothing in that model excludes the possibility that errors also arise at a late stage of motor execution. It will be quite another, and probably impracticable, thing to show that all sound error patterns can be explained in terms of motor pattern interactions.

6.2.3 The metrical frames

As mentioned, our theory deviates from the Standard Model in terms of the nature of the metrical frames. The traditional story is based on the observation that interacting segments in sound errors typically stem from corresponding syllable positions: onsets exchange with onsets, nuclei with nuclei, and codas with codas. This "syllable-position constraint" has been used to argue for the existence of syllable frames, i.e., metrical frames that specify for syllable positions, onset, nucleus, and coda. Spelled out segments are correspondingly marked with respect to the positions they may take (onset, etc.). Segments that can appear in more than one syllable position (which is true for most English consonants) must be multiply represented with different position labels. The evidence from the observed syllable-position constraint is, however, not really compelling. Shattuck-Hufnagel (1985, 1987, 1992) has pointed out that more than 80% of the relevant cases in the English corpora that have been analyzed are errors involving word onsets (see also Garrett, 1975, 1980). Hence, this seems to be a word onset property in the first place, not a syllable onset effect. English consonantal errors not involving word onsets are too rare to be analyzed for adherence to a positional constraint. That vowels tend to exchange with vowels must hold for the simple reason that usually no pronounceable string will result from a vowel → consonant replacement. Also, most of the positional effects other than word onset effects follow from a general segment similarity constraint: Segments tend to interact with phonemically similar segments. In short, there is no compelling reason from the English sound error evidence to assume the existence of spelled-out syllabic frames. Moreover, such stored lexical syllable frames should be frequently broken up in the generation of connected speech, for the reasons discussed in Section 6.2.1 above.

Things may be different in other languages. Analyzing a Germ