Below is the unedited penultimate draft of:

Page Mike (2000) Connectionist Modelling in Psychology: A Localist Manifesto
Behavioral and Brain Sciences 23 (4): XXX-XXX.

This is the unedited penultimate draft of a BBS target article that has been accepted for publication (Copyright 1999: Cambridge University Press U.K./U.S. -- publication date provisional) and is currently being circulated for Open Peer Commentary. This preprint is for inspection only, to help prospective commentators decide whether or not they wish to prepare a formal commentary. Please do not prepare a commentary unless you have received the hard copy, invitation, instructions and deadline information.


 


CONNECTIONIST MODELLING IN PSYCHOLOGY:
A LOCALIST MANIFESTO


Mike Page
Medical Research Council Cognition and Brain Sciences Unit,
15, Chaucer Rd.,
Cambridge, CB2 2EF,
U.K.

mike.page@mrc-cbu.cam.ac.uk
http://www.mrc-cbu.cam.ac.uk/


Abstract:

Over the last decade, fully-distributed models have become dominant in connectionist psychological modelling, whereas the virtues of localist models have been underestimated. This target article illustrates some of the benefits of localist modelling. Localist models are characterized by the presence of localist representations rather than the absence of distributed representations. A generalized localist model is proposed that exhibits many of the properties of fully distributed models. It can be applied to a number of problems that are difficult for fully distributed models and its applicability can be extended through comparisons with a number of classic mathematical models of behaviour. There are reasons why localist models have been underused and these are addressed. In particular, many conclusions about connectionist representation, based on neuroscientific observation, are called into question. There are still some problems inherent in the application of fully distributed systems and some inadequacies in proposed solutions to these problems. In the domain of psychological modelling, localist modelling is to be preferred.

KEYWORDS: connectionist modelling, neural networks, localist, distributed, competition, choice, reaction-time, consolidation.


1. Introduction

The aim of this target article is to demonstrate the power, flexibility and plausibility of connectionist models in psychology that use localist representations. I will take care to define the terms ``localist'' and ``distributed'' in the context of connectionist models and to identify the essential points of contention between advocates of each type of model. Localist models will be related to some classic mathematical models in psychology and some of the criticisms of localism will be addressed. This approach will be contrasted with a currently popular one in which localist representations play no part. The conclusion will be that the localist approach is preferable whether one considers connectionist models as psychological-level models, or as models of the underlying brain processes.

At the time of writing, it is thirteen years since the publication of ``Parallel Distributed Processing: Explorations in the Microstructure of Cognition'' (Rumelhart, McClelland and the PDP Research Group 1986). That two-volume set has had an enormous influence on the field of psychological modelling (among others) and justifiably so, having helped to revive widespread interest in the connectionist enterprise after the seminal criticisms of Minsky and Papert (1969). In fact, despite Minsky and Papert's critique, a number of researchers (e.g., S. Amari, K. Fukushima, S. Grossberg, T. Kohonen, C. von der Malsburg) had continued to develop connectionist models throughout the 1970s, often in directions rather different from that in which the 1980s ``revival'' later found itself heading. More specifically, much of the earlier work had investigated networks in which localist representations played a prominent role, whereas the style of modelling that received most attention as a result of the PDP research group's work was one that had at its centre the concept of distributed representation. It is more than coincidental that the word ``distributed'' found itself centrally located in both the name of the research group and the title of its major publication, but it is important to note that, in these contexts, the words ``parallel'' and ``distributed'' both refer to processing rather than to representation. Although it is unlikely that anyone would deny that processing in the brain is carried out by many different processors in parallel (i.e. at the same time) and that such processing is necessarily distributed (i.e. in space), the logic that leads from a consequent commitment to the idea of distributed processing, to an equally strong commitment to the related, but distinct, notion of distributed representation, is more debatable.
In this target article I hope to show that the thoroughgoing use of distributed representations, and the learning algorithms associated with them, is very far from being mandated by a general commitment to parallel distributed processing.

As indicated above, I will advocate a modelling approach that supplements the use of distributed representations (the existence of which, in some form, nobody could deny) with the additional use of localist representations. The latter have acquired a bad reputation in some quarters. This cannot be directly attributed to the PDP books themselves, in which several of the models were localist in flavour (e.g. interactive activation and competition models, competitive learning models). Nonetheless, the terms ``PDP'' and ``distributed'' on the one hand, and ``localist'' on the other, have come to be seen as dichotomous. I will show this apparent dichotomy to be false and will identify those issues over which there is genuine disagreement.

A word of caution: ``Neural networks'' have been applied in a wide variety of other areas in which their plausibility as models of cognitive function is of no consequence. In criticizing what I see to be the overuse (or default use) of fully distributed networks, I will accordingly restrict discussion to their application in the field of connectionist modelling of cognitive or psychological function. Even within this more restricted domain there has been a large amount written about the issues addressed here. Moreover, it is my impression that the sorts of things to be said in defence of the localist position will have occurred independently to many of those engaged in such a defence. I apologize in advance, therefore, for necessarily omitting any relevant references that have so far escaped my attention. No doubt the BBS commentary will set the record straight.

The next section will define some of the terms to be used throughout this paper. As will be seen, there are certain subtleties in such definitions that becloud the apparent clarity of the localist/distributed divide.

2. Defining Some Terms

2.1 Basic Terms

Before defining localist and distributed representations, we establish some more basic vocabulary. In what follows the word nodes will refer to the simple units out of which connectionist networks have traditionally been constructed. A node might be thought of as consisting of a single neuron or a distinct population of neurons (e.g. a cortical minicolumn). A node will be referred to as having a level of activation, where a loose analogy is drawn between this activation and the firing rate (mean or maximum firing rate) of a neuron (population). The activation of a node might lead to an output signal's being projected from it. The projection of this signal will be deemed to be along one or more weighted connections, where the concept of weight in some way represents the variable ability of output from one node to affect processing at a connected node. The relationship between the weighted input to a given node (i.e. those signals projected to it from other nodes), its activation, and the output which it in turn projects, will be summarized using a number of simple, and probably familiar, functions. All of these definitions are, I hope, an uncontroversial statement of the basic aspects of the majority of connectionist models.

2.2 Localist and Distributed Representations

The following definitions, drawn from the recent literature, largely capture the difference between localist and distributed representations. First, distributed representations:

``many neurons participate in the representation of each memory and different representations share neurons'' (Amit 1995, p.621).

``the model makes no commitment to any particular form of representation, beyond supposing that the representations are distributed; that is, each face, semantic representation, or name is represented by multiple units, and each unit represents multiple faces, semantic units or names.'' (Farah, O'Reilly & Vecera 1993, p.577)

The latter definition refers explicitly to a particular model of face naming, but the intended nature of distributed representations in general is clear. To illustrate the point, suppose we wished to represent the four entities ``John'', ``Paul'', ``George'' and ``Ringo''. Figure 1a shows distributed representations for these entities. Each representation involves a pattern of activation across four nodes and, importantly, there is overlap between the representations. For instance, the first node is active in the patterns representing both John and Ringo, the second node is active in the patterns representing both John and Paul, and so on. A corollary of this is that the identity of the entity that is currently represented cannot be unambiguously determined by inspecting the state of any single node.


  

Figure 1: Four names represented (a) in a distributed fashion, (b) in a localist fashion.


Now consider the skeleton of a definition of a localist representation, as contrasted with a distributed coding:

``...with a local representation, activity in individual units can be interpreted directly...with distributed coding individual units cannot be interpreted without knowing the state of other units in the network'' (Thorpe 1995, p.550).

As an example of a localist representation of our four entities, see Figure 1b. In such a representation, only one node is active for any given entity. As a result, activity at a given unit can unambiguously identify the currently represented entity.
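The contrast between the two coding schemes can be made concrete with a minimal sketch. The distributed patterns below are an assumption: the text fixes only that the first node is shared by John and Ringo and the second by John and Paul, and the remaining bits are filled in so that each adjacent pair of entities shares one node. The function names are hypothetical and serve only for illustration.

```python
# Assumed bit patterns. Only the overlaps on the first two nodes are stated
# in the text; the rest is a guess at what Figure 1a shows.
DISTRIBUTED = {
    "John":   (1, 1, 0, 0),
    "Paul":   (0, 1, 1, 0),
    "George": (0, 0, 1, 1),
    "Ringo":  (1, 0, 0, 1),
}

# One node per entity, as described for Figure 1b.
LOCALIST = {
    "John":   (1, 0, 0, 0),
    "Paul":   (0, 1, 0, 0),
    "George": (0, 0, 1, 0),
    "Ringo":  (0, 0, 0, 1),
}


def candidates_given_active_node(code, node):
    """Return the entities whose pattern has the given node active."""
    return [name for name, pattern in code.items() if pattern[node] == 1]


def single_node_is_sufficient(code):
    """True if every node, when active, picks out exactly one entity."""
    n_nodes = len(next(iter(code.values())))
    return all(
        len(candidates_given_active_node(code, node)) == 1
        for node in range(n_nodes)
    )
```

Under these assumed patterns, an active first node in the distributed code leaves both John and Ringo as candidates, whereas in the localist code every active node identifies a single entity.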

When nodes are binary (i.e., having activity 1 or 0), these definitions are reasonably clear. But how are they affected if activity can take, for example, any value between these limits? The basic distinction remains: in the localist model, it will still be possible to interpret the state of a given node independent of the states of other nodes. A natural way to ``interpret'' the state of a node embedded in a localist model would be to propose, as did Barlow (1972), a monotonic mapping between activity and confidence in the presence of the node's referent:

``the frequency of neural impulses codes subjective certainty: a high impulse frequency in a given neuron corresponds to a high degree of confidence that the cause of the percept is present in the external world'', (Barlow 1972, p.381)

It may be that the significance of the activation of a given node is assessed in relation to a threshold value, such that only super-threshold activations are capable of indicating nonzero confidence. Put another way, the function relating activation to ``degree of confidence'' would not necessarily be linear, or even continuously differentiable, in spite of being monotonic nondecreasing.
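One function with these properties can be sketched as follows. The particular form (zero up to a threshold, then rising linearly to 1) and the default threshold of 0.5 are assumptions chosen for illustration; the text requires only that the mapping be monotonic nondecreasing.

```python
def confidence(activation, threshold=0.5):
    """Map an activation in [0, 1] to a degree of confidence in [0, 1].

    Hypothetical form: confidence is zero at or below the threshold and
    rises linearly above it. The function is monotonic nondecreasing but
    has a kink at the threshold, so it is not differentiable there.
    """
    if not 0.0 <= activation <= 1.0:
        raise ValueError("activation must lie in [0, 1]")
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must lie in [0, 1)")
    if activation <= threshold:
        return 0.0
    return (activation - threshold) / (1.0 - threshold)
```

With the default threshold, any activation of 0.5 or less signals no confidence at all, while an activation of 1.0 signals full confidence.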

Having offered both Thorpe's and Barlow's descriptions of localist representation, it is worth pointing out that interpreting a node's activation as ``degree of confidence'' is potentially inconsistent with the desire to interpret a given node's activation ``directly,'' that is independent of the activation of other nodes. For example, suppose, in a continuous-activation version of Figure 1b, that two nodes have near maximal activity. In some circumstances we will be happy to regard this state as evidence that both the relevant referents are present in the world: in this case the interpretation of the node activations will conform to the independence assumption. In other cases, we might regard such a state as indicating some ambiguity as to whether one referent or the other is present. In these cases it is not strictly true to say that the degree of confidence in a particular referent can be assessed by looking at the activation of the relevant node alone, independent of that of other nodes. One option is to assume instead that activation maps onto relative degree of confidence, such that degree of activation is interpreted relative to that of other nodes. Although strictly inconsistent with Thorpe's desire for direct interpretation, this preserves what is essential about a localist scheme, namely that the entity about which relative confidence is being expressed is identified with a single node. Alternatively, both Thorpe's and Barlow's definitions can be simultaneously maintained if some competitive process is implemented directly (i.e. mechanically), such that it is impossible to sustain simultaneously high activations at two nodes whose interpretations are contradictory. A scheme of this type would, for example, allow two nodes to compete for activation so as to exclusively identify a single person.
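The first option, relative degree of confidence, might be sketched as a simple normalization. Dividing each activation by the summed activation is an assumption made here for illustration, since the text does not commit to a formula; the function name is likewise hypothetical.

```python
def relative_confidence(activations):
    """Interpret each node's activation relative to all the others.

    `activations` maps a referent's name to a non-negative activation.
    Each value is divided by the total (an assumed normalization), so the
    result still attaches one number to one node, and hence to one entity.
    """
    total = sum(activations.values())
    if total == 0:
        return {name: 0.0 for name in activations}
    return {name: value / total for name, value in activations.items()}


# Two nodes near maximal activity: read relatively, the state expresses
# ambiguity between John and Paul rather than the presence of both.
ambiguous = relative_confidence(
    {"John": 0.9, "Paul": 0.9, "George": 0.0, "Ringo": 0.0}
)
```

Note that each entry of the result depends on the whole pattern of activation, which is exactly the departure from direct interpretation that the preceding paragraph acknowledges.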

As an aside, note that a simple competitive scheme has some disadvantages. Such a scheme is apparently inadequate for indicating the presence of two entities, say, John and Paul, by strongly activating the two relevant nodes simultaneously. One solution to this apparent conundrum might be to invoke the notion of binding, perhaps implemented by phase relationships in node firing patterns (e.g. Hummel & Biederman 1992; Roelfsema, Engel, König & Singer 1996; Shastri & Ajjanagadde 1993). (Phase relationships are only one candidate means of perceptual binding and will be assumed here solely for illustrative purposes.) Thus, in the case in which we wish both John and Paul to be simultaneously indicated, both nodes can activate fully but out of phase with each other, thus diminishing the extent to which they compete. This out-of-phase relationship might stem from the fact that the two entities driving the system (John and Paul) must be in two different spatial locations, allowing them to be ``phased'' separately. In the alternative scenario, that is, when only one individual is present, the nodes representing alternative identifications might be in phase with each other, driven as they are by the same stimulus object, and would therefore compete as required.

A similar binding scheme might also be useful if distributed representations are employed. On the face of it, using the representations in Figure 1a, the pattern for John and George will be the same as that for Paul and Ringo. It may be possible to distinguish these summed patterns on the basis of binding relationships as before -- to represent John and George the first and second nodes would be in phase with each other while the third and fourth nodes would both be out of phase with the first two nodes and in phase with each other. But complications arise when we wish to represent, say, John and Paul: would the second node be in phase with the first node or the third? It is possible that in this case the second node would fire in both the phases associated with nodes one and three (though this would potentially affect its firing rate as well as its phase relationships). Mechanisms for binding are the focus of a good deal of ongoing research, so I shall not develop these ideas further here.
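The ambiguity of summed patterns, and the way phase grouping might resolve it, can be illustrated with a short sketch. The bit patterns are an assumed reading of Figure 1a (the text specifies only some of the overlaps), and representing a phase as a set of co-firing nodes is a deliberate simplification, not a model of any of the binding mechanisms cited above. Nodes are indexed from zero, so index 1 is the ``second node'' of the text.

```python
# Assumed patterns, consistent with the overlaps described for Figure 1a.
DISTRIBUTED = {
    "John":   (1, 1, 0, 0),
    "Paul":   (0, 1, 1, 0),
    "George": (0, 0, 1, 1),
    "Ringo":  (1, 0, 0, 1),
}


def superpose(code, names):
    """Element-wise sum of the patterns for the named entities."""
    return tuple(sum(bits) for bits in zip(*(code[name] for name in names)))


def active_nodes(pattern):
    """Indices of the nodes that are active in a pattern."""
    return frozenset(i for i, bit in enumerate(pattern) if bit)


def phase_groups(code, names):
    """One group of co-firing nodes per entity (one 'phase' each)."""
    return frozenset(active_nodes(code[name]) for name in names)


def nodes_in_several_phases(code, names):
    """Nodes that would have to fire in more than one phase."""
    groups = [active_nodes(code[name]) for name in names]
    return sorted(
        node
        for node in set().union(*groups)
        if sum(node in group for group in groups) > 1
    )
```

Under these assumptions the summed pattern for John and George equals that for Paul and Ringo, but the two cases yield different phase groups. For John and Paul, the node at index 1 belongs to both groups, which is the complication raised above.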

2.3 Grandmother Cells...

In a discussion of localist and distributed representations it is hard to avoid the subject of ``grandmother cells.'' The concept can be traced back to a lecture series delivered by Jerome Lettvin in 1969 (see Lettvin's appendix to Barlow 1995), in which he introduced, to a discussion on neural representation, an allegory, in which a neuroscientist located in the brains of his animal subjects

``some 18,000 neurons...that responded uniquely only to the animal's mother, however displayed, whether animate or stuffed, seen from before or behind, upside down or on a diagonal, or offered by caricature, photograph or abstraction.'' (from appendix to Barlow 1995)

The allegorical neuroscientist ablated the equivalent cells in a human subject, who, postoperatively, could not conceive of ``his mother,'' while maintaining a conception of mothers in general. The neuroscientist, who was intent on showing that ``ideas are contained in specific cells'', considered his position to be vulnerable to philosophical attack, and rued not having searched for grandmother cells instead, grandmothers being ``notoriously ambiguous and often formless''.

The term ``grandmother cell'' has since been used extensively in discussions of neural representation, though not always in ways consistent with Lettvin's original conception. It seems that (grand)mother cells are considered by some to be the necessary extrapolation of the localist approach and thereby to demonstrate its intrinsic folly. I believe this conclusion to be entirely unjustified. Whatever the relevance of Lettvin's allegory, it certainly does not demonstrate the necessary absurdity of (grand)mother cells and, even if it did, this would not warrant a similar conclusion regarding localist representations in general. Given the definitions so far advanced, it is clear that, while (grand)mother cells are localist representations, not all localist representations necessarily have the characteristics attributed by Lettvin to (grand)mother cells. This depends on how one interprets Lettvin's words ``responded uniquely'' (above). A localist representation of one's grandmother might respond partially, but subthreshold, to a similar entity (e.g. one's great aunt), thus violating one interpretation of the ``unique response'' criterion that forms part of the grandmother-cell definition.

2.4 ...and Yellow Volkswagen Cells

A related point concerns the ``yellow Volkswagen cells'' referred to by Harris (1980). Harris' original point, which dates back to a talk given in 1968, illustrated a concern regarding a potential proliferation in the types of selective cells hypothesized to be devoted to low-level visual coding. Such a proliferation had been suggested by experiments into, for instance, the ``McCollough Effect'' (McCollough 1965) which had led to the positing of detectors sensitive to particular combinations of orientation and colour. The message that has been extrapolated from Harris' observation is one concerning representational capacity: that while ``yellowness'' cells and ``Volkswagen cells'' may be reasonable, surely specific cells devoted to ``yellow Volkswagens'' are not. The fear is that if yellow VWs are to be locally represented then so must the combinatorially explosive number of equivalent combinations (e.g. lime-green Minis). There is something odd about this argument. In accepting the possibility of Volkswagen cells, it raises the question of why the fear of combinatorial explosion is not already invoked at this level. Volkswagens themselves must presumably be definable as a constellation of a large number of adjective-noun properties (curved roof, air-cooled engine etc.), and yet accepting the existence of Volkswagen cells does not presume a vast number of other cells, one for each distinct combination of feature-values in whatever feature-space VWs inhabit. On a related point, on occasions when the (extrapolated) yellow-VW argument is invoked, it is not always clear whether the supposed combinatorial explosion refers to the number of possible percepts, which is indeed unimaginably large, or to the vanishingly smaller number of percepts which are witnessed and, in some sense, worth remembering. Since the latter number is likely to grow only approximately linearly with lifespan, fears of combinatorial explosion are unwarranted.
It is perfectly consistent with the localist position that different aspects of a stimulus (e.g. colour, brand-name etc.) can be represented separately, and various schemes have been suggested for binding such aspects together so as to correctly represent, in the short term, a given scene (e.g. Hummel & Biederman 1992; Roelfsema, Engel, König & Singer 1996; see earlier). This systematicity (cf. Fodor & Pylyshyn 1988) in the perceptual machinery addresses the problem of combinatorial explosion regarding the number of possible percepts. It in no way implies, however, that in a localist model each possible percept must be allocated its own permanent representation, that is, its own node. A similar point was made by Hummel and Holyoak (1997) who noted that

``...it is not necessary to postulate the preexistence of all possible conjunctive units. Rather a novel binding can first be represented dynamically (in active memory), with a conjunctive unit created only when it is necessary to store the binding in LTM.'' (p.434)

It is entirely consistent with the localist position to postulate that cells encoding specific combinations will be allocated only when needed: perhaps in an experiment in which pictures of yellow VWs and red bikes require one response, while red VWs and yellow bikes require another (cf. XOR); or, more prosaically, in establishing the memory that one's first car was a yellow VW. When one restricts the number of localist representations to those sufficient to describe actual percepts of behavioural significance (i.e. those which require long-term memorial representation) the threat of combinatorial explosion dissipates. Later I shall show how new localist nodes can be recruited, as needed, for the permanent representation of previously unlearned configurations (cf. the constructivist learning of Quartz & Sejnowski 1997, and the accompanying commentary by Grossberg 1997; Valiant 1994).
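The idea that conjunctive nodes are allocated only when needed can be illustrated with a toy sketch of the yellow-VW/red-bike task. The class and its methods are hypothetical, and a dictionary lookup stands in for whatever learning process actually recruits a node; this is not the recruitment mechanism developed later in the article.

```python
class ConjunctionStore:
    """Toy store that recruits one localist node per learned conjunction."""

    def __init__(self):
        self._node_for = {}  # (colour, kind) -> index of the recruited node
        self._response = {}  # node index -> associated response

    def learn(self, colour, kind, response):
        """Recruit a node for this conjunction if none exists, then bind it."""
        key = (colour, kind)
        if key not in self._node_for:
            self._node_for[key] = len(self._node_for)
        self._response[self._node_for[key]] = response

    def respond(self, colour, kind):
        """Return the learned response, or None if no node was recruited."""
        node = self._node_for.get((colour, kind))
        return None if node is None else self._response[node]

    @property
    def n_nodes(self):
        """Number of conjunctive nodes recruited so far."""
        return len(self._node_for)


# The XOR-like task: the response depends on the conjunction, not on
# colour or object kind alone.
store = ConjunctionStore()
store.learn("yellow", "VW", "A")
store.learn("red", "bike", "A")
store.learn("red", "VW", "B")
store.learn("yellow", "bike", "B")
```

After training, four nodes exist, one per behaviourally significant conjunction. A lime-green Mini, never having required a stored response, has no node at all.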

2.5 Featural Representations

The above discussion of yellow VWs illustrates the issue of featural representation. A featural representation will be defined here as a representation comprising an array of localist nodes in appropriate states. Figure 2 shows the featural representations of Tony Blair, Glenda Jackson, Anthony Hopkins and Queen Elizabeth II, where the relevant features are ``is-a-woman,'' ``is-a-politician'' and ``is/was-a-film-actor.'' Clearly, the representations of these four entities are distributed, in the sense that the identity of the currently present entity cannot be discerned by examining the activity of any individual node. Nonetheless, the features themselves are locally represented (cf. ``is-yellow,'' ``is-a-Volkswagen''). Whether or not a politician is currently present can be decided by examining the activity of a single node, independent of the activation of any other node.
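A minimal sketch may help to show how a featural code is distributed with respect to identity yet localist with respect to each feature. The bit values below are an assumption, reconstructed from the feature names and from general knowledge of the four people rather than read off Figure 2; the function names are hypothetical.

```python
FEATURES = ("is-a-woman", "is-a-politician", "is/was-a-film-actor")

# Assumed feature values for the four people named in the text.
PEOPLE = {
    "Tony Blair":         (0, 1, 0),
    "Glenda Jackson":     (1, 1, 1),
    "Anthony Hopkins":    (0, 0, 1),
    "Queen Elizabeth II": (1, 0, 0),
}


def feature_present(pattern, feature):
    """Read one feature from its own node, ignoring every other node."""
    return pattern[FEATURES.index(feature)] == 1


def people_consistent_with(feature, value):
    """People still possible after inspecting a single feature node."""
    index = FEATURES.index(feature)
    return [name for name, pattern in PEOPLE.items() if pattern[index] == value]
```

Under these assumed values, inspecting any one node always leaves two candidate people, so identity cannot be read from a single node. The question of whether a politician is present, by contrast, is settled by the second node alone.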


  

Figure 2: Four persons represented in a featural fashion with regard to semantic information.


It is curious that researchers otherwise committed to the thoroughgoing use of distributed representations have been happy to use such featural representations. For instance, Farah et al. (1993), whose commitment to distributed representations was quoted earlier, used a distributed representation for semantic information relating to particular people. To continue the earlier quotation:

``The information encoded by a given unit will be some `microfeature'...that may or may not correspond to an easily labeled feature (such as eye color in the case of faces). The only units for which we have assigned an interpretation are the `occupation units' within the semantic pool. One of them represents the semantic microfeature `actor' and the other represents the semantic microfeature `politician'.'' (Farah et al. 1993, p.577)

It would be odd to be comfortable with the idea of nodes representing ``is-an-actor,'' and yet hostile to the idea of nodes representing ``is-Tony-Blair'' or ``is-my-grandmother''. If ``is-an-actor'' is a legitimate microfeature (though one wonders what is micro about it), then why is ``is-Tony-Blair'' not? Is there any independent rationale for what can and cannot be a microfeature? Moreover, to anticipate a later discussion, by what learning mechanism are the localist (micro)featural representations (e.g. ``is-an-actor'') themselves deemed to be established? The most natural assumption is that, at some level, local unsupervised featural learning is carried out. But a commitment to fully distributed representation of identity, if not profession, would therefore require that at some arbitrary stage just before the level at which identity features (e.g. ``is-Tony-Blair'') might emerge, a different, supervised learning mechanism cuts in.

Whether or not we choose to define featural representations as a subclass of distributed representations has little to do with the core of the localist/distributed debate. No localist has ever denied the existence of distributed representations, especially, but not exclusively, if these are taken to include featural representations. To do so would have entailed a belief that percepts ``go local'' in a single step, from retina directly to grandmother cell, for instance. The key tenet of the localist position is that, on occasion, localist representations of meaningful entities in the world (e.g. words, names, people, etc.) emerge and allow, among other things, distributed/featural patterns to be reliably classified and enduringly associated.

I should make clear that restricting the definition in the preceding paragraph to ``meaningful entities in the world'' is simply a rather clumsy way of avoiding potentially sterile discussions of how far localist representation extends down the perceptual hierarchy. To take a concrete example, one might ask whether an orientation column (OC) in the visual cortex should be considered a localist representation of line segments in a particular part of the visual field and at a particular angular orientation. An opponent of such a localist description might argue that in most everyday circumstances nothing of cognitive significance (nothing of meaning, if you like) will depend on the activation state of an individual OC and that later stages in the perceptual path will best be driven by a distributed pattern of activation across a number of OCs so as to preserve the information available in the stimulus. I am sympathetic to this argument -- there seems little point in describing a representation as localist if it is never interpreted in a localist manner. Nonetheless, to temper this conclusion somewhat, imagine an experiment in which a response is learned that depends on which of two small line segments, differing only in orientation, is presented. Assuming that such a discrimination is learnable, it does not seem impossible a priori that a connectionist model of the decision task would depend rather directly on the activations of specific OCs. (The issue is related to the decoding of population vectors, discussed briefly in section 4.3.1 and in the accompanying footnote.) I have not modelled performance in this rather contrived task and hence cannot say what should be concluded from such a model. One can simply note that certain models might lead to a more charitable view towards an interpretation that treated single OCs as localist representations. 
The general point is that a representation might be labelled localist or not depending on the particulars of the modelled task in which the corresponding nodes are taken to be involved. Whether one chooses to reserve the term localist for representations that are habitually involved in processes/tasks that highlight their localist character or, alternatively, whether one allows the term to apply to any representational unit that can at some time (albeit in unusual or contrived circumstances) usefully be treated as localist, is probably a matter of taxonomic taste. For fear of getting unnecessarily involved in such matters, I will retreat to using the term localist to refer, as above, to a form of representation of meaningful entities in the world whose localist character is habitually displayed. I do so in the hope and belief that, at least in the modelling of most types of cognitive-psychological task, it will be clear what the relevant meaningful entities are.

2.6 So What is a Localist Model?

Given the definitions of localist and distributed representations discussed so far, what are we to understand by the term ``a localist model''? The first and most crucial point, alluded to above, is that a localist model is not well defined as one that uses localist rather than distributed representations: localist models almost always use both localist and distributed representations. More explicitly, any entity that is locally represented at layer $n$ of a hierarchy is sure to be represented in a distributed fashion at layer $n-1$. To illustrate, take as an example the interactive activation (IA) model of visual word recognition (McClelland & Rumelhart 1981; Rumelhart and McClelland 1982) which is generally agreed to be localist. The model employs successive processing layers: In the ``lowest'' of these are visual-feature detectors, which respond selectively to line segments in various orientations; in the next layer are nodes which respond selectively to letters in various positions in a word; in the third, nodes which respond maximally to individual familiar words. Thus, a given word is represented locally in the upper layer and in a distributed fashion at the two previous layers. Letters-in-position are likewise represented locally in the second layer but in a distributed manner in the first layer. It accordingly makes no sense to define a localist model as one that precludes distributed representation. A better definition relies only on whether or not there are localist representations of the relevant entities.

It so happens that, in the IA example, the distributed representations at lower layers are of the featural variety, as discussed above. This, however, is not a crucial factor in the IA model's being labelled localist: The lower layers might have used distributed representations unamenable to a featural characterization without nullifying the fact that in the upper layer a localist code is used. The difference between localist and distributed models is most often not in the nature or status of the representation of the input patterns, which depends ultimately (in vivo) on the structure and function of the relevant sense organ(s), but in the nature of representation at the later stages of processing that input. As stated above, localists posit that certain cognitively meaningful entities will be represented in a local fashion at some, probably late, level of processing, and it is at this level that decisions about which entities are identifiable in any given input can best be made.

So can the term ``localist model'' be universally applied to models using localist representations? Not without care. Consider the model of reading proposed by Plaut, McClelland, Seidenberg and Patterson (1996). This was developed from the seminal model of Seidenberg and McClelland (1989), in which neither letters at the input nor phonemes at the output were represented in a local fashion. According to Plaut et al. it was this aspect of the model, among others, which manifested itself in its relatively poor nonword reading. Plaut et al. referred to this as the ``dispersion problem''. Perhaps, as Jacobs and Grainger (1994) rather archly suggest, it might better have been termed the distribution problem, given that Plaut et al.'s solution entailed a move to an entirely local scheme for both input orthography (letters and letter clusters) and output phonemes. And yet, even with this modification, it would be very misleading to call Plaut et al.'s a localist model: The most powerful and theoretically bold contribution of that model was to show that the mapping between orthographic representations of both words and nonwords and their pronunciations could be carried out in a distributed fashion, that is, without any recourse to either a locally represented mental lexicon or an explicit system of grapheme-to-phoneme correspondence rules. So whereas the Plaut et al. model was certainly localist at the letter and phoneme levels, it was undeniably distributed at the lexical level. It is for this reason that calling that model localist would be thoroughly misleading. I conclude that the term ``localist model'' should be used with care. In most cases, it will be better to be explicit about the entities for which localist coding is used (if any), and to identify the theoretical significance of this choice.

A further point should be made regarding localist models, again taking the IA model as our example. When a word is presented to the IA model, a large number of nodes will be maximally active -- those representing certain visual features, letters-in-position and the word itself. A number of other nodes will be partially active. On presentation of a nonword, no word-node will attain maximal activation but otherwise the situation will be much the same. The simple point is this: The fact that activity is distributed widely around the network should not lead to the incautious suggestion that the IA model is a distributed rather than a localist model. As noted earlier, it is important to distinguish between distributed processing and distributed representation. Having made this distinction we can better interpret labels that have been applied to other models in the literature, labels which might otherwise have the potential to confuse.

As an illustration, consider the Learning and Inference with Schemas and Analogies (LISA) model of Hummel and Holyoak (1997), as applied to the processing of analogy. The title of the paper (``Distributed Representations of Structure: A Theory of Analogical Access and Mapping'') might suggest that LISA is a fully distributed model, but a closer look reveals that it uses localist representation. For instance, in its representation of the proposition ``John loves Mary,'' there is a node corresponding to the proposition itself, and to each of the constituents ``John'', ``Mary'' and ``loves''; these nodes project in turn onto a layer of semantic units which are crucially involved in the analogy processing task. The whole network is hierarchically structured, with activity being distributed widely for any given proposition and, in this case, organized in time so as to reflect various bindings of, for example, subjects with predicates. (Most, if not all, models that use phase binding do so in the context of localist representation.) LISA thus constitutes a clear example of the interaction between localist representations of entities and a distributed or featural representation of semantics. As in the IA model, there is no contradiction between distributed processing and localist representation. At the risk of overstating the case, we can see exactly the same coexistence of local representation, distributed representation and distributed processing in what is often considered a quintessentially localist model, namely Quillian's (1968) model of semantics. Quillian's justly influential model did indeed represent each familiar word with a localist ``type'' unit. But a word's meaning was represented by an intricate web of structured connections between numerous tokens of the appropriate types, resulting, on activation of a given word-type, in a whole plane of intricately structured spreading activation through which semantic associative relationships could become apparent.

To summarize, a localist model of a particular type of entity (e.g. words) is characterized by the presence of (at least) one node which responds maximally to a given familiar (i.e. learned) example of that type (e.g. a given familiar word), all familiar examples of that type (e.g. all familiar words) being so represented. This does not preclude some redundancy in coding. For example, in the word example used here, it may be that various versions of the same word (e.g. different pronunciations) are each represented locally, though in many applications these various versions would be linked at some subsequent stage so as to reflect their lexical equivalence.

It is hoped that this definition of what constitutes a localist model will help to clarify issues of model taxonomy. Under this taxonomy, the term ``semilocalist'' would be as meaningless as the term ``semipregnant''. But what are we to make of representations that are described as ``sparse distributed'' or ``semidistributed''? It is rather difficult to answer this question in general because there is often no precise definition of what is meant by these terms. Sparse distributed representational schemes are frequently taken to be those for which few nodes activate for a given stimulus with few active nodes shared between stimuli, but this definition begs a lot of questions. For example, how does the definition apply to cases in which nodes have continuous rather than binary activations? To qualify as a sparse distributed representational scheme, are nodes required to activate to identical degrees for several different stimuli (cf. Kanerva's binary Sparse Distributed Memory, Kanerva 1988; Keeler 1988)? Or are nodes simply required to activate (i.e. significantly above baseline) for more than one stimulus? Certainly in areas in which the term ``sparse distributed'' is often employed, such as in the interpretation of the results of single-cell recording studies, the latter formulation is more consistent with what is actually observed. As will be pointed out later, however, it is not really clear what distinction can be made between a sparse distributed scheme defined in this way and the localist schemes discussed above -- after all, the localist IA model would be classified as sparse distributed under this looser but more plausible definition. If the class of sparse distributed networks is defined so as to include both localist and nonlocalist networks as subclasses (as is often the case), then statements advocating the use of sparse distributed representation cannot be interpreted as a rejection of localist models.

A similar problem exists with the term ``semidistributed.'' French (1992) discusses two systems he describes as semidistributed. The first is Kanerva's sparse distributed memory (Kanerva 1988), a network of binary neurons inspired more by a digital computer metaphor than by a biological metaphor, but which nonetheless shows good tolerance to interference (principally due to the similarities it shares with the localist models described here). The second is Kruschke's (1990) ALCOVE model, which (in its implemented version at least) would be classified under the present definition as localist. French developed a third type of semidistributed network, using an algorithm which sought to ``sharpen'' hidden unit activations during learning. Unfortunately, this semidistributed network only semisolved the interference problem to which it was addressed, in that even small amounts of later learning could interfere drastically with the ability to perform mappings learned earlier. What advantage there was to be gained from using a semidistributed network was primarily to be found in a measure of time to relearn the original associations -- some compensation but hardly a satisfactory solution to the interference problem itself.

It is informative to note that French's (1992) motivation for using a semidistributed rather than a localist network was based on his assumption that localist models acquire their well-known resistance to interference by sacrificing their ability to generalize. In what follows I will question this common assumption, and others regarding localist models, thus weakening the motivation to seek semidistributed solutions to problems which localist networks already solve.

3. Organization of the Argument

Before launching into the detail of my remaining argument, I will first signpost what can be expected of the remainder of this target article. This is necessary because, as will be seen, on the way to my conclusion I make some moderately lengthy, but I hope interesting, digressions. These digressions may seem especially lengthy to those for whom mathematical modelling is of little interest. Nonetheless, I hope that the end justifies the means, particularly since the approach adopted here results in a model which is practically equivalent to several mathematical models, but with most of the mathematics taken out.

It seems essential, in writing a paper such as this, to emphasize the positive qualities of localist models as much as to note the shortcomings of their fully distributed counterparts. In the next part of the paper, accordingly, I develop a generalized localist model which, despite its simplicity, is able to exhibit generalization and attractor behaviour, abilities more commonly associated with fully distributed models. This is important because the absence of these abilities is often cited as a reason for rejecting localist models. The generalized localist model is also able to perform stable supervised and unsupervised learning and qualitatively to model effects of age of acquisition, both of which appear difficult for fully distributed models. The model is further shown to exhibit properties compatible with some mathematical formulations of great breadth, such as the Luce choice rule and the ``power law of practice'', thus extending the potential scope of its application.

In later sections I consider why, given the power of localist models, some psychological modellers have been reluctant to use them. These parts of the paper identify what I believe to be common misconceptions in the literature, in particular, those based on conclusions drawn from the domain of neuroscience. Finally, I address some of the problems of a fully distributed approach and identify certain inadequacies in some of the measures that have been proposed to overcome these problems.

4. A Generalized Localist Model

In this section I shall describe, in general terms, a localist approach to both the unsupervised learning of representations and the supervised learning of pattern associations. In characterizing such a localist approach I have sought to generalize from a number of different models (e.g. Burton 1994; Carpenter & Grossberg 1987a; 1987b; Foldiak 1991; Kohonen 1984; Murre 1992; Murre, Phaf & Wolters 1992; Nigrin 1993; Rumelhart & Zipser 1986). These models differ in their details but are similar in structure and I shall attempt to draw together the best features of each. The resulting generalized model will not necessarily be immediately applicable to any particular research project but it will, I hope, have sufficient flexibility to be adapted to many modelling situations.

4.1 A Learning Module

As a first step in building a localist system, I will identify a very simple module capable of unsupervised, self-organized learning of individual patterns and/or pattern classes. This work draws heavily on the work of Carpenter and Grossberg and colleagues (e.g. Carpenter & Grossberg 1987a; 1987b; a debt that is happily acknowledged), with a number of simplifications. The module (see Figure 3) comprises two layers of nodes, $L_{1}$ and $L_{2}$, fully connected to each other by modifiable, unidirectional ($L_{1}$-to-$L_{2}$) connections, which, prior to learning, have small, random weights, $w_{ij}$. (Throughout the paper, $w_{ij}$ will refer to the weight of the connection from the $i^{\mbox{th}}$ node in the originating layer to the $j^{\mbox{th}}$ node in the receiving layer.) For simplicity of exposition, the nodes in the lower layer will be deemed to be binary, that is, to have activations (denoted $a_{i}$) either equal to zero or to one. The extension to continuous activations will usually be necessary and is easily achieved. The input to the nodes in the upper layer will simply be the sum of the activations at the lower layer weighted by the appropriate connection weights. In fact, for illustrative purposes, I shall assume here that this input to a given node is divided by a value equal to the sum of the incoming weights to that node plus a small constant (see e.g. Marshall 1990) -- this is just one of the many so-called ``normalization'' schemes typically used with such networks. Thus the input, $I_{j}$, to an upper-layer node is given by

\begin{displaymath}I_{j} = \frac{\sum_{\mbox{\tiny all i \normalsize }} a_{i}
w_{ij}}{(\sum_{\mbox{\tiny all i \normalsize }} w_{ij}) + \alpha}
\end{displaymath} (1)


where $\alpha$ is the small constant. Learning of patterns of activation at the lower layer, $L_{1}$, is simply achieved as follows. When a pattern of activation is presented at $L_{1}$, the inputs, $I_{j}$, to nodes in the upper layer, $L_{2}$, can be calculated. Any $L_{2}$ node whose vector of incoming weights is parallel to (i.e. a constant multiple of) the vector of activations at $L_{1}$ will have input, $I_{j}$, equal to $\frac{1}{1+ \alpha /(\sum_{\mbox{\tiny all i \normalsize }} w_{ij})}$. Any $L_{2}$ node whose vector of incoming weights is orthogonal to the current input vector (that is, nodes for which $w_{ij}=0$ where $a_{i}=1$) will have zero input. Nodes with weight vectors between these two extremes, whose weight vectors ``match'' the current activation vector to some nonmaximal extent, will have intermediate values of $I_{j}$. Let us suppose that, on presentation of a given $L_{1}$ pattern, no $L_{2}$ node achieves an input, $I_{j}$, greater than a threshold $\theta$. (With $\theta$ set appropriately, this supposition will hold when no learning has yet been carried out in the $L_{1}$-to-$L_{2}$ connections.) In this case, learning of the current input pattern will proceed. Learning will comprise setting the weights incoming to a single currently ``uncommitted'' $L_{2}$ node (i.e. a node with small, random incoming weights) to equal the corresponding activations at $L_{1}$ -- a possible mechanism is discussed later. The learning rule might thus be stated

\begin{displaymath}\frac{dw_{iJ}}{dt} = \lambda (a_{i}-w_{iJ})
\end{displaymath} (2)


where $J$ indexes the single $L_{2}$ node at which learning is being performed, $\lambda$ is the learning rate and, in the case of so-called ``fast learning'', the weight values reach their equilibrium values in one learning episode, such that $w_{iJ}=a_{i}$. The $L_{2}$ node indexed by $J$ is thereafter labelled as being ``committed'' to the activation pattern at $L_{1}$, and will receive its maximal input on subsequent presentation of that pattern.
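The learning step just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation of any of the cited models: the function names, the values of $\alpha$ and $\theta$, the layer sizes and the range of the initial random weights are all assumptions made for the example, and fast learning ($w_{iJ}=a_{i}$ in a single episode) is assumed throughout.

```python
import random

ALPHA = 0.01   # small normalization constant; illustrative value (assumption)
THETA = 0.9    # commitment threshold; illustrative value (assumption)

def node_input(activations, weights, alpha=ALPHA):
    """Equation 1: weighted sum divided by (sum of incoming weights + alpha)."""
    weighted = sum(a * w for a, w in zip(activations, weights))
    return weighted / (sum(weights) + alpha)

def present(pattern, weight_vectors, committed, theta=THETA):
    """Return the index of the L2 node that matches, or newly learns, `pattern`."""
    inputs = [node_input(pattern, w) for w in weight_vectors]
    best = max(range(len(inputs)), key=inputs.__getitem__)
    if inputs[best] > theta:
        return best                       # an existing node already matches
    free = committed.index(False)         # first uncommitted node (ValueError if none left)
    weight_vectors[free] = list(pattern)  # fast learning: w_iJ = a_i in one episode
    committed[free] = True
    return free

# Hypothetical module: 6 binary L1 nodes, 4 L2 nodes with small random weights.
rng = random.Random(0)
n_inputs, n_nodes = 6, 4
weight_vectors = [[rng.uniform(0.0, 0.001) for _ in range(n_inputs)]
                  for _ in range(n_nodes)]
committed = [False] * n_nodes

first = present([1, 1, 1, 0, 0, 0], weight_vectors, committed)
second = present([0, 0, 0, 1, 1, 1], weight_vectors, committed)
again = present([1, 1, 1, 0, 0, 0], weight_vectors, committed)
print(first, second, again)
```

With these settings the two distinct patterns are committed to different nodes, and the repeated pattern is recognized by the node that learned it rather than provoking further learning.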


  

Figure 3: A simple two-layer module.
\begin{figure}
\centerline{\epsfig{file = /home/agena10/mikep/Papers/Localist/SecondRevisionFiguress/module.ps , width=4.5in}}\end{figure}


The course of learning is largely determined by the setting of the threshold, $\theta$. This is closely analogous to the vigilance parameter in the adaptive resonance theory (ART) networks of Carpenter and Grossberg (1987a; 1987b), one version of which (ART2a, Carpenter, Grossberg & Rosen 1991) is very similar to the network described here. More particularly, if the threshold is set very high, for example $\theta = 1$, then each activation pattern presented will lead to a learning episode involving commitment of a previously uncommitted $L_{2}$ node, even if the same pattern has already been learned previously. If the threshold is set slightly lower, then only activation patterns sufficiently different from previously presented patterns will provoke learning. Thus novel patterns will come to be represented by a newly assigned $L_{2}$ node, without interfering with any learning that has previously been accomplished. This is a crucial point in the debate between localist and distributed modellers and concerns ``catastrophic interference'', a topic to which I shall return in greater depth later.

The value of the threshold, $\theta$, need not necessarily remain constant over time. This is where the concept of vigilance is useful. At times when the vigilance of a network is low, new patterns will be unlikely to be learned and responses (see below) will be based on previously acquired information. In situations where vigilance, $\theta$, is set high, learning of the current $L_{1}$ pattern is likely to occur. Thus learning can, to some extent, be influenced by the state of vigilance in which the network finds itself at the time of pattern presentation.
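The influence of the threshold on how many nodes become committed can be illustrated with a small sketch. The patterns, the value of $\alpha$ and the function names below are assumptions chosen for the example, and uncommitted nodes are simply ignored on the assumption that their small random weights never yield an input above threshold.

```python
def node_input(pattern, weights, alpha=0.01):
    """Equation 1 for one L2 node (alpha is an illustrative value)."""
    return sum(a * w for a, w in zip(pattern, weights)) / (sum(weights) + alpha)

def count_committed(patterns, theta, alpha=0.01):
    """Present patterns in order; commit a new node whenever no stored
    weight vector yields an input greater than theta."""
    stored = []
    for p in patterns:
        best = max((node_input(p, w, alpha) for w in stored), default=0.0)
        if best <= theta:
            stored.append(list(p))   # fast learning of the current pattern
    return len(stored)

patterns = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],   # exact repeat of the first pattern
    [1, 1, 1, 0, 1, 0, 0, 0],   # shares three of four active features with the first
    [0, 0, 0, 0, 1, 1, 1, 1],   # unrelated to the first pattern
]

for theta in (1.0, 0.9, 0.6):
    print(theta, count_committed(patterns, theta))
```

At $\theta = 1$ every presentation, including the repeat, commits a new node; at the intermediate setting only the repeat is absorbed; at the lowest setting the similar pattern is absorbed as well.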

In order to make these notions more concrete, it is necessary to describe what constitutes a response for the network described above. On presentation of a pattern of activation at $L_{1}$, the inputs to the $L_{2}$ nodes can be calculated as above. These inputs are then thresholded so that the net input, $I^{\mbox{\tiny net \normalsize }}_{j}$, to the upper-layer nodes is given by

\begin{displaymath}I^{\mbox{\tiny net \normalsize }}_{j} = \max (0, [ I_{j} - \theta ])
\end{displaymath} (3)


so that nodes with $I_{j}$ less than $\theta$ will receive no net input, other nodes receiving net input equal to the degree by which $I_{j}$ exceeds $\theta$. Given these net inputs, there are several options as to how we might proceed. One option is to allow the $L_{2}$ activations to equilibrate via the differential equation

\begin{displaymath}\frac{da_{j}}{dt}= -a_{j} + f(I^{\mbox{\tiny net \normalsize }}_{j})
\end{displaymath} (4)


which reaches equilibrium when $\frac{da_{j}}{dt} = 0$, that is when $a_{j}$ is some function, $f$, of the net input. A common choice is to assume that $f$ is the identity function, so that the activations equal the net inputs. Another option, which will be useful when some sort of decision process is required, is to allow the $L_{2}$ nodes to compete in some way. This option will be developed in some detail in the next section because it proves to have some very attractive properties. For the moment it is sufficient to note that a competitive process can lead to the selection of one of the $L_{2}$ nodes. This might be achieved by some winner-takes-all mechanism, by which the $L_{2}$ nodes compete for activation until one of them quenches the activation of its competitors and activates strongly itself. Or it may take the form of a ``horse-race'', by which the $L_{2}$ nodes race, under the influence of the bottom-up inputs, $I_{j}^{\mbox{\tiny net \normalsize }}$, to reach a criterion level of activation, $\chi$. Either way, in the absence of noise, we will expect the $L_{2}$ node with the greatest net input, $I_{j}^{\mbox{\tiny net \normalsize }}$, to be selected. In the case where the $J^{\mbox{th}}$ $L_{2}$ node is selected, the pattern at $L_{1}$ will be deemed to have fallen into the $J^{\mbox{th}}$ pattern class. (Note that in a high-$\theta$ regime there may be as many classes as patterns presented.) On presentation of a given input pattern, the selection (in the absence of noise) of a given $L_{2}$ node indicates that the current pattern is most similar to the pattern learned by the selected node, and that the similarity is greater than some threshold value.

To summarize, in its noncompetitive form, the network will respond such that the activations of $L_{2}$ nodes, in response to a given input ($L_{1}$) pattern, will equilibrate to values equal to some function of the degree of similarity between their learned pattern and the input pattern. In its competitive form the network performs a classification of the $L_{1}$ activation pattern, where the classes correspond to the previously learned patterns. This results in sustained or super-criterion activation ($a_{j} > \chi$) of the node that has previously learned the activation pattern most similar to that currently presented. In both cases, the network is self-organizing and unsupervised. ``Self-organizing'' refers to the fact that the network can proceed autonomously, there being, for instance, no separate phases for learning and for performance. ``Unsupervised'' is used here to mean that the network does not receive any external ``teaching'' signal informing it how it should classify the current pattern. As will be seen later, similar networks will be used when supervised learning is required. In the meantime, I shall simply assume that the selection of an $L_{2}$ node will be sufficient to elicit a response associated with that node (cf. Usher & McClelland 1995).
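Both response modes can be sketched briefly. The code below is an illustration under stated assumptions rather than a definitive implementation: the thresholding follows Equation 3, the noncompetitive mode integrates Equation 4 by simple Euler steps with $f$ taken to be the identity, and the competitive mode is reduced to its noise-free outcome (selection of the node with the greatest net input). The step size, number of steps, input values and function names are hypothetical.

```python
def net_inputs(inputs, theta):
    """Equation 3: the amount by which each input exceeds theta, floored at zero."""
    return [max(0.0, i - theta) for i in inputs]

def equilibrate(net, f=lambda x: x, dt=0.1, steps=200):
    """Euler integration of Equation 4, da_j/dt = -a_j + f(I_net_j), from a_j = 0."""
    a = [0.0] * len(net)
    for _ in range(steps):
        a = [aj + dt * (-aj + f(n)) for aj, n in zip(a, net)]
    return a

def classify(net):
    """Noise-free competition: index of the node with the greatest net input,
    or None when no node exceeds threshold."""
    best = max(net)
    return net.index(best) if best > 0.0 else None

raw_inputs = [0.95, 0.70, 0.30]          # hypothetical values of I_j
net = net_inputs(raw_inputs, theta=0.5)
activations = equilibrate(net)
print(net, activations, classify(net))
```

With the identity function the equilibrium activations simply reproduce the net inputs, whereas the competitive outcome reduces the same information to a single selected node.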

4.2 A Competitive Race

In this section I will give details of one way in which competition can be introduced into the simple module described above. Although the network itself is simple, I will show that it has some extremely interesting properties relating to choice accuracy and choice reaction-time. I make no claim to be the first to note each of these properties; nonetheless, I believe the power which in combination they afford has either gone unnoticed or has been widely underappreciated.

Competition in the layer $L_{2}$ is simulated using a standard ``leaky integrator'' model which describes how several nodes, each driven by its own input signal, activate in the face of decay and competition (i.e. inhibition) from each of the other nodes:

\begin{displaymath}\frac{da_{j}}{dt} = -Aa_{j} + (I^{\mbox{\tiny net \normalsize }}_{j} + N_{1}) +
f_{1}(a_{j}) - C \sum_{k \neq j} f_{2}(a_{k}) + N_{2}
\end{displaymath} (5)


where $A$ is a decay constant, $I^{\mbox{\tiny net \normalsize }}_{j}$ is the excitatory input to the $j^{\mbox{th}}$ node which is perturbed by zero-mean Gaussian noise, $N_{1}$, with variance $\sigma_{1}^{2}$, $f_{1}(a_{j})$ is a self-excitatory term, $C \sum_{k \neq j} f_{2}(a_{k})$ represents lateral inhibition from other nodes in $L_{2}$, and $N_{2}$ represents zero-mean Gaussian noise with variance $\sigma_{2}^{2}$. The value of the noise term $N_{1}$ remains constant over the time course of a single competition since it is intended to represent inaccuracies in ``measurement'' of $I_{j}$. By contrast, the value of $N_{2}$ varies with each time step, representing moment-by-moment noise in the calculation of the derivative. Such an equation has a long history in neural modelling, featuring strongly in the work of Grossberg from the mid 1960s onwards and later in, for instance, the cascade equation of McClelland (1979).
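A discrete-time simulation gives a feel for how such a competition unfolds. The sketch below is a rough illustration, not a reproduction of any published simulation: it assumes Euler integration, takes both $f_{1}$ and $f_{2}$ to be rectified-linear functions (with a hypothetical gain on the self-excitation), scales the fast-varying noise by the square root of the time step, and stops when any node reaches the criterion $\chi$. All parameter values and names are assumptions.

```python
import random

def race(net_inputs, A=1.0, C=1.0, self_gain=0.5, sigma1=0.0, sigma2=0.0,
         criterion=0.8, dt=0.01, max_steps=100000, rng=random):
    """Simulate Equation 5; return (winning node index, time taken)."""
    # N_1: input noise, sampled once and held fixed for the whole competition.
    n1 = [rng.gauss(0.0, sigma1) for _ in net_inputs]
    a = [0.0] * len(net_inputs)
    for step in range(1, max_steps + 1):
        out = [max(0.0, x) for x in a]          # assumed form of f_1 and f_2
        total = sum(out)
        new_a = []
        for j, aj in enumerate(a):
            drive = (-A * aj + (net_inputs[j] + n1[j])
                     + self_gain * out[j] - C * (total - out[j]))
            n2 = rng.gauss(0.0, sigma2) * dt ** 0.5   # N_2: fresh on every step
            new_a.append(aj + dt * drive + n2)
        a = new_a
        winner = max(range(len(a)), key=a.__getitem__)
        if a[winner] >= criterion:
            return winner, step * dt
    return None, max_steps * dt                 # no node reached criterion

# Noise-free runs: a close contest and an easy one (hypothetical net inputs).
hard_winner, hard_time = race([1.0, 0.8])
easy_winner, easy_time = race([1.0, 0.2])
print(hard_winner, hard_time, easy_winner, easy_time)
```

In the absence of noise the node with the larger net input wins in both runs, and it reaches criterion sooner when its competitor is weaker. Setting `sigma1` or `sigma2` above zero, with a seeded `random.Random` passed as `rng`, makes both the outcome and the finishing time variable from trial to trial.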

4.2.1 Reaction Time

Recently, Usher and McClelland (1995) have used such an equation to model the time-course of perceptual choice. They show that, in simulating various two-alternative forced choice experiments, the above equation subsumes optimal classical diffusion processes (e.g. Ratcliff 1978) when a response criterion is placed on the difference between the activations, $a_{j}$, of two competing nodes. Moreover, they show that near optimal performance is exhibited when a response criterion is placed on the absolute value of the activations (as opposed to the difference between them) in cases where, as here, mutual inhibition is assumed. The latter case is easily extensible to multiway choices. This lateral inhibitory model therefore simulates the process by which multiple nodes, those in $L_{2}$, can activate in response to noisy, bottom-up, excitatory signals, $I^{\mbox{\tiny net
\normalsize }}$, and compete until one of the nodes reaches a response criterion based on its activation alone. Usher and McClelland (1995) have thus shown that a localist model can give a good account of the time course of multiway choices.

4.2.2 Accuracy and the Luce Choice Rule

Another interesting feature of the lateral inhibitory equation concerns the accuracy with which it is able to select the appropriate node (preferably that with the largest bottom-up input, $I_{j}$) in the presence of input noise, $N_{1}$, and fast-varying noise, $N_2$. Simulations show that in the case where the variances of the two noise terms are approximately equal, the effects of the input noise, $N_{1}$, dominate -- this is simply because the leaky integrator tends to act so as to filter out the effects of the fast-varying noise, $N_2$. As a result, the competitive process tends to ``select'' that node which receives the maximal noisy-input, $(I^{\mbox{\tiny net \normalsize }}_{j} + N_{1})$. This process, by which a node is chosen by adding zero-mean Gaussian noise to its input term and picking the node with the largest resultant input, is known as a Thurstonian process (Thurstone 1927, Case V). Implementing a Thurstonian process with a lateral-inhibitory network of leaky integrators, as above, rather than by simple computation, allows the dynamics as well as the accuracy of the decision process to be simulated.
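Stripped of its dynamics, the Thurstonian process is simple to state computationally. The following sketch (function names, inputs, noise level and trial count are all assumptions) adds independent zero-mean Gaussian noise to each input and picks the largest, then compares the observed choice frequency for a two-alternative case with the value expected from the Normal distribution function.

```python
import math
import random

def thurstonian_pick(inputs, sigma, rng):
    """Add zero-mean Gaussian noise to each input and pick the largest."""
    noisy = [x + rng.gauss(0.0, sigma) for x in inputs]
    return max(range(len(noisy)), key=noisy.__getitem__)

def choice_frequencies(inputs, sigma, trials=20000, seed=0):
    """Monte Carlo estimate of how often each alternative is picked."""
    rng = random.Random(seed)
    counts = [0] * len(inputs)
    for _ in range(trials):
        counts[thurstonian_pick(inputs, sigma, rng)] += 1
    return [c / trials for c in counts]

# Two alternatives whose inputs differ by 1.0, each with noise of sd 1.0.
freqs = choice_frequencies([1.0, 0.0], sigma=1.0)

# The difference of the two noise samples has sd sqrt(2), so the expected
# probability of picking the first is Phi(1 / sqrt(2)) = 0.5 * (1 + erf(1 / 2)).
expected = 0.5 * (1.0 + math.erf(0.5))
print(freqs, expected)
```

The same function applies unchanged to more than two alternatives, which is the sense in which the procedure extends naturally to multiway choice.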

The fact that the competitive process is equivalent to a classical Thurstonian (noisy-pick-the-biggest) process performed on the inputs, $I_{j}$, is extremely useful because it allows us to make a link with the Luce choice rule (Luce 1959), ubiquitous in models of choice behaviour. This rule states that, given a stimulus indexed by $i$, a set of possible responses indexed by $j$, and some set of similarities, $\eta_{ij}$ between the stimulus and those stimuli associated with each member of the response set, then the probability of choosing any particular response $J$ when presented with a stimulus $i$ is

\begin{displaymath}p(J\vert i) = \frac{\eta_{iJ}}{\sum_{\mbox{\tiny all k \normalsize }} \eta_{ik}}
\end{displaymath} (6)


Naturally this ensures that the probabilities add to one across the whole response set. Shepard (1958; 1987) has proposed a law of generalization that states, in this context, that the similarities of two stimuli are an exponential function of the distance between those two stimuli in a multidimensional space. (The character of the multidimensional space that the stimuli inhabit can be revealed by multidimensional scaling applied to the stimulus-response confusion matrices.) Thus $\eta_{ij} = e^{-d_{ij}}$ where

 \begin{displaymath}
d_{ij} = \left[ \sum_{m=1}^{M} c_{m} \vert i_{m} - j_{m}\vert^{r} \right] ^{1/r}
\end{displaymath} (7)


where the distance is measured in $M$-dimensional space, $i_{m}$ represents the coordinate of stimulus $i$ along the $m^{\mbox{\tiny th \normalsize }}$ dimension and $c_{m}$ represents a scaling parameter for distances measured along dimension $m$. The scaling parameters simply weight the contributions of different dimensions to the overall distance measure, much as one might weight differently various factors such as reliability and colour when choosing a new car. Equation 7 is known as the Minkowski power model formula and $d_{ij}$ reduces to the ``city-block'' distance for $r=1$ and the Euclidean distance for $r=2$, these two measures being those most commonly used.
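Equations 6 and 7 combine into a short computation, sketched here for concreteness. The function names and the example coordinates are hypothetical; the formulas themselves follow the text, with similarity taken to be $e^{-d_{ij}}$.

```python
import math

def minkowski(x, y, c, r=2.0):
    """Equation 7: scaled Minkowski distance between points x and y."""
    return sum(cm * abs(xm - ym) ** r for cm, xm, ym in zip(c, x, y)) ** (1.0 / r)

def luce_probabilities(stimulus, exemplars, c, r=2.0):
    """Equation 6 with exponential generalization, eta = exp(-d)."""
    sims = [math.exp(-minkowski(stimulus, e, c, r)) for e in exemplars]
    total = sum(sims)
    return [s / total for s in sims]

scaling = (1.0, 1.0)                       # hypothetical scaling parameters c_m
euclidean = minkowski((0.0, 0.0), (3.0, 4.0), scaling, r=2.0)
city_block = minkowski((0.0, 0.0), (3.0, 4.0), scaling, r=1.0)

# One stored exemplar per response; the test stimulus lies nearer the first.
probs = luce_probabilities((0.0, 0.0), [(1.0, 0.0), (0.0, 2.0)], scaling)
print(euclidean, city_block, probs)
```

Increasing every entry of `scaling` by a common factor stretches the space and, with it, the difference between the distances, so that the nearer exemplar is chosen with higher probability.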

So how does the Luce choice rule acting over exponentiated distances relate to the Thurstonian (noisy-choice) process described above? The illustration is easiest to perform for the case of two-alternative choice, and is the same as that found in McClelland (1991). Suppose we have a categorization experiment in which a subject sees one exemplar of a category A and one exemplar of a category B. We then present the subject with a test exemplar, $T$, and ask them to decide whether it should be categorized as being from category A or from category B. Suppose further that each of the three exemplars can be represented by a point in an appropriate multidimensional space such that the test exemplar lies at a distance $d_{A}$ from the A exemplar and $d_{B}$ from the B exemplar. This situation is illustrated for a two-dimensional space in Figure 4. Note that coordinates on any given dimension are given in terms of the relevant scaling parameters, with increase of a given scaling parameter resulting in an increase in magnitude of distances measured along that dimension and contributing to the overall distance measures $d_{A}$ and $d_{B}$. (It is for this reason that an increase of scaling parameter along a given dimension is often described as a stretching of space along that dimension.) The Luce choice rule with exponential generalization implies that the probability of placing the test exemplar in category A is

\begin{displaymath}p(\mbox{test is an A}) = \frac{e^{-d_{A}}}{e^{-d_{A}} + e^{-d_{B}}}
\end{displaymath} (8)


dividing through by $e^{-d_{A}}$ gives

 \begin{displaymath}
p(\mbox{test is an A}) = \frac{1}{1 + e^{d_{A}-d_{B}}}
\end{displaymath} (9)


which equals 0.5 when $d_{A}=d_{B}$. This function is called the logistic function and it is extremely similar to the function describing the (scaled) cumulative area under a Normal (i.e. Gaussian) curve. This means that there is a close correspondence between the two following procedures for probabilistically picking one of two responses at distances $d_{A}$ and $d_{B}$ from the current stimulus: one can either add Gaussian noise to $d_{A}$ and $d_{B}$ and pick the category corresponding to, in this case, the smallest resulting value (a Thurstonian process); or one can exponentiate the negative distances and pick using the Luce choice rule. The two procedures will not give identical results, but in most experimental situations will be indistinguishable (e.g. Kornbrot 1978; Luce 1959; Nosofsky 1985; van Santen & Bamber 1981; Yellott 1977). (In fact, if the noise is double exponential rather than Gaussian the correspondence is exact, see Yellott 1977.)
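The closeness of the two procedures can be checked numerically for the two-alternative case. In the sketch below the Thurstonian probability is computed analytically from the Normal distribution function, on the assumption that independent noise of standard deviation $\sigma$ is added to each distance. The value of $\sigma$ is chosen using the standard approximation constant 1.702 that relates the logistic function to the Normal distribution function; this constant and the function names are not taken from the present article.

```python
import math

def luce_p_a(d_a, d_b):
    """Equation 9: logistic function of the difference in distances."""
    return 1.0 / (1.0 + math.exp(d_a - d_b))

def thurstone_p_a(d_a, d_b, sigma):
    """Probability that d_a plus noise is smaller than d_b plus noise, for
    independent N(0, sigma^2) noise (the difference has sd sigma * sqrt(2))."""
    z = (d_b - d_a) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Noise level chosen so that the sd of the noise difference is 1.702.
SIGMA = 1.702 / math.sqrt(2.0)

differences = [x / 10.0 for x in range(-60, 61)]      # d_B - d_A from -6 to 6
max_gap = max(abs(luce_p_a(0.0, d) - thurstone_p_a(0.0, d, SIGMA))
              for d in differences)
print(max_gap)
```

Over this range the two choice probabilities never differ by more than about one percentage point, a discrepancy smaller than most categorization experiments could detect.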


  

Figure 4: Locations of exemplars A and B and test pattern T in a two-dimensional space. The distances $d_{A}$ and $d_{B}$ are Euclidean distances and coordinates on each dimension are given in terms of scaling parameters, $c_{i}$.
\begin{figure}
\centerline{\epsfig{file = /home/agena10/mikep/Papers/Localist/SecondRevisionFiguress/2d_space.ps, width=4.5in}}\end{figure}


The consequences for localist modelling of this correspondence, which extends to multichoice situations, are profound. Two things should be borne in mind. First, that a point in multidimensional space can be represented by a vector of activations across a layer of nodes, say the $L_{1}$ layer of the module discussed earlier, and/or by a vector of weights, perhaps those weights connecting the set of $L_{1}$ nodes to a given $L_{2}$ node. Second, that taking two nodes with activations $d_{A}$ and $d_{B}$, adding zero-mean Gaussian noise and picking the node with the smallest resulting activation is equivalent, in terms of the node chosen, to taking two nodes with activations $(k - d_{A})$ and $(k - d_{B})$ (where $k$ is a constant), adding the same zero-mean Gaussian noise and this time picking the node with the largest activation. Consequently, suppose that we have two $L_{2}$ nodes, such that each has a vector of incoming weights corresponding to the multidimensional vector representing one of the training patterns, one node representing pattern $A$, the other node representing pattern $B$. Further, suppose that on presentation of the pattern of activations corresponding to the test pattern, $T$, at layer $L_{1}$, the inputs, $I_{j}$, to the two nodes are equal to $(k - d_{A})$ and $(k - d_{B})$ respectively (this is easily achieved). Under these circumstances, a Thurstonian process, like that described above, which noisily picks the $L_{2}$ node with the biggest input, will give a pattern of probabilities of classifying the test pattern either as an $A$ or, alternatively, as a $B$, which is in close correspondence to the pattern of probabilities that would be obtained by application of the Luce choice rule to the exponentiated distances, $e^{-d_{ij}}$ (Equation 9).

4.2.3 Relation to Models of Categorization

This correspondence means that mathematical models, such as Nosofsky's generalized context model (Nosofsky 1986), can be condensed, doing away with the stage in which distances are exponentiated, and the stage in which these exponentiated values are manipulated, in the Luce formulation, to produce probabilities (a stage for which, to my knowledge, no simple ``neural'' mechanism has been suggested 1), leaving a basic Thurstonian noisy-choice process, like that above, acting on (a constant minus) the distances themselves. Since the generalized context model (Nosofsky 1986) is, under certain conditions, mathematically equivalent to the models of Estes (1986), Medin and Schaffer (1978), Oden and Massaro (1978), Fried and Holyoak (1984) and others (see Nosofsky 1990), all these models can, under those conditions, similarly be approximated to a close degree by the simple localist connectionist model described here. A generalized exemplar model is obtained under high-$\theta$ (i.e. high threshold or high vigilance) conditions, when all training patterns are stored as weight vectors abutting a distinct $L_{2}$ node (one per node).

4.2.4 Effects of Multiple Instances

We can now raise the question of what happens when, in the simple two-choice categorization experiment discussed above, the subject sees multiple presentations of each of the example stimuli before classifying the test stimulus. To avoid any ambiguity I will describe the two training stimuli as ``exemplars'' and each presentation of a given stimulus as an ``instance'' of the relevant exemplar. For example, in a given experiment a subject might see ten instances of a single A-category exemplar and ten instances of a single B-category exemplar. Let us now assume that a high-$\theta$ classification network assigns a distinct $L_{2}$ node to each of the ten instances of the category A exemplar and to each of the ten instances of the category B exemplar. There will thus be twenty nodes racing to classify any given test stimulus. For simplicity, we can assume that the learned, bottom-up weight vectors to $L_{2}$ nodes representing instances of the same exemplar are identical. On presentation of the test stimulus, which lies at distance $d_{A}$ (in multidimensional space) from instances of the category A exemplar and distance $d_{B}$ from instances of the category B exemplar, the inputs to the $L_{2}$ nodes representing category A instances will be $(k - d_{A}) + N_{1j}$, where $N_{1}$ is, as before, the input-noise term and the subscript $j$ ($1 \leq j \leq 10$) indicates that the zero-mean Gaussian noise term will have a different value for each of the ten nodes representing different instances. The $L_{2}$ nodes representing instances of the category B exemplar will likewise have inputs equal to $(k - d_{B}) + N_{1j}$ for $11 \leq j \leq 20$. So what is the effect on performance of adding these extra instances of each exemplar? The competitive network will once again select the $L_{2}$ node with the largest noisy input. It turns out that, as more and more instances of each exemplar are added, two things happen: firstly, the noisy-pick-the-biggest process becomes an increasingly better approximation to the Luce formulation, until for an asymptotically large number of instances the correspondence is exact; secondly, performance, as measured by the probability of picking the category whose exemplar falls closest to the test stimulus, improves, a change which is equivalent to stretching the multidimensional space in the Luce formulation by increasing by a common multiplier the values of all the scaling parameters $c_{m}$ in the distance calculation given in Equation 7. For the mathematically inclined, I note that both these effects come about due to the fact that, for large $N$, the maximum value out of $N$ samples from a Gaussian distribution is itself cumulatively distributed, to a good approximation, as a double exponential, $\exp (-e^{-ax})$, where $a$ is a constant. The distribution of differences between two values drawn from the corresponding density function is a logistic function, comparable with that implicit in the Luce choice rule (for further details see e.g. Yellott 1977; Page & Nimmo-Smith in preparation).
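The improvement in performance with additional instances can be checked with a small Monte Carlo sketch. The function names, the value of $k$, the noise standard deviation and the example distances below are illustrative assumptions rather than values used in the simulations referred to in the text.

```python
import random


def noisy_pick_biggest(d_a, d_b, n_instances, sigma, k=10.0, rng=random):
    """Return the category ('A' or 'B') of the L2 node with the largest noisy input.

    Each category contributes n_instances nodes; every node has its own
    independent zero-mean Gaussian noise sample.
    """
    best_a = max(k - d_a + rng.gauss(0.0, sigma) for _ in range(n_instances))
    best_b = max(k - d_b + rng.gauss(0.0, sigma) for _ in range(n_instances))
    return "A" if best_a > best_b else "B"


def prob_a(d_a, d_b, n_instances, sigma=1.0, trials=20000, seed=0):
    """Estimate the probability of an 'A' response by repeated simulation."""
    rng = random.Random(seed)
    wins = sum(
        noisy_pick_biggest(d_a, d_b, n_instances, sigma, rng=rng) == "A"
        for _ in range(trials)
    )
    return wins / trials


if __name__ == "__main__":
    # The test stimulus is closer to the A exemplar (d_a < d_b).
    for n in (1, 10, 100):
        print(n, prob_a(1.0, 2.0, n))
```

With the test stimulus closer to the category A exemplar, the estimated probability of an A response rises as instances are added, even though the distances and the noise level are unchanged.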

To summarize, for single instances of each exemplar in a categorization experiment, performance of the Thurstonian process is a close enough approximation to that produced by the Luce choice rule to make the two difficult to distinguish by experiment. As more instances of the training exemplars are added to the network, the Thurstonian process makes a rapid approach towards asymptotic performance that is precisely equivalent to that produced by application of the Luce choice rule to a multidimensional space that has been uniformly ``stretched'' (by increasing by a common multiplier the values of all the scaling parameters $c_{m}$ in Equation 7) relative to that space (i.e. that set of scaling parameters) that might have been inferred from application of the same choice rule to the pattern of responses found after only a single instance of each exemplar had been presented. It should be noted that putting multiple noiseless instances into the Luce choice rule will not produce an improvement in performance relative to that choice rule applied to single instances -- in mathematical terminology, the Luce choice rule is insensitive to uniform expansion of the set (Yellott 1977).
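The insensitivity of the Luce choice rule to uniform expansion of the set is easy to verify directly. In the sketch below, the instance counts `n_a` and `n_b` and the single multiplier `c` (standing in, as a simplifying assumption, for a common scaling of the $c_{m}$) are names introduced here for illustration.

```python
import math


def luce_prob_a(d_a, d_b, n_a=1, n_b=1, c=1.0):
    """Luce choice rule with n_a and n_b noiseless copies of each exemplar.

    c is a single distance multiplier standing in for a common scaling of
    the dimensional parameters (an assumption of this sketch).
    """
    s_a = n_a * math.exp(-c * d_a)
    s_b = n_b * math.exp(-c * d_b)
    return s_a / (s_a + s_b)


if __name__ == "__main__":
    print(luce_prob_a(1.0, 2.0))                   # one instance of each exemplar
    print(luce_prob_a(1.0, 2.0, n_a=10, n_b=10))   # ten of each: same probability
    print(luce_prob_a(1.0, 2.0, c=2.0))            # stretched space: higher accuracy
```

The common factor of ten cancels in the ratio, so under the Luce rule only a change in the scaling of distances, not a uniform increase in the number of instances, alters the choice probability.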

Simulations using the Luce formulation (e.g. Nosofsky 1987) have typically used uniform multiplicative increases in the values of the dimensional scaling parameters (the $c_{m}$ in Equation 7) to account for performance improvement over training blocks. The Thurstonian process described here, therefore, has the potential advantage that this progressive space-stretching is a natural feature of the model as more instances are learned. Of course, until the model has been formally fitted to experimental data the suggestion of such an advantage must remain tentative - indeed early simulations of data from Nosofsky (1987) suggest that some parametric stretching of stimulus space is still required to maintain the excellent model-to-data fits that Nosofsky achieved (Page & Nimmo-Smith in preparation). Nonetheless, the present Thurstonian analysis potentially unites a good deal of data, as well as raising a fundamental question regarding Shepard's ``universal law of generalization'': could it be that the widespread success encountered when a linear response rule (Luce) is applied to representations with exponential generalization-gradients in multidimensional stimulus-space, is really a consequence of a Thurstonian decision process acting on an exemplar model, in which each of the exemplars actually responds with a linear generalization gradient? It is impossible in principle to choose experimentally between these two characterizations for experiments containing a reasonable number of instances of each exemplar; the Thurstonian (noisy-pick-the-biggest) approach has the advantage that its ``neural'' implementation is, it appears, almost embarrassingly simple.

4.2.5 The Power Law of Practice

On the basis of the work described above, we can conclude that a simple localist, competitive model is capable of modelling data relating to both choice reaction-time and choice probability. In this section I will make a further link with the so-called ``power-law of practice''. There is a large amount of data which support the conclusion that reaction-time varies as a power function of practice, that is $\mbox{RT} = A + B N^{-c}$, where $N$ is the number of previous practice trials, and $A$, $B$ and $c$ are positive constants (see Logan 1992 for a review). In a number of papers, Logan (1988; 1992) has proposed that this result can be modelled by making the following assumptions. First, each experience with a stimulus pattern is obligatorily stored in memory and associated with the appropriate response. Second, on later presentation of that stimulus pattern, all representations of that stimulus race with each other to reach a response criterion, the response time being the time at which the first of the representations reaches that criterion. The distribution of times-to-criterion for a given representation is assumed to be the same for all representations of a given stimulus, and independent of the number of such representations. Logan has shown that if the time-to-criterion for a given stimulus representation is distributed as a Weibull function (which, within the limits of experimental error, it is -- see Logan 1992), then, using the minimum-value theorem, the distribution of first-arrival times for a number of such representations is a related Weibull function, giving a power function of practice. He has collected a variety of data which indicate that this model gives a good fit to data obtained from practiced tasks, and thus a good fit to the power law of practice (e.g. Logan 1988; 1992). At several points, Logan makes analogies with the exemplar models of categorization and identification discussed above but does not develop the point further.
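The step from Weibull-distributed finishing times to a power function of practice can be made concrete. The minimum of $N$ independent Weibull variables with a common scale and shape is itself Weibull, with the scale multiplied by $N^{-1/\mbox{shape}}$, so the mean first-arrival time falls as a power function of $N$ with exponent $c = 1/\mbox{shape}$. The sketch below checks this by simulation; the function names and parameter values are illustrative assumptions, and any additive constant $A$ is omitted.

```python
import math
import random


def mean_first_arrival(n_instances, scale=1.0, shape=2.0):
    """Expected minimum of n i.i.d. Weibull(scale, shape) finishing times."""
    return scale * n_instances ** (-1.0 / shape) * math.gamma(1.0 + 1.0 / shape)


def simulate_first_arrival(n_instances, scale=1.0, shape=2.0, trials=20000, seed=0):
    """Monte Carlo estimate of the mean time at which the first instance finishes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # random.weibullvariate takes (scale, shape) in that order.
        total += min(rng.weibullvariate(scale, shape) for _ in range(n_instances))
    return total / trials


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, mean_first_arrival(n), simulate_first_arrival(n))
```

With a shape parameter of 2, quadrupling the number of stored instances halves the mean first-arrival time.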

This link, between Logan's instance model and exemplar models of categorization, has, however, been most fully developed in two recent papers by Robert Nosofsky and Thomas Palmeri (Nosofsky & Palmeri 1997; Palmeri 1997). While they draw on Logan's instance theory of reaction-time (RT) speed-up, they identify a flaw in its behaviour. Their criticism involves observing that Logan's theory does not take stimulus similarity into account in its predictions regarding RT. To illustrate their point, suppose we have trained our subject, as above, with ten instances of a category-A exemplar and ten instances of a category-B exemplar. Consistent with Logan's instance theory, had we tested performance on the test exemplar at various points during training we would have observed some RT speed-up -- the RT is decided by the time at which the first of the multiple stored instances crosses the winning line in a horse-race to criterion. The time taken for this first-crossing decreases as a power-law of the number of instances. But what happens if we now train with one or more instances of an exemplar which is very similar to the category-A exemplar, but is indicated as belonging to category-B? In Logan's model, the addition of this third exemplar can only speed up responses when the category-A exemplar is itself presented as the test stimulus. But intuitively, and actually in the data, such a manipulation has the opposite effect, that is, it slows responses.

Nosofsky and Palmeri (1997) solve this problem by introducing the exemplar-based random walk model (EBRW). In this model, each training instance is represented in memory, just as in Logan's model. Unlike in Logan's model, however, the response to a given stimulus is not given by a single race-to-criterion. For exemplar similarities to have the appropriate effect, two changes are made: First, each exemplar races with a speed proportional to its similarity to the test stimulus. Second, the result of a given race-to-criterion does not have an immediate effect on the response but instead drives a random-walk category-decision process similar to that found in Ratcliff's (1978) diffusion model -- multiple races-to-criterion are held consecutively, with their results accumulating until a time when the cumulative number of results indicating one category exceeds the cumulative number indicating the other category by a given margin. Briefly, as the number of instances increases, each of the races-to-criterion takes a shorter time, as in the Logan model; to the extent that the test stimulus falls clearly into one category rather than another, the results of consecutive races will be consistent in indicating the same category and the overall response criterion will be more quickly reached.
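The two ingredients just described, similarity-dependent races feeding a random walk, can be sketched as follows. This is a simplified illustration and not Nosofsky and Palmeri's implementation: the exponentially distributed race times, the fixed per-step constant, the criterion of 3 and the example similarity values are all assumptions made here.

```python
import math
import random


def ebrw_trial(similarities, criterion=3, step_constant=0.1, rng=random):
    """One random-walk decision driven by repeated races between instances.

    similarities: list of (category, similarity) pairs, one per stored
    instance, with category 'A' or 'B'. Returns (response, decision_time).
    """
    counter, elapsed = 0, 0.0
    while abs(counter) < criterion:
        # Each instance finishes after an exponential time whose rate is its
        # similarity to the test stimulus; the fastest instance wins the race.
        winner_time, winner_cat = min(
            (rng.expovariate(sim), cat) for cat, sim in similarities
        )
        counter += 1 if winner_cat == "A" else -1
        elapsed += step_constant + winner_time
    return ("A" if counter > 0 else "B"), elapsed


def mean_decision_time(similarities, trials=2000, seed=0):
    rng = random.Random(seed)
    return sum(ebrw_trial(similarities, rng=rng)[1] for _ in range(trials)) / trials


# The test stimulus is the category-A exemplar itself (similarity 1.0).
BASE = [("A", 1.0)] * 10 + [("B", math.exp(-3.0))] * 10
# Add ten instances of a confusable exemplar that is labelled category B.
CONFUSABLE = BASE + [("B", math.exp(-0.3))] * 10

if __name__ == "__main__":
    print(mean_decision_time(BASE), mean_decision_time(CONFUSABLE))
```

In this sketch, adding the confusable category-B instances makes each individual race faster but makes successive race outcomes inconsistent, so that the walk takes more steps and the mean decision time increases.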

I now suggest an extension of the Thurstonian model developed earlier which accounts, qualitatively at least, for the data discussed by Nosofsky and Palmeri (1997) and addresses the problems inherent in Logan's instance model. The network is depicted in Figure 5. The activation pattern across the lower layer, $L_{1}$, is equivalent to the vector in multidimensional space representing the current test stimulus. Each node in the middle layer $L_{2}$ represents a previously learned instance and activates to a degree inversely and linearly related to the distance between that learned instance and the current test stimulus (as before), plus zero-mean Gaussian noise, $N_{1}$. The third layer of nodes, $L_{3}$, contains a node for each of the possible category responses; $L_{2}$ nodes are linked by a connection of unit strength to the appropriate $L_{3}$ response node and to no other. The effective input to each of the $L_{3}$ nodes is given by the maximum activation value across the connected $L_{2}$ nodes. Thus the input driving a category-A response will be the maximum activation across those $L_{2}$ nodes associated with a category-A response. The response nodes in $L_{3}$ compete using lateral-inhibitory leaky-integrator dynamics, as before. This competition, in the absence of large amounts of fast-varying noise, will essentially pick the response with the largest input, thus selecting, as before, the category associated with the $L_{2}$ instance node with the largest noise-perturbed activation. The selection process therefore gives identical results, in terms of accuracy of response, to the simple Thurstonian model developed earlier and hence maintains its asymptotic equivalence with the Luce choice rule. The reaction time that elapses before a decision is made depends on two things: the number of instance representations available in $L_{2}$; and the strength of competition from alternative responses.
As more instances become represented at $L_{2}$, the maximum value of the noisy activations creeps up, according to the maximum-value theorem, thus speeding highly practiced responses. To the extent that a given test stimulus falls near to instances of two different categories, the lateral inhibitory signals experienced by the response node that is eventually selected will be higher, thus delaying the response for difficult-to-make decisions, as required by the data.
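A minimal two-category sketch of this module is given below. It is not the simulation code behind the results reported here: the particular leaky-integrator equation, the leak and inhibition strengths, the response criterion, the Euler step size and the value of $k$ are all assumptions chosen for illustration.

```python
import random


def l3_inputs(d_a, d_b, n_instances, k=6.0, sigma=1.0, rng=random):
    """Input to each L3 response node: the largest noisy L2 activation feeding it."""
    in_a = max(k - d_a + rng.gauss(0.0, sigma) for _ in range(n_instances))
    in_b = max(k - d_b + rng.gauss(0.0, sigma) for _ in range(n_instances))
    return in_a, in_b


def race(in_a, in_b, criterion=3.0, leak=1.0, inhibition=2.0, dt=0.01, max_time=50.0):
    """Euler-integrate two mutually inhibiting leaky integrators.

    Stops when either activation reaches the criterion (or at max_time) and
    returns (response, elapsed_time). Activations are clipped at zero.
    """
    x_a = x_b = 0.0
    t = 0.0
    while t < max_time:
        dx_a = in_a - leak * x_a - inhibition * x_b
        dx_b = in_b - leak * x_b - inhibition * x_a
        x_a = max(0.0, x_a + dt * dx_a)
        x_b = max(0.0, x_b + dt * dx_b)
        t += dt
        if x_a >= criterion or x_b >= criterion:
            break
    return ("A" if x_a >= x_b else "B"), t


def mean_rt(d_a, d_b, n_instances, trials=300, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += race(*l3_inputs(d_a, d_b, n_instances, rng=rng))[1]
    return total / trials


if __name__ == "__main__":
    print([round(mean_rt(1.0, 2.0, n), 3) for n in (1, 5, 50)])  # effect of practice
    print(mean_rt(1.0, 1.2, 10), mean_rt(1.0, 3.0, 10))          # hard vs easy decision
```

Under these assumptions the mean decision time falls as instances are added, and is longer when the test stimulus lies almost equally close to instances of both categories.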


  

Figure 5: A module which displays power-law speed-up with practice.


Does the reaction-time speed-up with practice exhibited by this model fit the observed power law? I have done many simulations, under a variety of conditions, all of which produced the pattern of results shown in the graph in Figure 6. The graph plots the mean reaction time, taken over 1000 trials, against number of instances of each response category. As can be seen, the speed-up in reaction time with practice is fitted very well by a power function of practice, $A + B N^{-c}$. The fact that the time axis can be arbitrarily scaled, and the exponent of the power curve can be fitted by varying the signal-to-noise ratio on the input signals, bodes well for the chances of fitting this Thurstonian model to the practice data -- this is the subject of current work. We know already that this signal-to-noise ratio also determines the accuracy with which the network responds, and the speed with which this accuracy itself improves with practice. While it is clear that it will be possible to fit either the accuracy performance or the RT performance with a given set of parameters, it remains to be seen whether a single set of parameters will suffice for fitting both simultaneously.
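For readers wishing to check such a fit on their own simulated reaction times, one simple way to estimate $A$, $B$ and $c$ is sketched below. This is a generic procedure assumed here for illustration (a grid scan over the asymptote $A$ combined with a straight-line fit in log-log coordinates), not the fitting method used to produce Figure 6; the function name and grid size are likewise assumptions.

```python
import math


def fit_power_law(ns, rts, grid_steps=400):
    """Return (A, B, c) approximately minimising squared error of rt = A + B * n**(-c).

    Scans candidate asymptotes A below the smallest observed RT; for each,
    regresses log(rt - A) on log(n) and keeps the best-fitting candidate.
    Assumes all reaction times are positive.
    """
    xs = [math.log(n) for n in ns]
    x_mean = sum(xs) / len(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    best = None
    for i in range(grid_steps):
        a = min(rts) * i / grid_steps  # candidate asymptote, always below min(rts)
        ys = [math.log(rt - a) for rt in rts]
        y_mean = sum(ys) / len(ys)
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        b = math.exp(y_mean - slope * x_mean)
        sse = sum((a + b * n ** slope - rt) ** 2 for n, rt in zip(ns, rts))
        if best is None or sse < best[0]:
            best = (sse, a, b, -slope)
    return best[1:]


if __name__ == "__main__":
    ns = [1, 2, 4, 8, 16, 32, 64]
    rts = [0.3 + 1.2 * n ** -0.5 for n in ns]  # synthetic data from a known power function
    print(fit_power_law(ns, rts))
```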


  

Figure 6: The model's performance compared with a power function. The time-axis can be scaled arbitrarily.


4.2.6 Summary

To summarize this section of the paper, I have illustrated how a simple, localist, lateral inhibitory network, fed by appropriate noisy bottom-up inputs delivered by some, presumably hierarchical, feature-extraction process, inherits considerable power from its close relationship to a number of classic mathematical models of behaviour. The network implements a Thurstonian choice-process which gives accuracy results which are indistinguishable (at asymptote) from those produced by application of the Luce choice rule to representations with exponential generalization gradients. It shows accuracy which improves with an increase in the number of training instances, equivalent in the Shepard-Luce-Nosofsky formulation to the uniform stretching of multidimensional space. With regard to reaction time, the network's RT distributions are very similar to those produced by Ratcliff's (1978) diffusion model (see Usher & McClelland 1995) and to those found in experimental data. The RTs reflect category similarities and are speeded as a power-law with practice. This simple localist model thus provides a qualitative synthesis of a large range of data and offers considerable hope that this breadth of coverage will be maintained when quantitative fitting is attempted.

To this point I have not, at times, made a clear distinction between supervised and unsupervised learning. Underlying this conflation is the belief that the mechanisms underlying the two types of behaviour largely overlap -- more particularly, that unsupervised learning is a necessary component of supervised, or association, learning. This will, I hope, become more clear later. First I shall describe qualitatively how variants of the localist model discussed above can exhibit a number of behaviours more commonly associated with distributed models, and at least one behaviour which has proved difficult to model.

4.3 Generalization, Attractor Behaviour, ``Clean-up'' and Categorical Perception

4.3.1 Generalization

It is often stated as one of the advantages of networks using distributed representations that they permit generalization, which means that they are able to deal appropriately with patterns of information they have not previously experienced by extrapolating from those patterns which they have experienced and learned. In a similar vein, such networks have been said to be robust, their performance worsening only gradually in the presence of noisy or incomplete input. Generalization and robustness are essentially the same thing: both refer to the networks' ability to deal with inputs which only partially match previous experience. One very common inference has been that networks using localist representations do not share these abilities. In this section I show that this inference is unjustified.

First, the previous sections have illustrated how one of the most resilient mathematical models of the stimulus-response generalization process can be cast in the form of a simple localist network. Put simply, in the face of a novel, or noisy, stimulus, the input signals to a layer of nodes, whose weights encode patterns of activation encountered in previous experiences, will reflect, in a graded fashion, the degree of similarity that the current input shares with each of those learned patterns. If the current pattern is not sufficiently similar to any learned pattern to evoke super-threshold input, then no generalization will be possible, but to the extent that similarities exist, the network can choose between competing classifications/responses on the basis developed above. It will be possible to vary the breadth of generalization that can be tolerated by varying the input threshold, $\theta$. Thus if no $L_{2}$ node receives super-threshold input, yet generalization is required, the threshold can simply be dropped until input, upon which a response can be based, is forthcoming.
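The role of the input threshold in controlling the breadth of generalization can be sketched in a few lines. The function names, the city-block distance, the value of $k$, the step by which $\theta$ is lowered and the floor below which it is not lowered are assumptions made for this illustration.

```python
def l2_inputs(pattern, stored, k=5.0):
    """Input to each L2 node: a constant minus the city-block distance
    between the test pattern and that node's stored weight vector."""
    return {
        label: k - sum(abs(p - w) for p, w in zip(pattern, weights))
        for label, weights in stored.items()
    }


def respond(inputs, theta):
    """Return the best-matching node's label, or None if no input exceeds theta."""
    label = max(inputs, key=inputs.get)
    return label if inputs[label] > theta else None


def respond_with_relaxation(inputs, theta, step=0.1, floor=0.0):
    """Lower the threshold step by step until some node responds (or the floor is passed)."""
    while theta >= floor:
        label = respond(inputs, theta)
        if label is not None:
            return label, theta
        theta -= step
    return None, theta


if __name__ == "__main__":
    stored = {"A": (0.0, 0.0), "B": (4.0, 4.0)}
    inputs = l2_inputs((1.0, 0.5), stored)
    print(respond(inputs, theta=4.0))                  # no super-threshold input
    print(respond_with_relaxation(inputs, theta=4.0))  # threshold dropped until A responds
```

A stimulus that is too far from every stored pattern still yields no response, even after relaxation, because the threshold is never lowered past the floor.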

Of course, the stimulus-response model described above only generalizes in the sense that it generates the most appropriate of its stock of previously learned responses on presentation of an unfamiliar stimulus. This type of generalization will not always be appropriate: imagine a localist system for mapping orthography to phonology, in which each familiar word is represented by a node which alone activates sufficiently in the presence of that word to drive a representation of that word's phonology. Would this system exhibit generalization on presentation of a novel orthographic string (i.e. a nonword)? Only in the sense that it would output the phonology of the word that best matched the unfamiliar orthographic input. This is not the sort of generalization that human readers perform in these circumstances; they are content to generate novel phonemic output in response to novel orthographic input. The localist approach to simulating this latter ability relies on the fact that in quasiregular mappings, like that between orthography and phonology, in which both the input pattern (i.e. the letter string) and the output pattern (i.e. the phonemic string) are decomposable into parts, and in which each orthographic part has a corresponding phonemic part with which it normatively corresponds, the localist model can perform generalization by input decomposition and output assembly. Specifically, although the unfamiliar orthographic string cannot activate a localist representation of the complete nonword (since by definition no such representation exists), it can lead to activation in localist units representing orthographic subparts, such as onset cluster, rime, vowel, coda etc., and each of these can in turn activate that portion of the phonological output pattern with which it is most usually associated. 
This idea of generalization by input decomposition and output assembly for nonwords, supplemented by a dominant, but not exclusive, direct route for known words is, of course, the strategy used by many localist modellers of single-word reading (Coltheart, Curtis, Atkins & Haller 1993; Norris 1994a; Zorzi, Houghton & Butterworth 1998).
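A toy sketch of generalization by input decomposition and output assembly is given below. The tiny lexicon, the onset and rime tables and the ad hoc phonemic notation are hypothetical, invented purely for illustration; the sketch also simplifies by making the lexical route exclusive for known words, and it is not a rendering of any of the cited reading models.

```python
# Hypothetical whole-word (lexical) units: orthography -> phonology.
LEXICON = {"mint": "mInt", "hint": "hInt", "pint": "paInt"}

# Hypothetical sublexical units and the output most usually associated with each.
ONSET_UNITS = {"m": "m", "h": "h", "p": "p", "z": "z", "sl": "sl"}
RIME_UNITS = {"int": "Int", "ave": "eIv"}

VOWELS = set("aeiou")


def split_onset_rime(letters):
    """Split a letter string at its first vowel into (onset, rime)."""
    for i, ch in enumerate(letters):
        if ch in VOWELS:
            return letters[:i], letters[i:]
    return letters, ""


def read_aloud(letters):
    """Known words use the direct route; other strings are assembled from parts."""
    if letters in LEXICON:
        return LEXICON[letters]
    onset, rime = split_onset_rime(letters)
    if onset in ONSET_UNITS and rime in RIME_UNITS:
        return ONSET_UNITS[onset] + RIME_UNITS[rime]
    return None  # no familiar subparts: no output is assembled


if __name__ == "__main__":
    print(read_aloud("pint"))  # exception word, read via its lexical unit
    print(read_aloud("zint"))  # nonword, assembled from onset and rime units
```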

In nonregular domains, where generalization by decomposition and assembly is not possible, the tendency of localist models either to fail to generalize or, when appropriate, to perform generalization to the best matching stock response, might be seen as a distinct advantage. (Many of the points that follow can also be found in Forster 1994, but I believe they bear repetition.) Take the mapping from orthography to semantics, or the mapping from faces to proper names: Is it appropriate to generalize when asked to name an unfamiliar face? Or when asked to give the meaning of a nonword? In a localist model of the general type developed above, the threshold for activating the localist representation of a known face or a known word can be set high enough such that no stock response is generated for such unfamiliar stimuli. When a stock response is required, such as to the question ``Which familiar person does this unfamiliar person most resemble?'', the input threshold might still be dropped, as described above, until a response is forthcoming. It is unclear whether distributed models of, for example, face naming or orthography-to-meaning mapping, particularly those with attractor networks employed to ``clean up'' their output, exhibit this sort of flexibility, rather than attaching spurious names to unfamiliar faces, spurious meanings to nonwords or spurious pronunciations to unpronounceable letter strings.

Are networks that automatically show generalization the most appropriate choice for implementing irregular mappings such as that between orthography and semantics? Forster (1994) suggests not, while McRae, de Sa and Seidenberg (1997), in rejecting Forster's pessimism, note that

``feedforward networks...can learn arbitrary mappings if provided with sufficient numbers of hidden units. Networks that are allowed to memorize a set of patterns sacrifice the ability to generalize, but this is irrelevant when the mapping between domains is arbitrary'' (p. 101).

McRae et al., however, do not show that ``sufficient numbers of hidden units'' would be significantly less than one for each word (i.e., an easily learned localist solution); and, even so, it is not clear what advantages such a distributed mapping would exhibit when compared with a localist lexical route, given that generalization is specifically not required. With regard to Forster's questions about the spurious activation of meaning by nonwords, McRae et al.'s simulations used a Hopfield-type model, with restrictions on data collection allowing them to model the learning of only 84 orthography-to-semantic mappings. A test of the resulting network, using just 10 nonwords, led to ``few or no [semantic] features'' being activated to criterion -- whether Forster would be persuaded by this rather qualified result is doubtful. Plaut et al. (1996) identify essentially the same problem in discussing their semantic route to reading. Their unimplemented solution involves semantic representations that are

``relatively sparse, meaning each word activates relatively few of the possible semantic features and each semantic feature participates in the meanings of a very small percentage of words'' (p. 105)

and they add

``...this means that semantic features would be almost completely inactive without specific evidence from the orthographic input that they should be active. Notice that the nature of this input must be very specific in order to prevent the semantic features of a word like CARE from being activated by the presentation of orthographically similar words like ARE, SCARE, CAR, and so forth'' (p. 105).

Since the mapping between orthography and semantics clearly requires an intermediate layer of mapping nodes, it might seem easiest to ensure this exquisite sensitivity to orthographic input by making these mapping nodes localist lexical representations. Of course this would mean that the second route to reading was the type of localist lexical route the authors explicitly deny. It remains to be demonstrated that a genuinely distributed mapping could exhibit the requisite properties and, again, what advantages such a scheme would enjoy over the rather straightforward localist solution.

Finally, another type of generalization is possible with localist networks, namely, generalization by weighted interpolation. In such a scheme, localist representations of various familiar items activate to a level equal to some function of the degree that they match an unfamiliar input pattern, the combined output being an activation-weighted blend of the individual output patterns associated with each familiar item. This type of generalization is most appropriate in domains in which mappings are largely regular. A similar arrangement has been postulated, using evidence derived from extensive cell recording, for the mapping between activation of motor cortical neurons and arm movements in primates (Georgopoulos, Kettner & Schwartz 1988). Classifying this so-called population coding as a type of localist representation is perhaps stretching the notion farther than necessary (cf. our earlier comments regarding orientation columns), although it really amounts to no more than acknowledging that each cell in a population will respond optimally to some (presumably familiar) direction, albeit one located in a space with continuously varying dimensions. In some cases it might even be difficult to distinguish between this weighted-output decoding of the pattern of activation across what I'll call the coding layer and an alternative decoding strategy which imagines the cells of the coding layer as a set of localist direction-nodes racing noisily to a criterion, with the winner alone driving the associated arm movement.2 A similar distinction has been explored experimentally by Salzman and Newsome (1994), who located a group of cells in rhesus monkey MT cortex, each of which responded preferentially to a given direction of motion manifested by a proportion of dots in an otherwise randomly moving dot pattern. 
The monkeys were trained on a task which required them to detect the coherent motion within such dot patterns and to indicate the direction of motion by performing an eight-alternative forced choice task. Once trained, the monkeys were presented with a pattern containing, for example, northerly movement while a group of cells with a preference for easterly movement was electrically stimulated to appropriate levels of activation. The responses of the monkeys indicated a tendency to respond with either a choice indicating north or one indicating east, rather than modally responding with a choice indicating the average direction north-east. The authors interpreted these results as being consistent with a winner-takes-all rather than a weighted-output decoding strategy. Implicit in this interpretation is the monkeys' use of a localist coding of movement direction. It is likely that both decoding strategies are used in different parts of the brain or, indeed, in different brains: The opposite result, implying a weighted output decoding strategy, has been found for those neurons in the leech-brain that are sensitive to location of touch (Lewis & Kristan 1998). More germanely, generalization by weighted output can be seen in several localist models of human and animal cognition (e.g. Kruschke 1992; Pearce 1994).
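The contrast between the two decoding strategies can be made concrete with a small sketch. The eight direction-tuned units, their preferred directions (in degrees, with 0 as east and 90 as north) and the activation values are hypothetical; this is not a model of the recordings cited above.

```python
import math

# Eight hypothetical direction-tuned units, preferred directions in degrees.
PREFERRED = [i * 45.0 for i in range(8)]


def weighted_output(activations):
    """Activation-weighted blend: the direction of the summed preferred-direction vectors."""
    x = sum(a * math.cos(math.radians(p)) for a, p in zip(activations, PREFERRED))
    y = sum(a * math.sin(math.radians(p)) for a, p in zip(activations, PREFERRED))
    return math.degrees(math.atan2(y, x)) % 360.0


def winner_takes_all(activations):
    """The preferred direction of the single most active unit."""
    winner = max(range(len(activations)), key=activations.__getitem__)
    return PREFERRED[winner]


if __name__ == "__main__":
    # 'North' unit driven by the stimulus, 'east' unit driven by stimulation.
    acts = [0.9, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(weighted_output(acts))   # roughly north-east
    print(winner_takes_all(acts))  # north
```

When two units with different preferred directions are strongly active, the weighted output lies between them, whereas the winner-takes-all decoder reports one of the two directions.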

To summarize, contrary to an often repeated but seldom justified assumption, there are (at least) three ways in which localist models can generalize: by output of the most appropriate stock response; by input decomposition and output assembly; or by activation-weighted output.

4.3.2 Attractors

 

Another much-discussed feature of networks employing distributed representations is their ability to exhibit ``attractor'' behaviour. In its most general sense (the one I shall adopt here) attractor behaviour refers to the ability of a dynamic network to relax (i.e. be ``attracted'') into one of several stable states following initial perturbation. In many content addressable memory networks, such as that popularized by Hopfield (1982; 1984), the stable states of the network correspond to previously learned patterns. Such attractor networks are often used to ``clean up'' noisy or incomplete patterns (cf. generalization). In mathematical terms, a learning algorithm ensures that learned patterns lie at the minima of some function (the Lyapunov function) of the activations and weights of the network. The activation-update rule ensures that, from any starting point, the trajectory that the network takes in activation space always involves a decrease in the value of the Lyapunov function (the network's ``energy''), thus ensuring that eventually a stable (but perhaps local) minimum point will be reached. Cohen and Grossberg (1983) describe a general Lyapunov function for content-addressable memories of a given form, of which the Hopfield network is a special case.

To see how attractor behaviour (again in its general sense) can be exhibited by a variant of the localist network described above, assume that we have a two-layer network, as before, in which the upper layer, $L_{2}$, acts as a dynamic, competitive, winner-takes-all layer, classifying patterns at the lower layer, $L_{1}$. Let us further assume that $L_{2}$ nodes project activation to a third layer, $L_{3}$, the same size as $L_{1}$, via connections whose weights are the same as those of the corresponding $L_{1}$-to-$L_{2}$ connections (see Figure 7). For simplicity, let us assume that the input threshold, $\theta$, is zero. On presentation of an input pattern at $L_{1}$, the inputs to the $L_{2}$ nodes will reflect the similarities (e.g. dot products) of each of the stored weight vectors to this input pattern. If we track the trajectory of the activation pattern at $L_{3}$ as the competition for activation at $L_{2}$ proceeds, we will find that it starts as a low-magnitude amalgam of the learned weight vectors, each weighted by its similarity to the current input pattern, and ends by being colinear with one of the learned weight vectors, with arbitrary magnitude set by the activation of the winning node. In the nondeterministic case, the $L_{3}$ pattern will finish colinear with the weight vector associated with the $L_{2}$ node receiving the largest noisy input, $I_{j}$. Thus the $L_{3}$ activation vector is attracted to one of several stable points in weight-space, each of which represents one of the learned input patterns. In the noisy case, given the results presented earlier, the probability of falling into any given attractor will be describable in terms of a Luce choice rule. This is precisely the sort of attractor behaviour we require.
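The trajectory described above can be traced in a deterministic sketch. The competitive dynamics used here (rectified leaky integrators with uniform lateral inhibition, integrated by Euler steps), the parameter values and the example weight vectors are assumptions made for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    norm = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / norm if norm > 0.0 else 0.0


def settle(inputs, steps=200, dt=0.1, inhibition=2.0):
    """Winner-takes-all competition at L2; returns the activations after every step."""
    acts = [0.0] * len(inputs)
    history = []
    for _ in range(steps):
        total = sum(acts)
        acts = [
            max(0.0, a + dt * (i - a - inhibition * (total - a)))
            for a, i in zip(acts, inputs)
        ]
        history.append(list(acts))
    return history


def l3_pattern(acts, weights):
    """L3 activation: L2 activations projected through the stored weight vectors."""
    return [sum(a * w[i] for a, w in zip(acts, weights)) for i in range(len(weights[0]))]


if __name__ == "__main__":
    weights = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    pattern = (0.9, 0.5, 0.2)
    inputs = [dot(w, pattern) for w in weights]  # theta = 0
    history = settle(inputs)
    for acts in (history[0], history[-1]):
        print(cosine(l3_pattern(acts, weights), weights[0]))
```

Early in the competition the $L_{3}$ pattern is a blend of all three stored vectors; by the end it is colinear with the vector of the node that received the largest input.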


  

Figure 7: An attractor network.


In certain cases we might allow the $L_{2}$ nodes to project back down to the nodes in $L_{1}$ rather than to a third layer, $L_{3}$. In this case (reminiscent of the ART networks referred to earlier), the activation pattern at $L_{1}$ is attracted towards one of the stable, learned patterns. This network is essentially an autoassociation network with attractor dynamics. Such an implementation has some advantages over those autoassociative attractor networks used by Hopfield and others. For instance, it should be fairly clear that the capacity of the network, as extended by competing localist representations, is, in the deterministic case, equal to the maximum number of nodes available in $L_{2}$ to learn an input pattern. In contrast with the Hopfield network, the performance of the localist network is not hindered by the existence of mixture states, or false minima, that is, minima of the energy function that do not correspond to any learned pattern. Thus localist attractor networks are not necessarily the same as their fully distributed cousins, but they are attractor networks nonetheless: whether or not a network is an attractor network is independent of whether or not it is localist.

4.3.3 Categorical Perception

Since one can view the lateral inhibitory module as performing a categorization of the $L_{1}$ activation pattern, the category being signalled by the identity of the winning $L_{2}$ node, the network can naturally model so-called categorical perception effects (see e.g. Harnad 1987). Figure 8 illustrates the characteristic sharp category-response boundary that is produced when two representations, with linear generalization gradients, compete to classify a stimulus that moves between ideal examples of each category. In essence, the treatment is similar to that of Massaro (1987), who makes a distinction between categorical perception, and ``categorical partitioning'', whereby a decision process acts on a continuous (i.e. noncategorical) percept. This distinction mirrors the one between a linear choice rule acting on representations with exponential generalization gradients and a Thurstonian choice-process acting on representations with linear generalization gradients, as seen above. The fact that Massaro describes this partitioning process in Thurstonian terms, yet models it using the Fuzzy Logic Model of Perception (Oden & Massaro 1978), serves to emphasize the strong mathematical similarities between the two approaches.


  

Figure 8: The left-hand graph shows the strength of input received by two $L_{2}$ nodes, A and B, as a stimulus moves between their learned patterns. The right-hand graph shows the probabilities of choosing either node when Gaussian noise is added to the input and the node with the largest resultant input is chosen. The steepness of the cross-over is determined by the signal-to-noise ratio.
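The sharp category boundary can be reproduced numerically with a short Python sketch. It assumes, purely for illustration, that the stimulus is parameterized by a position `x` between 0 (node A's learned pattern) and 1 (node B's learned pattern), that the generalization gradients are the linear functions `1 - x` and `x`, and that independent zero-mean Gaussian noise with standard deviation `sigma` is added to each input. The function names and the noise values are hypothetical.

```python
import math

def inputs(x):
    """Linear generalization gradients for nodes A and B at stimulus position x."""
    return 1.0 - x, x

def p_choose_a(x, sigma):
    """Probability that A's noisy input exceeds B's noisy input.

    The difference of two independent Gaussian noise terms has standard
    deviation sigma * sqrt(2), so the choice probability is a normal CDF.
    """
    i_a, i_b = inputs(x)
    z = (i_a - i_b) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Lower noise (higher signal-to-noise ratio) gives a steeper cross-over.
low_noise = [round(p_choose_a(x / 10, 0.1), 3) for x in range(11)]
high_noise = [round(p_choose_a(x / 10, 0.5), 3) for x in range(11)]
```

With these assumptions, the choice probability is exactly 0.5 at the midpoint and moves towards 0 or 1 more quickly as `sigma` decreases, in line with the role of signal-to-noise ratio described in the caption.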


4.4 Age-of-Acquisition Effects

Finally, I will briefly mention a specific effect which is potentially difficult to model in connectionist terms, namely the effect of age of acquisition. It has been demonstrated that subjects in word-naming and lexical decision experiments respond faster to words learned earlier in life (e.g. Morrison & Ellis 1995). This is independent of any effect of word frequency, with which age of acquisition is strongly negatively correlated. The potential difficulty in accounting for this effect with a connectionist model concerns how age of acquisition might plausibly be represented. Various schemes have been suggested to model frequency effects but age of acquisition appears to present a rather stiffer challenge, particularly for models which learn via an error-based learning rule and whose weights, therefore, tend to reflect the history of learning with a bias towards what has occurred most recently.

In suggesting a potential solution to this problem, I shall make three assumptions. First, word naming and lexical decision involve competitive processes. Second, word acquisition is also a competitive process. Third, there is some variation in the competability of nodes in a network, and the relative competability of a given node endures over time. Taking the last of these assumptions first, it is perhaps not too fanciful to talk of a node's intrinsic ability to compete; given such a concept, it is uncontroversial to assume that there will be some variation in the competabilities of a set of nodes. Competability might be influenced by the physical location of a node relative to its potential competitors, the breadth or strength of its lateral inhibitory connections, its ability to sustain high activations and the function relating those activations to an outgoing, lateral inhibitory signal. Given such influences it is at least possible that certain of these aspects of the node's situation endure over time, so that, for instance, a node of high competability in one time period tends to have high competability in the next. In this context we wish the time over which relative competability endures to be of the order of tens of years.

The second assumption is that the process by which a word comes to be represented in long-term memory by a given node is a competitive one. To support this assumption it is necessary to suggest some possibilities regarding the learning of localist representations in general. Earlier in this paper it was assumed that whenever learning of an entity was required -- for instance, when no committed node received super-threshold input on presentation of that entity -- an uncommitted node would come to represent that entity. How might this be achieved? Let us assume that uncommitted nodes can respond to input, whether or not the magnitude of the input they receive is above the threshold $\theta$; in other words, the threshold for uncommitted nodes is effectively zero, though the magnitude of their incoming weights may be small. Further assume that committed nodes which receive super-threshold input are able to quench significant activation at uncommitted nodes via lateral inhibitory connections. In cases where a pattern is presented under sufficiently high threshold conditions such that no committed node receives super-threshold input, numbers of uncommitted nodes will activate. If learning is presumed to be enabled under these circumstances, then let each of the uncommitted nodes adapt its incoming weights such that

\begin{displaymath}
\frac{dw_{ij}}{dt} = \lambda a_{j} (a_{i} - w_{ij}) \qquad (10)
\end{displaymath}


where $\lambda$ is a learning rate, $w_{ij}$ represents the weight from the $i^{\mathrm{th}}$ $L_{1}$ node to the $j^{\mathrm{th}}$ $L_{2}$ node, and $a_{i}$ and $a_{j}$ represent the corresponding node activations. This learning rule, almost the same as that given earlier but with the additional product term $a_{j}$, is the instar learning rule (Grossberg 1972), and it simply states that the weights incoming to a given $L_{2}$ node will change so as to become more like the current $L_{1}$ activation pattern, at a rate dependent on the activation of that $L_{2}$ node. Just as for the committed nodes, the uncommitted nodes will be subject to lateral inhibition from other uncommitted nodes, thus establishing a competition for activation, and hence a competition, via Equation 10, for representation of the current pattern. Those uncommitted nodes which, either by chance, or thanks to some earlier learning, activate relatively strongly to a given pattern, will tend to change their incoming weights faster in response to that pattern and will thus accrue more activation -- there is a positive feedback loop. Eventually, the connection weights to one of the $L_{2}$ nodes become strong enough that that node is able to suppress activation at other uncommitted nodes. At this point, that node will be deemed to be committed to its pattern, and further learning at that node will effectively be prevented. The process by which an uncommitted node competes to represent a novel pattern might be accomplished in a single presentation of a pattern (high $\lambda$, fast learning) or several presentations (low $\lambda$, slow learning).
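Equation 10 can be made concrete with a small Python sketch that applies one Euler step of the instar rule to the weights of a single $L_{2}$ node. The discretization (step size `dt`), the function name `instar_step` and the example values are assumptions introduced for illustration; the lateral inhibition among uncommitted nodes is not modelled here.

```python
def instar_step(w, a_l1, a_j, lam, dt=1.0):
    """One Euler step of dw_ij/dt = lam * a_j * (a_i - w_ij) for one L2 node j.

    w    : incoming weights of node j (one per L1 node)
    a_l1 : current L1 activation pattern
    a_j  : activation of L2 node j, which gates the rate of learning
    """
    return [w_i + dt * lam * a_j * (a_i - w_i) for w_i, a_i in zip(w, a_l1)]

pattern = [1.0, 0.5]

# Fast learning (high lambda): the weights match the pattern after one presentation.
fast = instar_step([0.0, 0.0], pattern, a_j=1.0, lam=1.0)

# Slow learning (low lambda): the weights approach the pattern over many presentations.
slow = [0.0, 0.0]
for _ in range(50):
    slow = instar_step(slow, pattern, a_j=1.0, lam=0.1)

# A more active node learns faster than a less active one, which is the
# basis of the positive feedback loop described in the text.
strong = instar_step([0.0, 0.0], pattern, a_j=0.8, lam=0.5)
weak = instar_step([0.0, 0.0], pattern, a_j=0.2, lam=0.5)
```

An inactive node (`a_j = 0`) leaves its weights unchanged, which is how the additional product term confines learning to nodes that are taking part in the competition.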

Two things are worth mentioning with regard to this learning procedure. One is that it is compatible with the generalization procedure described earlier. On a given test trial of, say, a category learning experiment, the network might have its threshold set low enough to allow committed nodes to receive input, permitting a best-guess response to be made. If the guess is later confirmed as correct, or, more importantly, when it is confirmed as incorrect, the threshold can be raised until no committed node receives super-threshold input, allowing a competition among previously uncommitted nodes to represent the current activation pattern, with that representation's then becoming associated with the correct response. This is very similar to the ARTMAP network of Carpenter, Grossberg and Reynolds (1991). The main difference in emphasis is in noting that it might be beneficial for the network to learn each new pattern (i.e. run as an exemplar model) even when its best-guess response proves correct. The second point worth mentioning is that the learning process suggested above will result in a number of nodes which come close to representing a given pattern yet ultimately fail to win the competition for representation. These nodes will be well placed to represent similar patterns in the future and may, in, say, single-cell recording studies, appear as large numbers of cells appearing to cluster (in terms of their preferred stimulus) around recently salient input patterns.

The final assumption in this account of the age-of-acquisition effects is that word-naming depends on a competitive process similar to that described above. It is absolutely in keeping with the modelling approach adopted here to assume that this is the case. Moreover, a number of recent models of the word-naming and lexical decision processes make similar assumptions regarding competition (see Grainger and Jacobs 1996, for a review and one such model).

Age-of-acquisition effects can now be seen to be a natural feature of any system which is consistent with these three assumptions. Those nodes which have a high intrinsic competability will tend to become committed to those words encountered early, since this process is a competitive one. If competability endures, then nodes that happen to represent words acquired early will have an advantage in subsequent competitions, all else being equal. If word-naming and lexical decision engage competitive processes, words acquired early will tend to be processed faster than late words, just as the age-of-acquisition effect demands. Note that nothing needs to be known about the distribution of competabilities for this account to be true. The only requirement is that there be significant variability in these node competabilities that is consistent over time.
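The logic of this argument can be laid out in a deliberately simplified, deterministic Python sketch. Everything in it is a hypothetical stand-in: competability is reduced to a single fixed number per node, acquisition is modelled by giving each new word to the most competable uncommitted node, and naming latency is taken to be inversely proportional to competability. The account in the text requires only a tendency in this direction, not the strict ordering that this toy version produces.

```python
def acquire(words_in_order, competability):
    """Commit each newly encountered word to the most competable free node."""
    free = sorted(range(len(competability)),
                  key=lambda j: competability[j], reverse=True)
    return {word: free[k] for k, word in enumerate(words_in_order)}

def naming_time(word, lexicon, competability, base=1.0):
    """Assumed latency measure: more competable nodes win the race sooner."""
    return base / competability[lexicon[word]]

# Hypothetical, enduring competabilities for four nodes.
competability = [0.4, 0.9, 0.6, 0.7]
words = ["dog", "chair", "ledger"]          # listed in order of acquisition
lexicon = acquire(words, competability)
times = [naming_time(w, lexicon, competability) for w in words]
```

In this sketch the earliest-acquired word is represented by the most competable node and therefore has the shortest latency, whatever the particular distribution of competabilities.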

   
4.5 Supervised Learning

The process of pattern compression and classification described so far is an unsupervised learning mechanism. This unsupervised process effectively partitions the space of input patterns into distinct regions on the basis of pattern similarities. By contrast, supervised learning involves the learning of pattern associations, this term extending to a wide variety of tasks including stimulus-response learning, pattern labelling and binary (e.g. yes/no) or multiway decision-making. Pattern association is, of course, the domain of application of the most common of the PDP networks, namely, the multilayer perceptron trained by backpropagation of error (henceforth abbreviated as BP network; Rumelhart, Hinton & Williams 1986). In the framework developed here, supervised learning is a simple extension of the unsupervised classification learning described previously. Essentially, once two patterns that are to be associated have been compressed sufficiently so that each is represented by the super-threshold activation of a single, high-level node, then the association of those two nodes can proceed by, for example, simple Hebbian learning. (Indeed, one might even view supervised learning as autoassociative learning of the amalgam of the compressed, to-be-associated patterns, permitting subsequent pattern-completing attractor behaviour.) Geometrically speaking, the classification process orthogonalizes each of the patterns of a pair with reference to the other patterns in the training set, the subsequent association between those patterns being trivially accomplished without interfering with previously acquired associations. The general scheme is shown in Figure 9 and is functionally almost identical to the ARTMAP network developed by Carpenter, Grossberg and Reynolds (1991), as well as to many other networks (e.g. Hecht-Nielsen 1987; Burton, Bruce & Johnston 1990; McLaren 1993; Murre 1992; Murre et al. 1992). It is also functionally equivalent to a noisy version of the nearest-neighbour classification algorithm used in the machine-learning community, and structurally equivalent to more general psychological models including the category learning models described in previous sections, and other models such as that proposed by Bower (1996).


  

Figure 9: A generic network for supervised learning.


The operation of the network is simple. When the activation of a given node in one of the competitive layers, $L_{2A}$, hits a race-winning criterion, $\chi$, it can excite one of the nodes in the mapping layer (see Figure 9). (I assume that once a node in a given layer hits its criterion, other nodes in the layer are prevented from doing so by, for instance, a broadly applied inhibitory signal, or by raising of the criterion.) Assuming that a similar process occurs in the other competitive layer, $L_{2B}$, the active map node can then be associated with the $L_{2B}$ winner by simple Hebbian learning. On subsequent presentation of one of the associates, driving, say, $L_{2A}$, a given classification node will reach criterion, and will activate its map-layer node, in turn activating the $L_{2B}$ node corresponding to its associate. This would allow a relevant response to be made (perhaps by top-down projections from $L_{2B}$ to $L_{1B}$). The division between the two halves of the network is appropriate when considering cross-modal associations, but will not be so clearly appropriate when associations are required between two items within a modality, for example, between two visually presented words. In this case, processes of selective attention might be employed, so as to classify one word and then the other; they will generally be competitors (cf. Kastner, De Weerd, Desimone & Ungerleider 1998) and hence cannot both win a given race to criterion at the same time. The identity of the first word can be stored by sustained activation at the map layer, while attention is transferred to recognition of the second word. When recognition of the second word is accomplished, associative learning can proceed as before. Alternatively, one might propose a scheme whereby $L_{2}$ nodes responding to different objects currently present in the world might be permitted to coactivate to criterion (i.e. not compete; see the earlier discussion of binding), on the basis that they are grouped (or ``streamed'') separately, with subsequent association being achieved, as before, via the mapping layer.
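The flow of classification and association through the mapping layer can be sketched in Python as follows. This is a minimal, exemplar-style illustration rather than an implementation of ARTMAP or of the network in Figure 9: the class name `LocalistAssociator`, the use of negative squared Euclidean distance as the similarity measure, the omission of noise, thresholds and the criterion $\chi$, and the dictionary standing in for the map nodes are all simplifying assumptions.

```python
def similarity(w, p):
    """Assumed similarity: negative squared Euclidean distance to stored weights."""
    return -sum((x - y) ** 2 for x, y in zip(w, p))

class LocalistAssociator:
    def __init__(self):
        self.l2a = []       # weight vectors, one per committed L2A node
        self.l2b = []       # weight vectors, one per committed L2B node
        self.map_to_b = {}  # L2A node index -> L2B node index (stands in for a map node)

    @staticmethod
    def _winner(layer, pattern):
        """Index of the node that would win the competition for this pattern."""
        return max(range(len(layer)), key=lambda j: similarity(layer[j], pattern))

    @staticmethod
    def _commit(layer, pattern):
        """Reuse the node already storing this pattern, or commit a new node."""
        pattern = list(pattern)
        if pattern not in layer:
            layer.append(pattern)
        return layer.index(pattern)

    def learn(self, pattern_a, pattern_b):
        """One-pass association of the two winners via the mapping layer."""
        a = self._commit(self.l2a, pattern_a)
        b = self._commit(self.l2b, pattern_b)
        self.map_to_b[a] = b

    def recall(self, pattern_a):
        """Classify at L2A, follow the map link, read out the L2B node's pattern."""
        a = self._winner(self.l2a, pattern_a)
        return self.l2b[self.map_to_b[a]]

net = LocalistAssociator()
net.learn([1.0, 0.0, 0.2], [0, 1])
net.learn([0.0, 1.0, 0.1], [1, 0])
first = net.recall([0.9, 0.1, 0.2])      # a degraded version of the first pattern
net.learn([0.0, 0.0, 1.0], [1, 1])       # a later addition to the training set
second = net.recall([0.9, 0.1, 0.2])     # the earlier association is undisturbed
```

Because each association is carried by its own pair of nodes, adding the third pair leaves recall of the first unchanged in this sketch.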

The provision of the mapping nodes allows a good deal of flexibility in the associations that can be made. The mapping layer can be configured to allow one-to-one, many-to-one or one-to-many mappings. Moreover, under certain circumstances, in particular when at least one of the associates is learned under low-vigilance (cf. prototype) conditions, remapping of items to alternative associates can be quickly achieved by rapid reconfiguration of connections to and from the mapping layer. The low-vigilance requirement simply acknowledges that flexible remapping of this kind will be difficult to achieve under conditions in which both ``sides'' of a given set of associations are exhaustively exemplar coded, that is, when each association-learning trial engages two new exemplars (previously uncommitted nodes) linked via a newly established mapping-layer connection.

Such a scheme for supervised learning of pattern associations enjoys a number of advantages over alternative schemes employing distributed representations throughout, such as the BP network.

1. The learning rate can be set to whatever value is deemed appropriate; it can even be set so as to perform fast, one-pass learning of sets of pattern associations. The BP algorithm does not allow fast learning: learning must be incremental and iterative, with the learning-rate set slow enough to avoid instabilities. The learning time for backpropagation thus scales very poorly with the size of the training set. By contrast, for the localist model, total learning-time scales linearly with the size of any given training set, with subsequent piecewise additions to that training set posing no additional problem.
2. Localist supervised learning is an ``on-line'' process and is self-organizing, with the degree of learning modulated solely by the variation of global parameter settings for vigilance and learning rate. Typical applications of BP networks require off-line learning with distinct learning sets and separate learning and performance phases (see later).
3. The localist model is, in Grossberg's (1987) terms, both stable and plastic, whereas BP nets are not, exhibiting catastrophic interference in anything resembling ``realistic'' circumstances (see later).
4. Knowledge can be learned by the localist network in a piecemeal fashion. For instance, it can learn to recognize a particular face and, quite separately, a name, subsequently allowing a fast association to be made between the two when it transpires that they are aspects of the same person. BP nets do not enjoy this facility -- they cannot begin the slow process of face-name association until both face and name are presented together.
5. The behaviour of localist nets is easy to explain and interpret. The role of the ``hidden'' units is essentially to orthogonalize the to-be-associated patterns, thus allowing enduring associations to be made between them. There is none of the murkiness that surrounds the role of hidden units in BP nets performing a particular mapping.

Importantly, these advantages are enjoyed without sacrificing any power to perform complex mappings that are not linearly separable (e.g. XOR, see Figure 10), or the ability to generalize (see earlier). The question arises as to why, given these advantages, there has been resistance to using localist models. This question will be addressed in the next section.


  

Figure 10: A network which performs the XOR mapping.
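That a localist scheme can perform a mapping that is not linearly separable is easy to check with a short Python sketch. It is not necessarily the architecture drawn in Figure 10: it simply assumes one $L_{2}$ node per training pattern, a nearest-match rule based on city-block distance, and a direct link from each $L_{2}$ node to its output value. All names are hypothetical.

```python
XOR_PAIRS = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def build_xor_net():
    """One L2 node per training pattern, each linked to a single output value."""
    weights = [list(p) for p in XOR_PAIRS]          # L1-to-L2 weight vectors
    outputs = [XOR_PAIRS[p] for p in XOR_PAIRS]     # L2-to-output associations
    return weights, outputs

def respond(net, pattern):
    """The L2 node whose weights lie closest to the input wins and drives the output."""
    weights, outputs = net
    winner = min(range(len(weights)),
                 key=lambda j: sum(abs(w - x) for w, x in zip(weights[j], pattern)))
    return outputs[winner]

net = build_xor_net()
responses = {p: respond(net, p) for p in XOR_PAIRS}
noisy_response = respond(net, (0.9, 0.2))   # generalizes to the nearest learned pattern
```

The four $L_{2}$ nodes orthogonalize the input patterns, so the XOR outputs can be attached to them independently, and a distorted input is classified with its nearest learned neighbour.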


5. Some Localist Models in Psychology...

In extolling the virtues of localist connectionist models in psychology, I have occasionally encountered the belief that such models are not really connectionist models at all, this title being reserved for ``real'' connectionist models, such as those employing the backpropagation (BP) learning rule. Indeed, in some quarters it seems as if connectionist modelling and application of the backpropagation learning rule to fully distributed networks are seen as equivalent. I assume that this attitude stems from the great popularity of networks such as the BP network after the release of the PDP volumes, with accompanying simulation software, in the mid 1980s. Nevertheless, as mentioned earlier, several of the networks discussed in those volumes were localist. This suggests that bias against using localist models, or even against seeing them as connectionist at all, is not based solely on the wide availability of alternative approaches, but also on the assumption that localist models are less capable, or less ``plausible'', than these alternatives. I do not believe either of these is well-founded.

Before addressing this issue further it is worth noting that many successful models in psychology are either localist connectionist models, or, in the light of the preceding discussion, can be readily implemented as such. I do not wish to (and could not) give a complete roll-call of such models here, but, in the areas in which I have a particular interest, these include: Burton, Bruce and Johnston's (1990) model of face perception; Estes' (1986) array model of category learning, and Estes' (1972) model of ordered recall (though not necessarily Lee & Estes' 1981, later developments of it); Morton's (1969) logogen model and its variants; Nosofsky's (1986) generalized category model and the mathematical equivalents described above; Kruschke's (1992) ALCOVE model of attentional category learning; Pearce's (1994) configural model of conditioning; Hintzman's (1986) MINERVA model; models of speech production by Levelt (1989), Dell (1986; 1988) and Hartley and Houghton (1996); Norris' (1994a) model of reading aloud and his SHORTLIST model of spoken word segmentation (Norris 1994b); the DRC model of Coltheart, Curtis, Atkins and Haller (1993); the TRACE model of word recognition (McClelland & Elman 1986); Usher and McClelland's (1995) model of the time course of perceptual choice; the models of immediate serial recall by Burgess & Hitch (1992; in press) and Page & Norris (1998); and other models of serial recall by Houghton (1990), Nigrin (1993) and Page (1993; 1994); Pickering's (1997) and Gluck and Myers' (1997) models of the hippocampus; Shastri and Ajjanagadde's (1993) model of reasoning; Hummel and Biederman's (1992) model of object recognition and Hummel and Holyoak's (1997) model of analogy processing; Grainger and Jacobs' (1996) model of orthographic processing; Bower's (1996) model of implicit memory; and those models described in Grainger and Jacobs (1998). Furthermore, not only is the list of distinguished localist models a long one, but in cases where localist and fully distributed approaches have been directly compared with reference to their ability to explain data, the localist models have often proved superior (e.g. Coltheart et al. 1993; López, Shanks, Almaraz & Fernández 1998).

I should state that not all of these models have stressed their equivalence with localist connectionist models. Indeed, it has become common, in the concluding sections of papers which describe localist models, to apologize for the lack of ``distributedness'' and to speculate that the same performance could be elicited from a more distributed model. In an attempt to account for this occasional hesitance, I will try, in the next section, to address some of the concerns most commonly voiced in relation to localist models.

First, however, it is worth pausing briefly to ask why some researchers have preferred a localist approach to modelling. I shall take as an example Jacobs and Grainger (1994; and in Grainger & Jacobs 1996; 1998), who have been most explicit in their justification of a research programme based, in their case, on the localist interactive activation (IA) model (McClelland & Rumelhart 1981; Rumelhart and McClelland 1982). They see IA as a canonical model, a starting point representing

``the simplest model within a given framework that fairly characterizes the qualitative behavior of other models that share its design and system principles with respect to the data at hand'' (p.519).

Despite the simplicity of the underlying model, they have been able to provide detailed simulations of accuracy and reaction time measures from a variety of orthographically driven tasks, contradicting earlier pessimism (e.g. McClelland 1993) about whether reaction time measures would be susceptible to accurate simulation by such networks (Grainger and Jacobs 1996). They have further identified the IA model as particularly appropriate to a strategy of nested modelling in which, when the model is applied to a new set of data (in their case data concerning aspects of orthographic processing in visual word recognition), it retains its ability to simulate data sets to which it was earlier applied. The flexibility of the IA model in this regard (as well as with regard to the modelling of functional overlap and scalability -- Grainger & Jacobs 1996; 1998), is largely attributable to the technical advantages of localist modelling discussed in Section 4.5, thus highlighting an important interaction between the choice of model-type and the scientific methodology that is adopted in applying that model. As Jacobs and Grainger (1994) pointed out, not all developments in connectionist modelling have respected this constraint on backwards compatibility. For example, they cite the failure of Seidenberg and McClelland's (1989) model of reading to account explicitly for the word superiority effect, the simulation of which had been a staple target of the previous generation of models in the area. Although it is possible that networks using thoroughgoing distributed representation will be shown to be capable of flexible, scalable, nested modelling of functionally overlapping systems, this has not yet been so clearly demonstrated as it has been for the localist competitors to such models.

6. Why Might Some People Be Reluctant To Use Localist Models in Psychology?

This section covers, in a little more detail, many of the issues raised (and countered) by Thorpe (1995) in relation to common arguments used against localist models.

6.1 ``Distributed representation is a general principle''

Perhaps the most fundamental reason for choosing a fully distributed modelling approach over a localist one would be the belief that distributed representation is simply a general principle on which the enterprise of connectionist modelling is founded. Such a view was clearly stated by Seidenberg (1993) who gave as the first of his general connectionist principles

``Knowledge representations are distributed [distributed representations of orthography and phonology]'' (p.231)

the bracketed comment referring to the way in which this principle was realized in the Seidenberg and McClelland (1989) model of reading. This enshrinement of distributed representations (assuming it is intended to imply a rejection of localist representation) is not only historically inaccurate -- thoroughgoing distributed representation never having been a necessary feature of a connectionist model -- but it is also rather ironic. The irony stems from the fact that in the later, improved version of the reading model (Plaut et al. 1996), orthography and phonology (though not the lexicon) were represented locally, as indicated previously.

6.2 ``They don't generalize and/or are not efficient''

As noted above, the fact that fully distributed networks can generalize is sometimes taken to imply that localist networks cannot. I hope I have shown above that this is not the case. The wider issue of generalization is discussed in detail in Hinton, McClelland and Rumelhart (1986), in the section entitled ``Virtues of Distributed Representations.'' It is interesting to note that the introduction to this section states that ``Several of these virtues are shared by certain local models, such as the interactive activation model of word perception, or McClelland's (1981) model of generalization and retrieval''. The virtue of generalization is not confined to fully distributed models.

The chief virtue that Hinton, McClelland and Rumelhart (1986) attribute to fully distributed networks, but deny to localist networks, is that of efficiency. They conclude that certain mappings can be achieved, using fully distributed networks, with far fewer hidden units than are used by the corresponding localist network. This is true and, in this restricted sense, the distributed networks are more efficient. The following three points are noteworthy, however.

1. This notion of efficiency will count for nothing if the process by which the mapping must be learned is not only inefficient but also rather implausible. This point relates both to the disadvantages of ``distributed learning'' raised above, and to the later discussion of catastrophic interference.
2. The localist solution enjoys advantages over the distributed solution quite apart from its ability to perform the mapping. These relate to the comprehensibility of localist models and the manipulability of localist representations and will be discussed later.
3. More generally, efficiency in modelling, particularly when arbitrarily defined, is not necessarily an aim in itself. A lexicon of 100000 words could be represented by the distinct states of a 17-bit binary vector -- very efficient but not very plausible as a psychological model. In terms of modelling neural function, it is at least conceivable that the brain has arrived at computationally effective but representationally ``inefficient'' solutions to certain problems.
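The 17-bit figure in the third point follows from simple counting, as the following Python lines confirm. The variable names are introduced here for illustration only.

```python
lexicon_size = 100_000

# Smallest number of binary units whose distinct states can label every word.
bits_needed = (lexicon_size - 1).bit_length()

# A strictly localist code, by comparison, uses one node per word.
localist_nodes = lexicon_size

states_with_16_bits = 2 ** 16   # 65536, too few to label 100000 words
states_with_17_bits = 2 ** 17   # 131072, enough
```

The contrast is between 17 units and 100000 units, which is the restricted sense of efficiency at issue in this section.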

6.3 ``They do not degrade gracefully''

Another advantage often claimed for fully distributed networks is that they continue to perform well after damage, usually considered as loss of nodes or weights. This quality is sometimes termed ``graceful degradation''; similar effects are usually tacitly assumed to occur in real brains in response to damage. By contrast, it is implied, localist models do not degrade gracefully, since loss of a given node will render its referent unrepresented. This is true, but only in a strictly limited sense. First, it should be repeated that localist models use distributed/featural representations at ``lower'' levels -- the network will degrade gracefully in response to any loss of nodes at these levels, just as it is able to generalize to new or incomplete input. Second, localist models do not preclude redundancy. There may be many nodes that locally represent a given entity -- indeed, in the exemplar models discussed above, this is very likely to be the case. Thus, loss of a given node will not necessarily leave its associated entity unrepresented (although in the model developed earlier reaction time will increase and accuracy will diminish). By way of example (expanded slightly from Feldman 1988), suppose the brain has $10^{11}$ neurons and that these are being lost at a rate of $10^5$ per day; the chance of losing a given cell in a 70-year period is then approximately 0.03. If we assume a small amount of redundancy in representation, say, 5 cells per entity, then the probability of leaving a given entity unrepresented in the same period is, assuming independence, of the order of $10^{-8}$. I would guess that this is somewhat less than the probability of losing one's entire head in the same period; hence it would not seem an unreasonable risk. In this regard, it is important to note that a number of independent localist representations do not amount to a distributed representation.
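The arithmetic behind this example can be checked directly. The Python lines below use the figures quoted in the text; the assumption of 365 days per year, the treatment of cell loss as a simple proportion, and the variable names are additions made for the purpose of the calculation.

```python
neurons = 1e11            # assumed total number of neurons
lost_per_day = 1e5        # assumed daily loss
days = 70 * 365           # a 70-year period, ignoring leap years
cells_per_entity = 5      # assumed redundancy

# Chance that a given cell is among those lost over the period.
p_cell_lost = lost_per_day * days / neurons

# Chance that all redundant cells for one entity are lost, assuming independence.
p_entity_lost = p_cell_lost ** cells_per_entity

# The same calculation with the rounded figure of 0.03 quoted in the text.
p_entity_lost_rounded = 0.03 ** cells_per_entity
```

The per-cell figure comes to roughly 0.026, and both versions of the final probability are of the order of $10^{-8}$.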

It is worth asking whether humans ever seem to have lost their ability to represent highly specific entities (presumably via focal damage rather than by gradual wastage). Howard (1995) describes an aphasic patient who appears to have lost ``specific lexical items from a phonological lexicon for speech production'' (though see Lambon-Ralph, 1998, for an alternative view, albeit of a different patient). A particularly interesting feature of these data is that the naming accuracy for given words is ``not demonstrably related to the availability of either their phonological or their semantic neighbours''. While it is unwise to claim that this pattern of results could never be modelled with a fully distributed system, it is certainly more suggestive of a system based on locally represented lexical entries.

6.4 ``There are not enough neurons in the brain and/or they are too noisy''

Any assertion to the effect that there are too few neurons in the brain to permit localist representations presumes answers to two questions: How many neurons/functional units are there? And how many are needed? Assuming that the answer to the second question does not err in requiring the brain locally to represent all possible percepts rather than some actual percepts (an error analogous to requiring a library to have sufficient capacity to store the vast number of possible books as opposed to the comparatively minuscule number of actual books), then perhaps speculating about insufficient capacity underestimates the answer to the first question. Most estimates put the number of cells in the brain at around $10^{11}$. Mountcastle (1997) estimates that the number of cells in the neocortex alone is approximately $3 \times 10^{10}$. Even if one considers the number of cortical minicolumns rather than cells, the number is in the vicinity of $5 \times 10^8$. Similarly, Rolls (1989) cites a figure of $6 \times 10^6$ cells in area CA1 of the hippocampus, an area he proposes is responsible for the storage of episodic memories. These are large numbers and they seem to place the burden of proof on those who wish to claim that they are not large enough to allow successful local coding. Furthermore, proponents of a distributed approach would presumably have to allocate not just a node, but rather a whole attractor to each familiar item in memory. Since in most nonlocalist attractor networks the limit on the number of distinct attractor basins is smaller than the number of nodes, it is not clear what is gained in potential capacity by moving from a local to a distributed coding scheme.

With regard to the assertion that neurons (not to mention nodes) might be too noisy to allow small numbers of them to perform significant coding, I follow Thorpe (1995) in citing Newsome, Britten and Movshon (1989) and hence Britten, Shadlen, Newsome and Movshon (1992), who measured the activity of relevant MT cortex neurons while a monkey performed a psychophysical discrimination task. They found that the ``performance'' of certain individual neurons, assessed by placing a discriminant threshold on their activity, was just as good as the performance of the monkey. In other words, the monkey had no more information than could be derived from the activity of single cells. Barlow (1995; and in his seminal paper Barlow 1972) makes similar points and reviews other evidence regarding the sensitivity and reliability of single neurons.

   
6.5 ``No one has ever found a grandmother cell''

The final complaint against localist representations, again taken from Thorpe (1995), concerns whether such representations have ever been found in real brains. I hardly need point out that the assertion in the heading is poorly worded, in that not having found a grandmother cell is not necessarily the same as not finding a localist representation, depending on how one chooses to define the former. Apart from this, the whole question of what would constitute evidence for, or more particularly against, localist representation seems to have become extremely confused. A review of the neuroscientific literature reveals that much of this confusion comes from poor use of terms and model nonspecificity. This review has necessarily been rather cursory, and space restrictions require even more cursory reporting in what follows.

6.5.1 Interpreting Cell Recordings

First, in one sense, the assertion in heading 6.5, even as worded, is not necessarily true. Figure 11 shows a finding of Young and Yamane (1993), who measured the responses of various cells in the anterior inferotemporal gyrus and the superior temporal polysensory area to images of the disembodied heads (!) of Japanese men in full face. The figure shows responses of one of the AIT cells which responded extraordinarily selectively to only one of the 20 faces. This was the only one of the 850 studied cells to respond in this highly selective manner. Nevertheless, the finding is interesting, since this cell is not just a localist representation, but apparently a grandmother cell (or rather a ``particular-Japanese-man cell''). Young and Yamane state, quite correctly, that they cannot conclude that this cell responds to only one stimulus, since only a small number of stimuli (albeit with, to my eye, a high interstimulus similarity) were presented. But this proviso cannot conceal the fact that no better evidence could have been found in this experiment for the existence of at least one localist representation sufficiently tightly focussed to be termed a grandmother cell. Of course one might claim that better evidence for grandmother-cell representation in general would have been provided if all 850 cells had responded above baseline for one and only one of the faces. This is true, but such a finding would be enormously unlikely, even if each of the 20 individuals was represented in this extreme manner. Purely as an illustration, suppose that 100000 cells in the relevant brain region were dedicated to face representation, with 5 extreme grandmother-cells dedicated to each of the 20 stimulus subjects. This would imply a probability of one in a thousand of discovering such a cell on a given recording trial -- approximately the probability with which such a cell was in fact found. 
I do not wish this illustration to be interpreted as indicating my belief in extreme grandmother-cell representation. That is not necessary to my more general defence of localist representation. I simply intend to urge caution in the interpretation of cell-recording data.
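
The arithmetic behind this illustration can be checked in a few lines. The sketch below uses only the illustrative figures given above; treating each recorded cell as an independent draw from the region is an added simplifying assumption, not a claim about the recording procedure.

```python
# Illustrative figures from the text above; independence of recordings is an
# added simplifying assumption.
cells_in_region = 100_000   # cells assumed dedicated to face representation
cells_per_face = 5          # extreme grandmother cells per individual
faces_shown = 20            # size of the stimulus set
cells_recorded = 850        # cells studied in the experiment

# Chance that any one recorded cell is tuned to one of the 20 faces shown.
p_hit = cells_per_face * faces_shown / cells_in_region

# Expected number of such cells in the sample, and the chance of meeting at
# least one, if each recording is treated as an independent draw.
expected_hits = cells_recorded * p_hit
p_at_least_one = 1 - (1 - p_hit) ** cells_recorded

print(f"p_hit = {p_hit:.4f}")
print(f"expected hits in {cells_recorded} cells = {expected_hits:.2f}")
print(f"P(at least one hit) = {p_at_least_one:.2f}")
```

Under these assumptions, fewer than one such cell is expected in a sample of 850, and the chance of meeting at least one is a little over one half, so a single highly selective cell is about what the illustration would lead one to expect.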


  

Figure 11: Data taken from Young & Yamane (1993), showing the response of a single cell in the inferotemporal cortex of a macaque monkey to a number of face stimuli. Spiking rates are measured relative to baseline response.


The previous paragraph highlights one aspect of a more general problem with the antilocalist interpretations that have been put on some single-cell-recording studies. This involves misconceptions about what to expect if one measured cell responses in a localist network. There seems to be a widespread tendency to assume that if a number of cells activate for several hundred milliseconds following the presentation of any given stimulus, with different degrees of activation for different stimuli, then this speaks against the idea of a localist representation. It does nothing of the sort, although this fact is often obscured in passages such as the following:

``Even the most selective face cells discharge to a variety of individual faces and usually also discharge, although to a lesser degree, to other stimuli as well. Thus, faces are presumably coded in exactly the same way as everything else, namely, by the firing pattern of ensembles of cells with varying selectivity rather than of individual cells acting as complex feature detectors.'' (Gross 1992, p.6)

``...neurons responsive to faces exhibited systematically graded responses with respect to the face stimuli. Hence each cell would systematically participate in the representation of many faces, which straightforwardly implies a population code.'' (Young & Yamane 1992, p.1330).

Statements such as these are widespread and often used to argue against localist coding. What such arguments seem to miss, however, is the potential compatibility between distributed processing and localist representation discussed earlier. (They also often miss the compatibility between distributed representation at one level and localist representation at another, but I shall not dwell on that here.) Thinking back to the localist competitive network described earlier, a broad degree of activation (i.e. activation across a potentially large number of competing nodes, particularly if the input threshold, $\theta$, is low) would be expected in response to any given stimulus, even if only one unit were eventually to reach criterion, $\chi$, and/or win a competition for sustained activation. The broad pattern of activation would be different for different stimuli, just as described in the passages quoted above (and in the earlier discussion on sparse distributed representations). That grandmother cells (let alone localist representations) would be ``signaling only one face and responding randomly to others'' (Young & Yamane 1992, p. 1329, my emphasis) is not what would be expected on the basis of any workable localist model. In summary, even if we ignore the occasional tightly tuned cell, the finding of broadly distributed (though often transient) response to a stimulus does not rule out localist representation; indeed it is fully consistent with it.

A similar argument applies to the measurement of the informational content of particular neural firing responses performed by, for instance, Rolls, Critchley and Treves (1996). Among other things, they show that on presentation of a variety of stimuli, the response of a given neuron will convey a lot of information about the identity of an individual stimulus if its firing rate for that stimulus is unusually high or unusually low relative to its responses to the other stimuli. This is perfectly consistent with a localist coding. Suppose there exists a person-A node in the sort of localist network described earlier. Suppose one then presents eight persons to the network for identification, such that most of these persons share some of the features of person-A, only one (person-A herself) shares all of those features, and one person, say person-H, is unusual in sharing no features whatsoever with person-A (e.g. he looks nothing like person-A). On presentation of each of persons A-H, therefore, the person-A node will fire particularly strongly (superthreshold) to person A, and particularly weakly to person-H, with intermediate responses to the other stimuli. Thus, the response to person-H will contain plenty of information (i.e. ``this person looks nothing like person-A''), without any suggestion that the information it contains is of active benefit to the system in its identification task. In this situation it might also be found that the information contained in the firing of a given neuron is low when averaged across stimuli (as has been found experimentally), since this average is dominated by intermediate responses to many stimuli.
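
The person-A example can be made concrete with a toy calculation. In the sketch below the feature sets, the overlap-based activation rule and the value of `CRITERION` are all hypothetical choices made for illustration; the only point is that a single localist node gives a graded response across stimuli while reaching criterion for just one of them.

```python
# Hypothetical feature sets; none of these values come from the studies cited.
PERSON_A_FEATURES = {"a1", "a2", "a3", "a4", "a5", "a6"}

FACES = {
    "A": {"a1", "a2", "a3", "a4", "a5", "a6"},  # shares every feature
    "B": {"a1", "a2", "a3", "a4", "b1", "b2"},
    "C": {"a1", "a2", "a3", "c1", "c2", "c3"},
    "D": {"a2", "a4", "a6", "d1", "d2", "d3"},
    "E": {"a1", "a5", "e1", "e2", "e3", "e4"},
    "F": {"a3", "a6", "f1", "f2", "f3", "f4"},
    "G": {"a4", "g1", "g2", "g3", "g4", "g5"},
    "H": {"h1", "h2", "h3", "h4", "h5", "h6"},  # shares no feature
}

CRITERION = 0.9  # assumed level the node must reach to signal "this is person-A"


def person_a_response(face_features):
    """Activation of the person-A node: proportion of its features present."""
    return len(face_features & PERSON_A_FEATURES) / len(PERSON_A_FEATURES)


responses = {name: person_a_response(f) for name, f in FACES.items()}
above_criterion = [name for name, r in responses.items() if r >= CRITERION]

for name, r in responses.items():
    print(f"person-{name}: response {r:.2f}")
print("reaches criterion:", above_criterion)
```

The node's weakest response (to person-H) is as distinctive as its strongest, yet only the response to person-A crosses criterion, which is the pattern the argument above relies on.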

An abiding problem has been that terms such as localist, distributed, population coding, ensemble coding, etc., have been used without considering the range of models to which they might refer. This has led to interpreting data as supporting or refuting certain types of model without due consideration of the predictions of specific instantiations of each type of model. In many cases, researchers have concluded that some sort of ``population coding'' exists, but have failed to specify how such population coding operates so as to allow relevant tasks to be performed. For example, it is easy to hypothesize that colour and shape are each population coded, but how does this permit the learning of one response to a green triangle or a red square and another response to a red triangle or a green square, analogous to the classic XOR problem that is a staple of connectionist modelling? Again echoing an earlier point, how does one recall that it was a yellow Volkswagen that you witnessed speeding away from the scene of a bank raid? Simply positing population coding is not enough if there is no semipermanent way to tie the individual components of a percept together so as to form a unitized memory.
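
The colour-and-shape example can be checked directly. The sketch below is a minimal illustration under stated assumptions: colour and shape are each given a hypothetical two-unit code, a single linear threshold unit stands in for the response mechanism, and the four conjunction units stand in for a localist layer. The grid search is only a demonstration; the underlying reason is that the summed input for the two stimuli in one response class always equals the summed input for the two stimuli in the other.

```python
from itertools import product

# Separate codes: (green, red) for colour followed by (triangle, square) for shape.
SEPARATE = {
    ("green", "triangle"): (1, 0, 1, 0),
    ("red", "square"): (0, 1, 0, 1),
    ("red", "triangle"): (0, 1, 1, 0),
    ("green", "square"): (1, 0, 0, 1),
}

# One response to green triangle or red square, another to the remaining two.
TARGETS = {
    ("green", "triangle"): 1,
    ("red", "square"): 1,
    ("red", "triangle"): 0,
    ("green", "square"): 0,
}


def separate_code(stimulus):
    return SEPARATE[stimulus]


def conjunctive_code(stimulus):
    """One dedicated (localist) unit per colour-shape conjunction."""
    return tuple(int(stimulus == s) for s in SEPARATE)


def solves(weights, threshold, encode):
    """True if one linear threshold unit gives the target response to all stimuli."""
    for stimulus, target in TARGETS.items():
        net = sum(w * x for w, x in zip(weights, encode(stimulus)))
        if int(net > threshold) != target:
            return False
    return True


grid = [v / 2 for v in range(-4, 5)]  # candidate values from -2.0 to 2.0

found_with_separate_codes = any(
    solves(weights, threshold, separate_code)
    for weights in product(grid, repeat=4)
    for threshold in grid
)
found_with_conjunctions = solves((1, 1, 0, 0), 0.5, conjunctive_code)

print("solved with separate colour and shape codes:", found_with_separate_codes)
print("solved with conjunction units:", found_with_conjunctions)
```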

One answer to these questions, given by Rolls (1989), illustrates clearly one of the terminological problems which further confuse the literature. In describing the role of the hippocampus in episodic memory, Rolls describes a hierarchical system, culminating in area CA1 of the hippocampus, thus:

``It is suggested that the CA1 cells, which receive these groups of simultaneously active ensembles, can detect the correlations of firing which represent episodic memory. The episodic memory in the CA3 cells would thus consist of groups of active cells, each representing one of the subcomponents of the episodic memory (including context), whereas the whole episodic memory would be represented not by its parts, but as a single collection of active cells, at the CA1 stage.'' (Rolls 1989, p.299)

This conclusion is supported by a wealth of data and bolstered by references to the localist connectionist literature (e.g. Grossberg 1982; 1987). Yet when it comes to the paper's conclusion, we have:

``Information is represented in neuronal networks in the brain in a distributed manner in which the tuning of neurons is nevertheless not very coarse, as noted for the above hippocampal neurons...'' (p.305)

This isolated statement is, on one level, true, but it completely deemphasizes the localist character of the conclusions reached throughout the paper. The target article is one attempt to clarify such issues of model taxonomy.

6.5.2 Evidence Consistent with Localist Cortical Coding

Positive evidence of localist coding of associations in cortex, as opposed to hippocampal structures, has been provided by Miyashita and colleagues (e.g. Higuchi & Miyashita 1996; Miyashita 1993; Sakai & Miyashita 1991; Sakai, Naya & Miyashita 1994). Sakai and Miyashita trained monkeys in a paired-association task, using 12 pairs of computer-generated patterns. They found task-related neurons in anterior inferotemporal (AIT) cortex which responded strongly to one or the other of the patterns in a given associated pair but weakly to any of the 22 other patterns. This occurred in spite of the fact that at least some of the other patterns were quite similar to one or other of the associated pair, with the two paired patterns being at least as distinct from each other as from the remainder of the set. In a further study, Higuchi and Miyashita showed that lesioning the entorhinal and perirhinal cortex caused a loss both in the knowledge of those associations learned prelesion, and in the postlesion ability of the monkeys to learn new paired associations. The lesion had no effect on the response selectivity of cells in AIT cortex to single images from the 24-image set. The authors speculated that projections up to and back from perirhinal and entorhinal cortex permitted associations to be learned between images which are already selectively represented in AIT cortex (cf. Buckley & Gaffan 1998). This idea is strikingly similar to learning of associations between locally represented entities through projections to and from a map-layer (e.g., ARTMAP; Carpenter, Grossberg & Reynolds 1991). It is also compatible with Booth and Rolls's (1998) recent discovery of both view-specific and view-invariant representations of familiar objects in IT cortex and with the more general idea that simple tasks, such as item recognition, can be mediated by brain areas separate from the hippocampus (Aggleton & Brown in press).

Perhaps the most crucial part of this series of experiments was carried out by Sakai, Naya and Miyashita (1994), concerning what the authors called the fine-form tuning of each of the AIT neurons. They are unusual in making it clear that ``one may mistakenly conclude that the most effective form in a screening test, is the optimum form for a recorded cell.'' In other words, finding that a cell responds most strongly to item D when tested with items A, B, C and D does not imply that item D is the optimal stimulus for that cell, but only that it is the best of those tested. They circumvented this potential problem by using, in their visual-pattern pair-association task (as above), patterns generated from Fourier descriptors, such that continuous variation of a small number of parameters could generate continuous transformations of each member of the trained pattern set. These transformed patterns were always much closer in parameter space to the original patterns than the original, randomly parameterized patterns were to each other. For each recorded neuron the authors identified the original pattern (from a total of 24 on which the monkeys had been trained) which elicited the strongest response. Given the large degree of pattern variation in this screening set, and thus the relatively broad nature of the cell selection process, there was no guarantee that each cell so selected would respond more strongly to its corresponding trained pattern than to fine-grained transformations of that pattern. Nonetheless, this was exactly what was found. In the majority of cases, presenting the transformed patterns resulted in a weaker response; in no case was the response to the transformed pattern stronger than that to the original learned pattern. This implies that the single-cell response is tuned to, or centred on, the particular visual pattern learned. 
Such a result is difficult to explain in terms of population coding unless one assumes that individual members of the active population of cells are tuned, by experience, to give a maximum response to a particular learned pattern -- but such an account is not just similar to a localist account, it is a localist account. I should note that a similar experiment was performed by Amit, Fusi and Yakovlev (1997) and although they report cell-recording results from a single cell which slightly increases its response to a degraded version of a previously trained visual pattern, they indicate that, on average, the IT cells from which recordings were elicited showed a decrease in response to degraded versions of the trained patterns, consistent with the results of Sakai et al. (1994).

7. So What's Wrong with Using Distributed Representations Throughout?

So far my emphasis has been on demonstrating the benefits of an approach to modelling that uses localist representations in addition to featural/distributed representations. This pro-localist, rather than anti-distributed, stance has been quite deliberate. Nonetheless, it risks being interpreted as indicating equanimity in the selection of a modelling approach. To counter this interpretation I shall briefly outline some of the reasons the thoroughgoing distributed approach seems less promising. Owing to space limitations, I shall refer to other work for the detail of some of the arguments. In referring to distributed modelling techniques, I shall take as my example the backprop (BP) network. This is perhaps unfair, for its deficiencies do not necessarily apply to all fully distributed approaches (for a brief discussion of a rather different class of networks, viz. Hopfield-type attractor networks, see Section 4.3.2). Nevertheless BP has been a dominant approach in connectionist modelling in psychology over the last decade, and is hence the most obvious candidate for an illustrative example. Before outlining my objections to BP and its relatives I should sound a note of caution. I do not intend my criticisms to be taken as an attempt to devalue the scientific contribution made by the body of work built around BP and distributed representations. In many cases, such as in the fields of reading and past-tense learning, the theorizing of PDP-style connectionists has stimulated considerable debate and has forced a reconsideration of long-held views, not least by elegantly demonstrating, via simulation, the falsity of certain (though not all) opposing claims. The fact that many of these debates are yet to be resolved is testament to the potency and value of the scientific challenge posed by this brand of eliminative connectionism.

7.1 The Stability-Plasticity Dilemma a.k.a. Catastrophic Interference

The stability-plasticity dilemma (Grossberg 1987) refers to the need for a learning system to be both stable, in the sense that it protects what it has learned from overwriting, and plastic, in that it remains capable of new learning. Grossberg (1987) offered principled reasons why the BP algorithm, unlike certain localist models, fails to solve the stability-plasticity dilemma. McCloskey and Cohen (1989) identified the same problem in a simulation of association learning and referred to the lack of stability as an exhibition of ``catastrophic interference.'' Essentially the same phenomenon was noted by Ratcliff (1990). There has been a good deal of work on the subject since (e.g. French 1991; 1994; Lewandowsky 1991; McRae & Hetherington 1993; Murre 1992; Sharkey & Sharkey 1995; Sloman & Rumelhart 1992) most of which has concluded that in order to reduce catastrophic interference one must reduce the overlap between the hidden-unit representations that intervene between particular associated pattern-pairs. This is, of course, exactly what is achieved by using localist representations as intermediates (for a review and a look-up model similar to that proposed here, see Sharkey & Sharkey 1995).

The problem of catastrophic interference occurs in backprop networks as a result of the gradient-descent learning procedure. At any point during learning the network weights are being changed so as to follow a descending trajectory on an error surface in weight-space. The problem occurs because the shape of this error surface only depends on the patterns in the current learning set -- indeed, the network can only move appropriately in weight space by waiting until it has sampled each member of the current training set before making an ``amalgamated'' move. A consequence is that this error-reducing move does not take into account previously learned training sets. The only way it can do so is by accumulating training sets, so that new training sets are interleaved with all previous training sets. Learning in such networks is therefore ``off-line'' at two levels: first, any training set must be presented a large number of times, with small weight changes each time, for the relevant mapping to be stably learned; second, to avoid overwriting, previous training sets must be interleaved with the current set.
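
The dependence of interference on representational overlap can be seen even in a deliberately tiny system. In the sketch below a single linear unit trained with the delta rule stands in for a BP network (a substitution made for brevity, with pattern-by-pattern rather than amalgamated updates), and the patterns, learning rate and number of epochs are hypothetical. It compares sequential training on two "training sets" of one pair each, using overlapping versus non-overlapping input codes, and then interleaved training with the overlapping codes.

```python
def output(weights, x):
    """Response of a single linear unit."""
    return sum(w * xi for w, xi in zip(weights, x))


def train(weights, pairs, lr=0.1, epochs=200):
    """Delta-rule training on the given (input, target) pairs only."""
    w = list(weights)
    for _ in range(epochs):
        for x, target in pairs:
            err = target - output(w, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w


# Overlapping (distributed) input codes: the middle input unit is shared.
dist_a = ((1, 1, 0), 1.0)
dist_b = ((0, 1, 1), -1.0)
# Non-overlapping (localist) input codes: one dedicated unit per item.
loc_a = ((1, 0), 1.0)
loc_b = ((0, 1), -1.0)


def error_on_a_after_sequential(item_a, item_b):
    w = train([0.0] * len(item_a[0]), [item_a])  # learn set A alone
    w = train(w, [item_b])                       # then learn set B alone
    return abs(item_a[1] - output(w, item_a[0]))


def error_on_a_after_interleaved(item_a, item_b):
    w = train([0.0] * len(item_a[0]), [item_a, item_b])
    return abs(item_a[1] - output(w, item_a[0]))


err_dist_seq = error_on_a_after_sequential(dist_a, dist_b)
err_loc_seq = error_on_a_after_sequential(loc_a, loc_b)
err_dist_inter = error_on_a_after_interleaved(dist_a, dist_b)

print(f"overlapping codes, sequential:  error on A = {err_dist_seq:.3f}")
print(f"localist codes, sequential:     error on A = {err_loc_seq:.3f}")
print(f"overlapping codes, interleaved: error on A = {err_dist_inter:.3f}")
```

With overlapping codes, learning the second pair pulls the shared weight away from the value the first pair needed, so the first mapping is damaged unless the two pairs are interleaved; with non-overlapping codes the second pair touches none of the weights used by the first.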

Since the problem of catastrophic interference has been well described elsewhere (references above), I shall not describe it further. Rather, I would like to make some observations regarding a proposal advanced by McClelland, McNaughton and O'Reilly (1995) that has been taken by some as mitigating the problem of catastrophic interference with reference to brain function, and hence enhancing the plausibility of fully distributed modelling. Their proposal is that the hippocampus permits fast learning of pattern associations on-line, subsequently allowing these associated patterns to be replayed to a fully distributed neocortical learning system off-line, perhaps during sleep. The presentation of this hippocampally generated material to the neocortical system is effectively interleaved with patterns derived from continuing exposure to the environment and other patterns ``reactivated'' from among those already stored in neocortex. The neocortical system is supposed to be sufficiently slow-learning to avoid catastrophic interference under these conditions.

The idea of such memory consolidation has its roots in proposals by Marr (1970; 1971) and Squire, Cohen and Nadel (1984); McClelland et al. add a computational flavour by suggesting that the dual-store system has evolved in this way so as to finesse the interference problems of distributed learning systems. There are several points to be made regarding this account.

1. For McClelland et al.'s proposal to be viable, the hippocampal system must be able to learn pattern associations on-line, with minimal interference. They achieve this by the ``use of sparse, conjunctive coding in the hippocampus...[such that] representations of situations that differ only slightly may have relatively little overlap''. In other words, in order to support a fully distributed system at the neocortex, they assume what is effectively a localist system in the hippocampus. This rather weakens any argument in principle against localist representations.
2. In their description of the function of the dual-store system, McClelland et al. tend to confound the idea and benefits of slow learning with those of slow, interleaved learning. Although slow off-line consolidation of associations learned by a fast on-line system is appealing, this is regardless of whether what is learned in the fast system is interleaved with what is already present in the slow system. That is, the benefits of a dual-store system are quite independent of whether interleaved transfer is carried out from one to the other, as McClelland et al. propose. A dual-store system, with a fast system learning individual, contextualized episodes, and a slow system maintaining more enduring, context-free representations (analogous to the exemplar/prototype distinction described earlier) only demands interleaved learning if the slow system is prone to catastrophic interference.
3. Following from the previous point, one must be wary of a superficially tempting train of thought that runs like this: for a fully distributed neocortical system to avoid catastrophic interference, it must be supplemented by a fast, localist system; there exists a fast, localist system, embodied in the hippocampus; therefore the slow neocortical system is fully distributed. This logic is clearly erroneous, since the existence of a localist system in the hippocampus says nothing about whether the neocortical system is localist or fully distributed in nature. Both the fast and the slow systems might be localist, thus eliminating the problem of catastrophic interference in the neocortex without resorting to the complexities of interleaved learning.
4. Last, and perhaps most important, part of the putative mechanism of interleaved consolidation seems to be inadequate. McClelland et al. maintain that new patterns stored in hippocampus are potentially interleaved both with patterns encountered during continuous exposure to the environment and with other patterns previously learned by neocortex. The former (i.e., those items which are continuing to be represented environmentally) will presumably be hippocampally stored and consolidated anyway, so their interleaving can be accomplished either indirectly via the hippocampus or directly from the environment. The problem concerns the source, for interleaving purposes, of those old patterns which are no longer represented in hippocampus, but which are stored solely in neocortex (e.g., those patterns which are hypothesized to survive hippocampal damage in retrograde amnesia). The dilemma is this: How can those pattern associations stored in neocortex be used to train neocortex? There is a basic problem here: an error-based learning system, such as the one proposed, cannot teach itself. This would be rather like asking someone to mark their own homework. First, if the neocortical system is imagined as a trained BP network (or similar), it is unclear how one can extract from the network a representative sample of the input patterns on which it was trained, so that these might be interleaved during training with the new hippocampal patterns. Second, even if one could generate the relevant input patterns, it is unclear how the network could then, given one of those input patterns, generate both an output pattern and a different target pattern, as is required for gradient-descent learning. As these two patterns, if they are both to be generated by the neocortical system, will be the same, there will never be any error term to backpropagate and hence no learning. The old neocortical patterns will remain effectively unconsolidated and hence vulnerable to catastrophic interference.
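
The second difficulty raised in the final point can be written out for a generic error-driven weight update. The notation here (learning rate $\eta$, target $t$, output $o$, input activation $x_i$, weight $w_i$) is introduced only for illustration and is not taken from McClelland et al.; the delta-rule form is used as the simplest case, although the same reasoning applies to the output-layer error term in backpropagation:

\[
\Delta w_i = \eta \, (t - o) \, x_i , \qquad t = o \;\Longrightarrow\; \Delta w_i = 0 \quad \mbox{for all } i .
\]

If the network supplies its own target, then $t = o$ by construction, the error term vanishes, and the replayed pattern produces no weight change at all.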

The only way out of this problem seems to be to find some way of accurately sampling the neocortical store prior to any perturbing inputs from the hippocampus so as to generate a training set of input and target patterns which can be (fast) stored in some other system and appropriately used for slow, interleaved learning in the neocortex. Such a scheme has not yet been proposed, although Robins (1995) and French (1997) have suggested similar schemes, whereby a smaller but somehow representative set of pseudopatterns is loaded back from the slow system to the fast system (i.e., presumably, the hippocampus) so that the neocortical training set comprises a hippocampally generated mixture of these pseudopatterns with recently acquired patterns. Disregarding the fact that such schemes seem to provide less than solid protection to old memories (with 20-30% loss after only 10-20 new intervening patterns, often using more pseudopatterns than original pattern pairs), they also seem to imply that all knowledge, old or new, must be effectively located in a fast-learning system (the hippocampus?), with the older knowledge also stored neocortically. Although this is construable as consistent with evidence from animals and humans with hippocampal damage, it is not consistent with recent data from Graham and Hodges (1997) and Snowden, Griffiths and Neary (1996), who show preserved recent memories and impaired distant memories in patients with semantic dementia who have relative sparing of the hippocampal complex.

The foregoing paragraph illustrates some difficulties in endogenously generating, from neocortex, patterns for interleaving with hippocampally generated patterns during consolidation to neocortex. If these criticisms are accepted, avoiding catastrophic interference depends strongly on the assumption that exogenously generated patterns (more particularly, pattern pairs, encountered during ongoing exposure to the environment), will be representative of the older contents of neocortex. Note that for a localist neocortical system, or indeed for any neocortical system not prone to catastrophic interference, this constraint on the stability of the environment is not required. Hence in McClelland et al.'s approach a fully distributed neocortex demands that the environment be largely stable and the learning rate be very slow. Having arrived at this conclusion one is tempted to ask: Why bother, under these conditions, with consolidation from hippocampus to neocortex at all? Evidence for consolidation is an important component of the data evinced by McClelland et al. in support of their model, but, under conditions of a stable environment and a slow-learning neocortex, it is not clear what role consolidation plays. For example, if it is held to hasten the incorporation of new knowledge into neocortex, this will reduce the chances of old knowledge being resampled from the environment during the period over which this incorporation takes place, thus increasing the chances of interference.

7.2 Implausibility of the Learning Rule

Even if one were to disregard the associated problems of catastrophic interference and interleaved off-line learning, there are still considerable doubts about the validity of the BP learning rule as a brain mechanism. These doubts are readily acknowledged even by those most associated with the use of this technique, and this can lead to some rather curious conclusions:

``as an example, they focus on the back-propagation learning algorithm...pointing out that it is very implausible as a model of real learning in the brain...This is, of course, true...But even this glass is a quarter full: in many cases...one is not interested in modelling learning per se, and the so-called learning algorithm is used to set the weights in the network so that it will perform the tasks of interest. The term `learning' has irrelevant psychological connotations in these cases and it might be less confusing to call such algorithms `weight setting algorithms'. Unless there is some systematic relationship between the way the necessary weights are found and the aspects of model performance under study, which in general we have no reason to expect, it is harmless to use unrealistic learning algorithms'' (Farah 1994, p.96)

Skipping over the fact that the whole localist-distributed debate gives us every reason to expect a systematic relationship between the means of learning and the subsequent performance of the system, it seems that, for Farah at least, one of the major advantages of connectionist models over more traditional models -- that they could provide some account of how certain mappings are learned by example -- is irrelevant. Given this quote, would it be legitimate to use connectionist networks as psychological models even if it could be proved that the weighted connections in those networks could never have been acquired by a process consistent with the structure and function of the human brain? And would ``lesioning'' such networks and comparing their subsequent performance with that of a brain-injured patient still be considered appropriate, as it is now?

The question regarding the correspondence of network models and brain function is a difficult one. It is of course perfectly justified to use networks as cognitive models, disclaiming any necessary connection with actual brains. Having done so, however, one is less justified in going on to use those same networks in simulations of brain damage or in studies involving functional brain imaging. Throughout most of this paper, I have been happy to propose localist models as cognitive models; in the latter sections I hope to have conveyed some optimism that they might also be appropriate as models of brain function. The convergence of cognitive and neural models is, I suggest, a good thing, and seems more likely to emerge from a localist modelling approach than any current alternative.

7.3 The dispersion problem

I referred earlier to the so-called ``dispersion problem,'' identified by Plaut et al. (1996) as the problem underlying the poor performance of the Seidenberg and McClelland (1989) model when applied to nonword reading. Their basic observation was that in Seidenberg and McClelland's distributed representation of, say, an orthographic onset cluster, the fact that an L is present, with the same associated pronunciation, in the words ``Log'', ``Glad'' and ``Split'' is utterly concealed by the representational scheme used for orthography (in this case so-called Wickelfeatures). As noted earlier, Plaut et al.'s solution was to adopt a completely localist representation of input orthography and output phonology. Is dispersion a general problem with distributed representations? Moving up a hierarchical level, as it were, will we see a similar problem when we wish to represent sentences in relation to their constituent words? Suppose we have a scheme for representing the sentence ``John loves Mary'' in which neither ``John'' nor ``Mary'' nor ``loves'' is locally represented. Will the similarity between this sentence and the sentences ``Mary loves John'' or ``John loves Ann'', or the novel sentence ``John is celebrating his 30th birthday today'', be concealed by such a representational scheme? It is clearly difficult to answer this question for all such distributed schemes but the related issues of systematicity and compositionality in connectionist models are identified by Fodor and Pylyshyn (1988) as being of general concern. While local representation on its own is not sufficient to address the problems raised by Fodor and Pylyshyn, the addition of a means for dynamic binding and inference (see e.g. Shastri and Ajjanagadde 1993) might come closer to providing a satisfactory solution.
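
The dispersion of the letter L can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a reimplementation of either model: padded letter triples stand in, loosely, for a context-sensitive distributed code of the Wickelfeature kind, and a set of hypothetical onset-letter units stands in for a localist orthographic code. The function names, the vowel-based definition of the onset and the unit labels are all assumptions made for the example.

```python
VOWELS = set("aeiou")

def letter_triples(word):
    """Context-sensitive code: every letter is represented only
    together with its left and right neighbours ('_' marks a boundary)."""
    padded = "_" + word.lower() + "_"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def onset_units(word):
    """Hypothetical localist code: one unit per letter occurring in the
    onset, taken here to be everything before the first vowel."""
    units = set()
    for ch in word.lower():
        if ch in VOWELS:
            break
        units.add("onset:" + ch)
    return units

words = ["log", "glad", "split"]

# Units that all three words have in common under each scheme.
shared_triples = set.intersection(*(letter_triples(w) for w in words))
shared_onset = set.intersection(*(onset_units(w) for w in words))

print("shared triples:", shared_triples)
print("shared onset units:", shared_onset)
```

Under these assumptions the three words share no triple at all, so nothing learned about L in one word need transfer to the others, whereas the single onset unit for L is common to all three.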

7.4 Problems deciding ``when'' and ``what''

One problem with fully distributed networks which often gets overlooked in isolated simulations concerns the nature of the decision at the network's output. Let us take as an example a three-layer BP net trained on the mapping between distributed patterns representing faces and others representing their identities. The problem is twofold: how does the network indicate when it has identified a given test face? and how does the network indicate the identity itself? Suppose we activate the input pattern corresponding to a given face. Once activation has percolated through the network, a distributed pattern will be present at the output. Recognition might be signalled the moment this pattern is instated but if so, what would be the criterion by which the arrival of an output pattern might be judged? Alternatively, there might be some clean-up process acting on the output pattern which will take time to settle into an equilibrium state. But in this case, how will the network signal that it has arrived at this state? One might speculate that there is some process overseeing the clean-up system which monitors its ``energy'' and signals when this energy reaches a stable minimum. But what might be the locus of this energy-monitoring system and how might it function? (For a computational implementation of a similar scheme based on settling times, see Plaut et al. 1996.) Even supposing that the system knows when a stable state has been reached, how does it surmise which state has been reached? It cannot ``look at'' the states of each of the output nodes individually, since by definition these do not unambiguously identify the referent of the overall pattern. Thus the identification system must consider the states of all the nodes simultaneously and must generate that identity which is maximally consistent with the current output pattern. But such a system is most obviously implemented using just the sort of localist decision-process described earlier. 
Indeed, Amit (1989) has identified just this kind of localist ``read-out'' node as an essential adjunct to the distributed attractor networks with which he has been most concerned (Amit 1995, pp. 38-43). Advocates of fully distributed models might claim that all possible actions based on the identity implied by a given output pattern can simply be triggered by that output pattern via subsequent fully distributed networks. I cannot categorically deny this claim, though it seems rather unlikely to prove feasible in general.
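
One way in which a layer of localist read-out nodes could settle both ``when'' and ``what'' is sketched below in Python. It is a minimal sketch under assumptions of my own choosing rather than the decision process of any published model: each hypothetical read-out node receives input equal to the cosine match between the distributed output pattern and a stored template, integrates that input leakily while being inhibited by its competitors, and signals recognition when its activation first crosses a threshold. The templates, parameter values and function names are all illustrative.

```python
import math

def cosine(u, v):
    """Normalised match between two patterns (0 if either is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def localist_readout(output, templates, threshold=0.8, rate=0.2,
                     inhibition=0.1, max_steps=200):
    """Return (identity, step) for the first node to reach threshold,
    or (None, max_steps) if no node ever does."""
    inputs = {name: cosine(output, t) for name, t in templates.items()}
    act = {name: 0.0 for name in templates}
    for step in range(1, max_steps + 1):
        total = sum(act.values())
        # Synchronous update: leaky integration of the input, minus
        # lateral inhibition from the other read-out nodes.
        act = {
            name: max(0.0, a + rate * (inputs[name]
                                       - inhibition * (total - a) - a))
            for name, a in act.items()
        }
        winner = max(act, key=act.get)
        if act[winner] >= threshold:
            return winner, step   # "what" and "when" in one event
    return None, max_steps

# Hypothetical identity templates and a noisy distributed output pattern.
templates = {"alice": [1, 0, 1, 0], "bob": [0, 1, 1, 0], "carol": [1, 1, 0, 1]}
winner, steps = localist_readout([0.9, 0.2, 0.8, 0.1], templates)
print(winner, steps)
```

In this sketch the threshold crossing of a single node carries both pieces of information at once: the crossing time is the ``when'' and the node's label is the ``what''. An output pattern that matches no template well enough need never reach threshold, in which case the function returns None.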

This problem is often obscured in actual simulations using distributed systems because the identification process is done by the modeller rather than by the model. A typical approach is to take the distributed output pattern and to calculate which of the learned patterns it best matches, sometimes adding a Luce choice-process for good measure. It would be preferable to have this functionality built into the network rather than run as an off-line algorithm. I am not claiming that fully distributed systems cannot incorporate such functionality but I have yet to see a specific system which has successfully done so.
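
The off-line identification step just described might look something like the following Python sketch. It is offered only to make the point concrete: the stored patterns are hypothetical, and the choice of Euclidean distance, an exponential similarity function and the sensitivity parameter are assumptions, not details taken from any particular simulation. The Luce choice rule itself is simply each candidate's strength divided by the summed strengths of all candidates.

```python
import math

def offline_identification(output, learned, sensitivity=4.0):
    """Best-match read-out performed by the modeller, outside the network.

    Similarity to each learned pattern is exp(-sensitivity * distance);
    the Luce choice rule then turns similarities into probabilities.
    """
    strengths = {}
    for name, pattern in learned.items():
        dist = math.sqrt(sum((o - p) ** 2 for o, p in zip(output, pattern)))
        strengths[name] = math.exp(-sensitivity * dist)
    total = sum(strengths.values())
    probs = {name: s / total for name, s in strengths.items()}
    best = max(probs, key=probs.get)
    return best, probs

# Hypothetical learned identity patterns and a noisy network output.
learned = {"alice": [1, 0, 1, 0], "bob": [0, 1, 1, 0], "carol": [1, 1, 0, 1]}
best, probs = offline_identification([0.9, 0.2, 0.8, 0.1], learned)
print(best, probs)
```

Note that the loop over learned patterns, the normalization and the final maximum are all computed by the surrounding program; none of this work is done by the distributed network whose output is being classified.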

7.5 Problems of Manipulation

On a related note, it sometimes proves difficult to manipulate distributed representations in the same way as one can manipulate localist representations. As an example, in most models of immediate serial recall (e.g. Burgess & Hitch 1992, in press; Page & Norris 1998) it proves necessary to suppress the recall of items that have already been recalled. If the items are locally represented then this can easily be achieved by suppressing the activation of the relevant node. If the items are represented in a distributed fashion, however, such that the representations of different items overlap, it is difficult to suppress one item without partially suppressing others.
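
The contrast can be made concrete with a small Python sketch. The three items, their activation strengths and the overlapping four-unit patterns are all hypothetical, and suppression of a distributed item is assumed, for simplicity, to mean silencing every unit that takes part in its pattern; other suppression schemes are of course possible.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Localist case: one node per item, so suppression touches one node only.
localist = {"A": 0.9, "B": 0.7, "C": 0.5}
localist_after = dict(localist)
localist_after["A"] = 0.0            # item A has just been recalled

# Distributed case: items are overlapping patterns over four units
# (A and B share unit 1, B and C share unit 2).
patterns = {"A": [1, 1, 0, 0], "B": [0, 1, 1, 0], "C": [0, 0, 1, 1]}
strengths = {"A": 0.9, "B": 0.7, "C": 0.5}

# The network state is the strength-weighted superposition of the items.
state = [sum(strengths[n] * patterns[n][i] for n in patterns)
         for i in range(4)]
support_before = {n: dot(state, patterns[n]) for n in patterns}

# Suppress A by silencing every unit that participates in A's pattern.
suppressed = [0.0 if patterns["A"][i] else s for i, s in enumerate(state)]
support_after = {n: dot(suppressed, patterns[n]) for n in patterns}

print("localist:", localist, "->", localist_after)
print("distributed:", support_before, "->", support_after)
```

In the localist case the activations of B and C are untouched. In the distributed case the support for B falls along with that for A, because B shares a unit with A, while C, which has no unit in common with A, is unaffected.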

7.6 Problems of Interpretation

Fully distributed networks are very much more difficult to interpret than their localist counterparts. It is often hard to explain how a distributed network performs a given mapping task. This is not necessarily a problem for the model qua simulation but it is a distinct problem for the model qua explanatory theory. Unfortunately, space does not permit a worthwhile consideration of this point here but excellent discussions can be found in Forster (1994), Green (1998 and subsequent correspondence), Jacobs and Grainger (1994), Massaro (1988), McCloskey (1991), Ramsey (1997) and Seidenberg (1993).

8. Conclusion

This target article has sought to clarify the differences between localist and fully distributed models. It has emphasized how the difference lies not in their use of distributed representations, which occur in both types of model, but in the additional use of local representations, which are used only by the localist. It has been shown, in general, how localist models might be applied in a variety of domains, noting their close relationship with some classic models of choice behaviour, stimulus generalization, pattern classification, choice reaction-time and power-law speed-up with practice. We have discussed how localist models can exhibit generalization, attractor behaviour, categorical ``perception'' and effects of age of acquisition. Supervised learning of pattern associations via localist representations was (re)shown to be self-organizing, stable and plastic.

We have considered a number of powerful cognitive models that are either implemented as, or are transparently implementable with, localist networks, and have attempted to defuse some of the more common criticisms of such networks. Some of the relevant neuroscientific data have been surveyed, along with areas in which localist models have been rejected apparently without good cause. Some neuroscientific data supportive of a localist approach have been reviewed, along with the reasons a fully distributed modelling stance may be less promising than the localist alternatives, catastrophic interference being the most serious among several enduring problems for the fully distributed approach.

The conclusion is that localist networks are far from being implausible: They are powerful, flexible, implementable and comprehensible, as well as being indicated in at least some parts of the brain. By contrast, the fully distributed networks most often used by the PDP community underperform in some domains, necessitate complex and implausible learning rules, demand rather baroque learning dynamics, and encourage opacity in modelling. One might even say that if the brain doesn't use localist representations then evolution has missed an excellent trick.

Bibliography

Aggleton, J. P. and M. W. Brown (in press).
Episodic memory, amnesia, and the hippocampal-anterior thalamic axis.
Behavioral and Brain Sciences.

Amit, D. J. (1989).
Modeling Brain Function: The World of Attractor Neural Networks.
Cambridge, UK: Cambridge University Press.

Amit, D. J. (1995).
The Hebbian paradigm reintegrated: Local reverberations as internal representations.
Behavioral and Brain Sciences 18, 617-657.

Amit, D. J., S. Fusi, and V. Yakovlev (1997).
Paradigmatic working memory (attractor) cell in IT cortex.
Neural Computation 9, 1071-1092.

Barlow, H. (1972).
Single units and sensation: A neuron doctrine for perceptual psychology.
Perception 1, 371-394.

Barlow, H. (1995).
The neuron doctrine in perception.
In M. S. Gazzaniga (Ed.), The Cognitive Neurosciences, pp. 415-434. Cambridge, MA: MIT Press.

Booth, M. C. A. and E. T. Rolls (1998).
View-invariant representations of familiar objects by neurons in the inferior temporal visual cortex.
Cerebral Cortex 8, 510-523.

Bower, G. H. (1996).
Reactivating a reactivation theory of implicit memory.
Consciousness and Cognition 5, 27-72.

Britten, K. H., M. N. Shadlen, W. T. Newsome, and J. A. Movshon (1992).
The analysis of visual motion: A comparison of neuronal and psychophysical performance.
The Journal of Neuroscience 12(12), 4745-4765.

Buckley, M. J. and D. Gaffan (1998).
Perirhinal cortex ablation impairs configural learning and paired-associate learning equally.
Neuropsychologia 36, 535-546.

Bundesen, C. (1993).
The relationship between independent race models and Luce's choice axiom.
Journal of Mathematical Psychology 37, 446-471.

Burgess, N. and G. J. Hitch (1992).
Towards a network model of the articulatory loop.
Journal of Memory and Language 31, 429-460.

Burgess, N. and G. J. Hitch (in press).
Memory for serial order: A network model of the phonological loop and its timing.
Psychological Review.

Burton, A. M. (1994).
Learning new faces in an interactive activation and competition model.
Visual Cognition 1(2/3), 313-348.

Burton, A. M., V. Bruce, and R. A. Johnston (1990).
Understanding face recognition with an interactive activation model.
British Journal of Psychology 81, 361-380.

Carpenter, G. A. and S. Grossberg (1987a).
ART2: Stable self-organization of pattern recognition codes for analog input patterns.
Applied Optics 26, 4919-4930.

Carpenter, G. A. and S. Grossberg (1987b).
A massively parallel architecture for a self-organizing neural pattern recognition machine.
Computer Vision, Graphics and Image Processing 37, 54-115.

Carpenter, G. A., S. Grossberg, and J. H. Reynolds (1991).
ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network.
Neural Networks 4, 565-588.

Carpenter, G. A., S. Grossberg, and D. B. Rosen (1991).
ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition.
Neural Networks 4(4), 493-504.

Cohen, M. A. and S. Grossberg (1983).
Absolute stability of global pattern formation and parallel memory storage by competitive neural networks.
IEEE Transactions on Systems, Man, and Cybernetics 13, 815-826.

Coltheart, M., B. Curtis, P. Atkins, and M. Haller (1993).
Models of reading aloud: Dual route and parallel-distributed-processing approaches.
Psychological Review 100(4), 589-608.

Dell, G. S. (1986).
A spreading-activation theory of retrieval in sentence production.
Psychological Review 93(3), 283-321.

Dell, G. S. (1988).
The retrieval of phonological forms in production: Tests of predictions from a connectionist model.
Journal of Memory and Language 27, 124-142.

Estes, W. K. (1972).
An associative basis for coding and organization in memory.
In A. W. Melton and E. Martin (Eds.), Coding Processes in Human Memory. Washington, D.C.: V. H. Winston.

Estes, W. K. (1986).
Array models for category learning.
Cognitive Psychology 18, 500-549.

Farah, M. J. (1994).
Interactions on the interactive brain.
Behavioral and Brain Sciences 17(1), 90-104.

Farah, M. J., R. C. O'Reilly, and S. P. Vecera (1993).
Dissociated overt and covert recognition as an emergent property of a lesioned neural network.
Psychological Review 100(4), 571-588.

Feldman, J. A. (1988).
Connectionist representation of concepts.
In D. Waltz and J. A. Feldman (Eds.), Connectionist Models and their Implications. New York: Ablex.

Fodor, J. and Z. Pylyshyn (1988).
Connectionism and cognitive architecture: A critical analysis.
Cognition 28, 3-71.

Foldiak, P. (1991).
Models of sensory coding.
Technical Report CUED/F-INFENG/TR 91, Physiological Laboratory, University of Cambridge.

Forster, K. I. (1994).
Computational modeling and elementary process analysis in visual word recognition.
Journal of Experimental Psychology: Human Perception and Performance 20(6), 1292-1310.

French, R. M. (1991).
Using semi-distributed representations to overcome catastrophic interference in connectionist networks.
In Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society, Hillsdale, NJ, pp. 173-178. Lawrence Erlbaum.

French, R. M. (1992).
Semi-distributed representations and catastrophic forgetting in connectionist networks.
Connection Science 4, 365-377.

French, R. M. (1994).
Dynamically constraining connectionist networks to produce orthogonal, distributed representations to reduce catastrophic interference.
In Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, Hillsdale, NJ, pp. 335-340. Lawrence Erlbaum.

French, R. M. (1997, April).
Pseudo-recurrent connectionist networks and the problem of sequential learning.
Paper presented at the Fourth Neural Computation and Psychology Workshop, London.

Fried, L. S. and K. J. Holyoak (1984).
Induction of category distributions: A framework for classification learning.
Journal of Experimental Psychology: Learning, Memory and Cognition 10, 234-257.

Georgopoulos, A. P., R. E. Kettner, and A. B. Schwartz (1988).
Primate motor cortex and free arm movements to visual targets in three-dimensional space.
The Journal of Neuroscience 8, 2928-2937.

Gluck, M. A. and C. E. Myers (1997).
Extending models of hippocampal function in animal conditioning to human amnesia.
Memory 5(1/2), 179-212.

Graham, K. S. and J. R. Hodges (1997).
Differentiating the roles of the hippocampal complex and the neocortex in long-term memory storage: Evidence from the study of semantic dementia and Alzheimer's disease.
Neuropsychology 11(1), 77-89.

Grainger, J. and A. M. Jacobs (1996).
Orthographic processing in visual word recognition: A multiple read-out model.
Psychological Review 103(3), 518-565.

Grainger, J. and A. M. Jacobs (1998).
On localist connectionism and psychological science.
In J. Grainger and A. M. Jacobs (Eds.), Localist Connectionist Approaches to Human Cognition, pp. 1-38. Mahwah, NJ: Lawrence Erlbaum Associates.

Green, C. D. (1998).
Are connectionist models theories of cognition?
Psycoloquy 9.

Gross, C. G. (1992).
Representation of visual stimuli.
Philosophical Transactions of the Royal Society London, Series B 335, 3-10.

Grossberg, S. (1972).
Neural expectation: Cerebellar and retinal analogs of cells fired by learnable or unlearned pattern classes.
Kybernetik 10, 49-57.

Grossberg, S. (1982).
Studies of Mind and Brain.
NY: Reidel.

Grossberg, S. (1987).
Competitive learning: from interactive activation to adaptive resonance.
Cognitive Science 11, 23-63.

Grossberg, S. (1997).
Neural models of development and learning.
Behavioral and Brain Sciences 20, 566.

Harnad, S. (Ed.) (1987).
Categorical Perception: The Groundwork of Cognition.
New York: Cambridge University Press.

Harris, C. S. (1980).
Insight or out of sight: Two examples of perceptual plasticity in the human adult.
In C. S. Harris (Ed.), Visual Coding and Adaptability, pp. 95-149. Hillsdale, NJ: Erlbaum Associates.

Hartley, T. and G. Houghton (1996).
A linguistically constrained model of short-term memory for nonwords.
Journal of Memory and Language 35, 1-31.

Hecht-Nielsen, R. (1987).
Counterpropagation networks.
Applied Optics 26, 4979-4984.

Hinton, G. E., J. L. McClelland, and D. E. Rumelhart (1986).
Distributed representations.
In D. E. Rumelhart, J. L. McClelland, and PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. Foundations, pp. 77-109. Cambridge, MA: MIT Press.

Hintzman, D. L. (1986).
Schema abstraction in a multiple-trace memory model.
Psychological Review 93(4), 411-428.

Hoguchi, S.-I. and Y. Miyashita (1996).
Formation of mnemonic neural responses to visual paired associates in inferotemporal cortex is impaired by perirhinal and entorhinal lesions.
Proceedings of the National Academy of Sciences 93, 739-743.

Hopfield, J. (1982).
Neuronal networks and physical systems with emergent collective computational abilities.
Proceedings of the National Academy of Sciences 79, 2554-2558.

Hopfield, J. (1984).
Neurons with graded response have collective computational properties like those of two-state neurons.
Proceedings of the National Academy of Sciences 81, 3088-3092.

Houghton, G. (1990).
The problem of serial order: A neural network model of sequence learning and recall.
In R. Dale, C. Mellish, and M. Zock (Eds.), Current Research in Natural Language Generation. London: Academic Press.

Howard, D. (1995).
Lexical anomia: Or the case of the missing lexical entries.
Quarterly Journal of Experimental Psychology 48A(4), 999-1023.

Hummel, J. E. and I. Biederman (1992).
Dynamic binding in a neural network for shape recognition.
Psychological Review 99(3), 480-517.

Hummel, J. E. and K. J. Holyoak (1997).
Distributed representations of structure: A theory of analogical access and mapping.
Psychological Review 104, 427-466.

Jacobs, A. M. and J. Grainger (1994).
Models of visual word recognition--sampling the state of the art.
Journal of Experimental Psychology: Human Perception and Performance 20(6), 1311-1334.

Kanerva, P. (1988).
Sparse Distributed Memory.
Cambridge, MA: MIT Press.

Kastner, S., P. D. Weerd, R. Desimone, and L. G. Ungerleider (1998).
Mechanisms of directed attention in the human extrastriate cortex as revealed by functional MRI.
Science 282, 108-111.

Keeler, J. D. (1988).
Comparison between Kanerva's SDM and Hopfield-type neural networks.
Cognitive Science 12, 299-329.

Kohonen, T. (1984).
Self-organization and Associative Memory.
Springer-Verlag.

Kornbrot, D. E. (1978).
Theoretical and empirical comparison of Luce's choice model and logistic Thurstone model of categorical judgement.
Perception and Psychophysics 37(1), 89-91.

Kruschke, J. K. (1992).
ALCOVE: An exemplar-based connectionist model of category learning.
Psychological Review 99(1), 22-44.

Lambon Ralph, M. A. (1998).
Distributed versus localist representations: Evidence from a study of item consistency in a case of classical anomia.
Brain and Language 64, 339-360.

Lee, C. L. and W. K. Estes (1981).
Item and order information in short-term memory: Evidence for multilevel perturbation processes.
Journal of Experimental Psychology: Human Learning and Memory 7, 149-169.

Levelt, W. J. M. (1989).
Speaking: from Intention to Articulation.
Cambridge, MA: MIT Press.

Lewandowsky, S. (1991).
Gradual unlearning and catastrophic interference: A comparison of distributed architectures.
In W. E. Hockley and S. Lewandowsky (Eds.), Relating Theory and Data: Essays on Human Memory in Honor of Bennet B. Murdock. Hillsdale, NJ: Lawrence Erlbaum Associates.

Lewis, J. E. and W. B. Kristan Jr (1998).
A neuronal network for computing population vectors in the leech.
Nature 391, 77-79.

Logan, G. D. (1988).
Towards an instance theory of automatization.
Psychological Review 95, 492-527.

Logan, G. D. (1990).
Repetition priming and automaticity: Common underlying mechanisms?
Cognitive Psychology 22, 1-35.

Logan, G. D. (1992).
Shapes of reaction-time distributions and shapes of learning curves: A test of the instance theory of automaticity.
Journal of Experimental Psychology: Learning, Memory and Cognition 18(5), 883-914.

López, F. J., D. R. Shanks, J. Almaraz, and P. Fernández (1998).
Effects of trial order on contingency judgements: A comparison of associative and probabilistic contrast accounts.
Journal of Experimental Psychology: Learning, Memory and Cognition 24, 672-694.

Luce, R. D. (1959).
Individual Choice Behaviour: A Theoretical Analysis.
New York: John Wiley and Sons, Inc.

Marr, D. (1970).
A theory for cerebral neocortex.
Journal of Physiology (London) 202, 437-470.

Marr, D. (1971).
Simple memory: A theory for archicortex.
Philosophical Transactions of the Royal Society of London, Series B 262, 23-81.

Marshall, J. A. (1990, June).
A self-organizing scale-sensitive neural network.
In Proceedings of the International Joint Conference on Neural Networks, Volume 3, pp. 649-654.

Massaro, D. W. (1987).
Categorical partition: A fuzzy logical model of categorization behavior.
In S. Harnad (Ed.), Categorical Perception: The Groundwork of Cognition, pp. 254-283. New York: Cambridge University Press.

Massaro, D. W. (1988).
Some criticisms of connectionist models of human performance.
Journal of Memory and Language 27, 213-234.

McClelland, J. L. (1979).
On the time relations of mental processes: An examination of systems of processes in cascade.
Psychological Review 86(4), 287-330.

McClelland, J. L. (1981).
Retrieving general and specific information from stored knowledge of specifics.
In Proceedings of the Third Annual Meeting of the Cognitive Science Society, pp. 170-172.

McClelland, J. L. (1991).
Stochastic interactive processes and the effect of context on perception.
Cognitive Psychology 23, 1-44.

McClelland, J. L. (1993).
Toward a theory of information processing in graded, random and interactive networks.
In D. E. Meyer and S. Kornblum (Eds.), Attention and Performance XIV. Synergies in Experimental Psychology, Artificial Intelligence, and Cognitive Neuroscience, pp. 655-688. Cambridge, MA: MIT Press.

McClelland, J. L. and J. Elman (1986).
The TRACE model of speech perception.
Cognitive Psychology 18, 1-86.

McClelland, J. L., B. L. McNaughton, and R. C. O'Reilly (1995).
Why are there complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory.
Psychological Review 102(3), 419-457.

McClelland, J. L. and D. E. Rumelhart (1981).
An interactive activation model of context effects in letter perception: Part 1. An account of basic findings.
Psychological Review 88, 375-407.

McCloskey, M. (1991).
Networks and theories.
Psychological Science 2(6), 387-395.

McCloskey, M. and N. J. Cohen (1989).
Catastrophic interference in connectionist networks: The sequential learning problem.
In G. Bower (Ed.), The Psychology of Learning and Motivation, Volume 24, pp. 109-165. New York: Academic Press.

McCollough, C. (1965).
Color adaptation of edge detectors in the human visual system.
Science 149, 1115-1116.

McLaren, I. P. L. (1993a).
APECS: A solution to the sequential learning problem.
In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, Hillsdale, NJ, pp. 717-722. Erlbaum.

McLaren, I. P. L. (1993b).
Catastrophic interference is eliminated in pretrained networks.
In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, Hillsdale, NJ, pp. 723-728. Erlbaum.

McRae, K., V. R. de Sa, and M. S. Seidenberg (1997).
On the nature and scope of featural representations of word meaning.
Journal of Experimental Psychology: General 126, 99-130.

Medin, D. L. and M. M. Schaffer (1978).
Context theory of classification learning.
Psychological Review 85, 207-238.

Minsky, M. and S. Papert (1969).
Perceptrons.
Cambridge, MA: MIT Press.

Miyashita, Y. (1993).
Inferior temporal cortex: Where visual perception meets memory.
Annual Review of Neuroscience 16, 245-263.

Morrison, C. M. and A. W. Ellis (1995).
Roles of word frequency and age of acquisition in word naming and lexical decision.
Journal of Experimental Psychology: Learning, Memory and Cognition 21(1), 116-133.

Morton, J. (1969).
The interaction of information in word recognition.
Psychological Review 76, 165-178.

Mountcastle, V. B. (1997).
The columnar organization of the neocortex.
Brain 120, 701-722.

Murre, J. M. J. (1992).
Learning and Categorization in Modular Neural Networks.
Hertfordshire, UK: Harvester Wheatsheaf.

Murre, J. M. J., R. H. Phaf, and G. Wolters (1992).
CALM: Categorizing and learning module.
Neural Networks 5, 55-82.

Newsome, W. T., K. H. Britten, and J. A. Movshon (1989).
Neuronal correlates of a perceptual decision.
Nature 341, 52-54.

Nigrin, A. L. (1993).
Neural Networks for Pattern Recognition.
Cambridge, MA: MIT Press.

Norris, D. (1994a).
A quantitative multiple-levels model of reading aloud.
Journal of Experimental Psychology: Human Perception and Performance 20, 1212-1232.

Norris, D. (1994b).
SHORTLIST: A connectionist model of continuous speech recognition.
Cognition 52, 189-234.

Nosofsky, R. M. (1985).
Luce's choice model and Thurstone's categorical judgement model compared.
Perception and Psychophysics 37(1), 89-91.

Nosofsky, R. M. (1986).
Attention, similarity and the identification-categorization relationship.
Journal of Experimental Psychology: General 115(1), 39-57.

Nosofsky, R. M. (1987).
Attention and learning processes in the identification and categorization of integral stimuli.
Journal of Experimental Psychology: Learning, Memory and Cognition 13(1), 87-108.

Nosofsky, R. M. (1990).
Relations between exemplar-similarity and likelihood models of classification.
Journal of Mathematical Psychology 34, 393-418.

Nosofsky, R. M. and T. J. Palmeri (1997).
An exemplar-based random walk model of speeded classification.
Psychological Review 104(2), 266-300.

Oden, G. C. and D. W. Massaro (1978).
Integration of featural information in speech perception.
Psychological Review 85, 172-191.

Oram, M. W., P. Földiák, D. I. Perrett, and F. Sengpiel (1998).
The `ideal homunculus': Decoding neural population signals.
Trends in Neurosciences 21, 259-265.

Page, M. P. A. (1993, November).
Modelling Aspects of Music Perception using Self-Organizing Neural Networks.
Ph. D. thesis, University of Wales College of Cardiff, Cardiff, UK.

Page, M. P. A. (1994).
Modelling the perception of musical sequences with self-organizing neural networks.
Connection Science 6(2/3), 223-246.

Page, M. P. A. and I. Nimmo-Smith (in prep.).
Properties of a localist, connectionist, Thurstonian model.

Page, M. P. A. and D. Norris (1997).
A localist implementation of the primacy model of immediate serial recall.
In J. Grainger and A. M. Jacobs (Eds.), Localist Connectionist Approaches to Human Cognition. Mahwah, NJ: Lawrence Erlbaum Associates.

Page, M. P. A. and D. Norris (1998).
The primacy model: A new model of immediate serial recall.
Psychological Review 105, 761-781.

Palmeri, T. J. (1997).
Exemplar similarity and the development of automaticity.
Journal of Experimental Psychology: Learning, Memory and Cognition 23(2), 324-354.

Pearce, J. M. (1994).
Similarity and discrimination.
Psychological Review 101, 587-607.

Pickering, A. D. (1997).
New approaches to the studying of amnesic patients: What can a neurofunctional philosophy and neural network methods offer.
Memory 5(1/2), 255-300.

Plaut, D. C., J. L. McClelland, M. S. Seidenberg, and K. Patterson (1996).
Understanding normal and impaired word reading: Computational principles in quasi-regular domains.
Psychological Review 103, 56-115.

Quartz, S. R. and T. J. Sejnowski (1997).
The neural basis of cognitive development: A constructivist manifesto.
Behavioral and Brain Sciences 20, 537-596.

Quillian, M. R. (1968).
Semantic memory.
In M. Minsky (Ed.), Semantic Information Processing, pp. 227-270. Cambridge, MA: MIT Press.

Ramsey, W. (1997).
Do connectionist representations earn their explanatory keep.
Mind and Language 12(1), 34-66.

Ratcliff, R. (1978).
A theory of memory retrieval.
Psychological Review 85(2), 59-108.

Ratcliff, R. (1990).
Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions.
Psychological Review 97, 285-308.

Robins, A. (1995).
Catastrophic forgetting, rehearsal, and pseudorehearsal.
Connection Science 7, 123-146.

Roelfsema, P. R., A. K. Engel, P. König, and W. Singer (1996).
The role of neuronal synchronization in response selection: A biologically plausible theory of structured representations in the visual cortex.
Journal of Cognitive Neuroscience 8(6), 603-625.

Rolls, E. T. (1989).
Parallel distributed processing in the brain: Implications of the functional architecture of neuronal networks in the hippocampus.
In R. G. M. Morris (Ed.), Parallel Distributed Processing: Implications for Psychology and Neurobiology. Oxford, UK: Oxford University Press.

Rolls, E. T., H. D. Critchley, and A. Treves (1996).
Representation of olfactory information in the primate orbitofrontal cortex.
Journal of Neurophysiology 75(5), 1982-1996.

Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986).
Learning internal representations by error propagation.
In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. Foundations, pp. 318-362. Cambridge, MA: MIT Press.

Rumelhart, D. E. and J. L. McClelland (1982).
An interactive activation model of context effects in letter perception: Part 2. The contextual enhancement effect and some tests and extensions of the model.
Psychological Review 89, 60-94.

Rumelhart, D. E., J. L. McClelland, and PDP Research Group (1986).
Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. Foundations.
Cambridge, MA: MIT Press.

Rumelhart, D. E. and D. Zipser (1986).
Feature discovery by competitive learning.
In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. Foundations, pp. 151-193. Cambridge, MA: MIT Press.

Sakai, K. and Y. Miyashita (1991).
Neural organization for the long-term memory of paired associates.
Nature 354, 152-155.

Sakai, K., Y. Naya, and Y. Miyashita (1994).
Neuronal tuning and associative mechanisms in form representation.
Learning and Memory 1, 83-105.

Salzman, C. D. and W. T. Newsome (1994).
Neural mechanisms for forming a perceptual decision.
Science 264, 231-236.

Seidenberg, M. S. (1993).
Connectionist models and cognitive theory.
Psychological Science 4, 228-235.

Seidenberg, M. S. and J. L. McClelland (1989).
A distributed, developmental model of word recognition.
Psychological Review 96, 523-568.

Sharkey, N. E. and A. J. C. Sharkey (1995).
An analysis of catastrophic interference.
Connection Science 7(3/4), 301-329.

Shastri, L. and V. Ajjanagadde (1993).
From simple associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings using temporal asynchrony.
Behavioral and Brain Sciences 16, 417-494.

Shepard, R. N. (1958).
Stimulus and response generalization: Deduction of the generalization gradient from a trace model.
Psychological Review 65(4), 242-256.

Shepard, R. N. (1987).
Towards a universal law of generalization for psychological science.
Science 237, 1317-1323.

Sloman, S. A. and D. E. Rumelhart (1992).
Reducing interference in distributed memories through episodic gating.
In A. S. Healy, S. Kosslyn, and R. Shiffrin (Eds.), From Learning Theory to Cognitive Processes: Essays in Honor of William K. Estes. Hillsdale, NJ: Lawrence Erlbaum Associates.

Snowden, J. S., H. L. Griffiths, and D. Neary (1996).
Semantic-episodic memory interactions in semantic dementia: Implications for retrograde memory function.
Cognitive Neuropsychology 13(8), 1101-1137.

Squire, L. R., N. J. Cohen, and L. Nadel (1984).
The medial temporal region and memory consolidation: A new hypothesis.
In H. Weingartner and E. Parker (Eds.), Memory Consolidation, pp. 185-210. Hillsdale, NJ: Erlbaum.

Thorpe, S. (1995).
Localized versus distributed representations.
In M. A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, pp. 549-552. Cambridge, MA: MIT Press.

Thurstone, L. L. (1927).
Psychophysical analysis.
American Journal of Psychology 38, 368-389.

Usher, M. and J. L. McClelland (1995, December).
On the time course of perceptual choice: A model based on principles of neural computation.
Technical Report PDP.CNS.95.5, Dept. of Psychology, Carnegie Mellon University.

Valiant, L. (1994).
Circuits of the Mind.
New York: Oxford University Press.

van Santen, J. P. H. and D. Bamber (1981).
Finite and infinite state confusion models.
Journal of Mathematical Psychology 24, 101-111.

van Zandt, T. and R. Ratcliff (1995).
Statistical mimicking of reaction time data: Single-process models, parameter variability, and mixtures.
Psychonomic Bulletin and Review 2(1), 20-54.

Yellott Jr., J. I. (1977).
The relationship between Luce's choice axiom, Thurstone's theory of comparative judgement, and the double exponential distribution.
Journal of Mathematical Psychology 15, 109-144.

Young, M. P. and S. Yamane (1992, May).
Sparse population coding of faces in the inferotemporal cortex.
Science 256, 1327-1331.

Young, M. P. and S. Yamane (1993).
An analysis at the population level of the processing of faces in the inferotemporal cortex.
In T. Ono, L. R. Squire, M. E. Raichle, D. I. Perrett, and M. Fukuda (Eds.), Brain Mechanisms of Perception and Memory: From Neuron to Behavior, Chapter 4, pp. 47-70. NY: Oxford University Press.

Zorzi, M., G. Houghton, and B. Butterworth (1998).
Two routes or one in reading aloud? A connectionist dual-process model.
Journal of Experimental Psychology: Human Perception and Performance 24, 1131-1161.

Footnotes

1. Though see Bundesen (1993) for a review of independent race models that have similar properties with regard to the Luce choice rule. This review came to my attention too late in the preparation of this target article to allow proper discussion within.

2. Indeed, the optimal, Bayesian approach to decoding activation patterns on the coding layer, which Oram, Földiák, Perrett & Sengpiel (1998) discuss and prefer to the weighted vector method described above, can itself be cast in the form of a localist classifier of the type discussed earlier. The link relies on formal similarities between the Luce-Shepard choice rule and Bayes' rule as applied to conditional probabilities expressed as exponential functions. Decoding would comprise a winner-take-all competition over a layer of cells, each responding to and classifying the patterns of activation found in the coding layer. Each cell in the classification layer would respond best to a given pattern of activation over the coding layer (itself corresponding to a given directional stimulus) and less strongly to more distant patterns. Activations in the classification layer would therefore appear to comprise another distributed coding of motion direction, even though they are decodable (to give the maximally likely stimulus direction) by a simple localist competitive process.
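
The formal similarity mentioned in this footnote can be illustrated with a minimal sketch. It assumes, purely for illustration, that the likelihood of a coding-layer pattern given each candidate stimulus is proportional to exp(-d), where d is a hypothetical distance between the observed pattern and the pattern that stimulus typically evokes, with the same normalizing constant for every stimulus. The function names and the numerical values are inventions for this example and are not taken from the model described in the article.

```python
import math

def bayes_posterior(likelihoods, priors):
    """Bayes' rule: posterior is likelihood * prior, normalized over classes."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]

def luce_shepard(distances, biases):
    """Luce-Shepard choice rule: P(i) = b_i * exp(-d_i) / sum_j b_j * exp(-d_j)."""
    strengths = [b * math.exp(-d) for b, d in zip(biases, distances)]
    total = sum(strengths)
    return [s / total for s in strengths]

# Hypothetical distances from the observed coding-layer pattern to the
# patterns preferred by three classification-layer cells (assumed values).
distances = [0.2, 1.0, 2.5]
priors = [1.0 / 3.0] * 3  # equal prior probability for each stimulus (assumption)

# Assumed exponential likelihoods, as in "conditional probabilities
# expressed as exponential functions".
likelihoods = [math.exp(-d) for d in distances]

posterior = bayes_posterior(likelihoods, priors)
choice = luce_shepard(distances, priors)

# A winner-take-all competition selects the most active cell, which here
# is also the maximum-posterior (maximally likely) stimulus.
winner = max(range(len(posterior)), key=posterior.__getitem__)
```

Under these assumptions the two rules give identical probabilities, with the priors playing the role of the bias terms in the choice rule, and the winner-take-all step returns the cell whose preferred pattern is closest to the observed one.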

Author Notes

I would like to thank Dennis Norris, Ian Nimmo-Smith, John Duncan, Rik Henson, Gareth Gaskell and Andy Young for many useful discussions and for their help in the preparation of this manuscript. I would also like to thank D. Amit, H. Barlow, M. Coltheart, J. Feldman, R. French, J. Murre, P. Thagard, X. Wu and other, anonymous reviewers for their thoughtful comments.

All correspondence and requests for reprints should be sent to Mike Page, M.R.C. Cognition and Brain Sciences Unit, 15, Chaucer Rd., Cambridge, CB2 2EF, U.K. (mike.page@mrc-cbu.cam.ac.uk)