Below is the unedited penultimate draft of:
Page Mike (2000) Connectionist Modelling in Psychology: A Localist
Manifesto
Behavioral and Brain Sciences 23 (4): XXX-XXX.
This is the unedited penultimate draft of a BBS target article that has been accepted for publication (Copyright 1999: Cambridge University Press U.K./U.S. -- publication date provisional) and is currently being circulated for Open Peer Commentary. This preprint is for inspection only, to help prospective commentators decide whether or not they wish to prepare a formal commentary. Please do not prepare a commentary unless you have received the hard copy, invitation, instructions and deadline information.
CONNECTIONIST MODELLING IN PSYCHOLOGY:
A LOCALIST MANIFESTO
Mike Page
Medical Research Council Cognition and Brain Sciences Unit,
15, Chaucer Rd.,
Cambridge, CB2 2EF,
U.K.
mike.page@mrc-cbu.cam.ac.uk
http://www.mrc-cbu.cam.ac.uk/
Over the last decade, fully-distributed models have become dominant in connectionist
psychological modelling, whereas the virtues of localist models have been underestimated.
This target article illustrates some of the benefits of localist modelling. Localist
models are characterized by the presence of localist representations rather than the
absence of distributed representations. A generalized localist model is proposed that
exhibits many of the properties of fully distributed models. It can be applied to a number
of problems that are difficult for fully distributed models and its applicability can be
extended through comparisons with a number of classic mathematical models of behaviour.
There are reasons why localist models have been underused and these are addressed. In
particular, many conclusions about connectionist representation, based on neuroscientific
observation, are called into question. There are still some problems inherent in the
application of fully distributed systems and some inadequacies in proposed solutions to
these problems. In the domain of psychological modelling, localist modelling is to be
preferred.
KEYWORDS: connectionist modelling, neural networks, localist, distributed, competition, choice, reaction-time, consolidation.
The aim of this target article is to demonstrate the power, flexibility and plausibility of connectionist models in psychology that use localist representations. I will take care to define the terms ``localist'' and ``distributed'' in the context of connectionist models and to identify the essential points of contention between advocates of each type of model. Localist models will be related to some classic mathematical models in psychology and some of the criticisms of localism will be addressed. This approach will be contrasted with a currently popular one in which localist representations play no part. The conclusion will be that the localist approach is preferable whether one considers connectionist models as psychological-level models, or as models of the underlying brain processes.
At the time of writing, it is thirteen years since the publication of ``Parallel Distributed Processing: Explorations in the Microstructure of Cognition'' (Rumelhart, McClelland and the PDP Research Group 1986). That two-volume set has had an enormous influence on the field of psychological modelling (among others) and justifiably so, having helped to revive widespread interest in the connectionist enterprise after the seminal criticisms of Minsky and Papert (1969). In fact, despite Minsky and Papert's critique, a number of researchers (e.g., S. Amari, K. Fukushima, S. Grossberg, T. Kohonen, C. von der Malsburg, etc.) had continued to develop connectionist models throughout the 1970s, often in directions rather different from that in which the 1980s ``revival'' later found itself heading. More specifically, much of the earlier work had investigated networks in which localist representations played a prominent role, whereas, by contrast, the style of modelling that received most attention as a result of the PDP research group's work was one that had at its centre the concept of distributed representation. It is more than coincidental that the word ``distributed'' found itself centrally located in both the name of the research group and the title of its major publication but it is important to note that, in these contexts, the words ``parallel'' and ``distributed'' both refer to processing rather than to representation. Although it is unlikely that anyone would deny that processing in the brain is carried out by many different processors in parallel (i.e. at the same time) and that such processing is necessarily distributed (i.e. in space), the logic that leads from a consequent commitment to the idea of distributed processing, to an equally strong commitment to the related, but distinct, notion of distributed representation, is more debatable.
In this target article I hope to show that the thoroughgoing use of distributed representations, and the learning algorithms associated with them, is very far from being mandated by a general commitment to parallel distributed processing.
As indicated above, I will advocate a modelling approach that supplements the use of distributed representations (the existence of which, in some form, nobody could deny) with the additional use of localist representations. The latter have acquired a bad reputation in some quarters. This cannot be directly attributed to the PDP books themselves, in which several of the models were localist in flavour (e.g. interactive activation and competition models, competitive learning models). Nonetheless, the terms ``PDP'' and ``distributed'' on the one hand, and ``localist'' on the other, have come to be seen as dichotomous. I will show this apparent dichotomy to be false and will identify those issues over which there is genuine disagreement.
A word of caution: ``Neural networks'' have been applied in a wide variety of other areas in which their plausibility as models of cognitive function is of no consequence. In criticizing what I see to be the overuse (or default use) of fully distributed networks, I will accordingly restrict discussion to their application in the field of connectionist modelling of cognitive or psychological function. Even within this more restricted domain there has been a large amount written about the issues addressed here. Moreover, it is my impression that the sorts of things to be said in defence of the localist position will have occurred independently to many of those engaged in such a defence. I apologize in advance, therefore, for necessarily omitting any relevant references that have so far escaped my attention. No doubt the BBS commentary will set the record straight.
The next section will define some of the terms to be used throughout this paper. As will be seen, there are certain subtleties in such definitions that becloud the apparent clarity of the localist/distributed divide.
Before defining localist and distributed representations, we establish some more basic vocabulary. In what follows the word nodes will refer to the simple units out of which connectionist networks have traditionally been constructed. A node might be thought of as consisting of a single neuron or a distinct population of neurons (e.g. a cortical minicolumn). A node will be referred to as having a level of activation, where a loose analogy is drawn between this activation and the firing rate (mean or maximum firing rate) of a neuron (population). The activation of a node might lead to an output signal's being projected from it. The projection of this signal will be deemed to be along one or more weighted connections, where the concept of weight in some way represents the variable ability of output from one node to affect processing at a connected node. The relationship between the weighted input to a given node (i.e. those signals projected to it from other nodes), its activation, and the output which it in turn projects, will be summarized using a number of simple, and probably familiar, functions. All of these definitions are, I hope, an uncontroversial statement of the basic aspects of the majority of connectionist models.
The following definitions, drawn from the recent literature, largely capture the difference between localist and distributed representations. First, distributed representations:
``many neurons participate in the representation of each memory and different representations share neurons'' (Amit 1995, p.621).
``the model makes no commitment to any particular form of representation, beyond supposing that the representations are distributed; that is, each face, semantic representation, or name is represented by multiple units, and each unit represents multiple faces, semantic units or names.'' (Farah, O'Reilly & Vecera 1993, p.577)
The latter definition refers explicitly to a particular model of face naming, but the intended nature of distributed representations in general is clear. To illustrate the point, suppose we wished to represent the four entities ``John'', ``Paul'', ``George'' and ``Ringo''. Figure 1a shows distributed representations for these entities. Each representation involves a pattern of activation across four nodes and, importantly, there is overlap between the representations. For instance, the first node is active in the patterns representing both John and Ringo, the second node is active in the patterns representing both John and Paul, and so on. A corollary of this is that the identity of the entity that is currently represented cannot be unambiguously determined by inspecting the state of any single node.
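[Figure 1: (a) distributed and (b) localist representations of John, Paul, George and Ringo across four nodes.]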
Now consider the skeleton of a definition of a localist representation, as contrasted with a distributed coding:
``...with a local representation, activity in individual units can be interpreted directly...with distributed coding individual units cannot be interpreted without knowing the state of other units in the network'' (Thorpe 1995, p.550).
As an example of a localist representation of our four entities, see Figure 1b. In such a representation, only one node is active for any given entity. As a result, activity at a given unit can unambiguously identify the currently represented entity.
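The contrast between the two coding schemes can be made concrete with a small sketch. The binary patterns below are assumptions: they are chosen only to respect the overlaps described in the text (the first node shared by John and Ringo, the second by John and Paul, and so on), and the function name `entities_consistent_with` is purely illustrative.

```python
# Illustrative binary codes over four nodes. The exact patterns are assumed;
# they merely respect the overlaps described in the text.
DISTRIBUTED = {
    "John":   (1, 1, 0, 0),
    "Paul":   (0, 1, 1, 0),
    "George": (0, 0, 1, 1),
    "Ringo":  (1, 0, 0, 1),
}
LOCALIST = {
    "John":   (1, 0, 0, 0),
    "Paul":   (0, 1, 0, 0),
    "George": (0, 0, 1, 0),
    "Ringo":  (0, 0, 0, 1),
}


def entities_consistent_with(code, node):
    """Return the entities whose pattern has the given node active."""
    return sorted(name for name, pattern in code.items() if pattern[node] == 1)


for node in range(4):
    # Distributed: an active node is compatible with two entities.
    # Localist: an active node picks out exactly one entity.
    print(node,
          entities_consistent_with(DISTRIBUTED, node),
          entities_consistent_with(LOCALIST, node))
```

Under the distributed code every active node leaves two candidates open, whereas under the localist code each active node identifies its referent on its own.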
When nodes are binary (i.e., having activity 1 or 0), these definitions are reasonably clear. But how are they affected if activity can take, for example, any value between these limits? The basic distinction remains: in the localist model, it will still be possible to interpret the state of a given node independent of the states of other nodes. A natural way to ``interpret'' the state of a node embedded in a localist model would be to propose, as did Barlow (1972), a monotonic mapping between activity and confidence in the presence of the node's referent:
``the frequency of neural impulses codes subjective certainty: a high impulse frequency in a given neuron corresponds to a high degree of confidence that the cause of the percept is present in the external world'' (Barlow 1972, p.381)
It may be that the significance of the activation of a given node is assessed in relation to a threshold value, such that only super-threshold activations are capable of indicating nonzero confidence. Put another way, the function relating activation to ``degree of confidence'' would not necessarily be linear, or even continuously differentiable, in spite of being monotonic nondecreasing.
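One function with these properties can be sketched as follows. The threshold value of 0.3, the piecewise-linear form and the name `confidence` are all assumptions made for illustration; the text commits only to monotonicity and to the possibility of a threshold.

```python
def confidence(activation, threshold=0.3):
    """Map an activation in [0, 1] to a degree of confidence in [0, 1].

    Sub-threshold activation signals zero confidence; above threshold the
    confidence rises linearly. The mapping is monotonic nondecreasing but
    has a kink at the threshold, so it is not differentiable there.
    (Threshold and linear form are illustrative assumptions.)
    """
    if activation <= threshold:
        return 0.0
    return min(1.0, (activation - threshold) / (1.0 - threshold))


# Evaluate on a grid to show the flat region followed by the linear rise.
grid = [i / 10 for i in range(11)]
print([round(confidence(a), 2) for a in grid])
```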
Having offered both Thorpe's and Barlow's descriptions of localist representation, it is worth pointing out that interpreting a node's activation as ``degree of confidence'' is potentially inconsistent with the desire to interpret a given node's activation ``directly,'' that is independent of the activation of other nodes. For example, suppose, in a continuous-activation version of Figure 1b, that two nodes have near maximal activity. In some circumstances we will be happy to regard this state as evidence that both the relevant referents are present in the world: in this case the interpretation of the node activations will conform to the independence assumption. In other cases, we might regard such a state as indicating some ambiguity as to whether one referent or the other is present. In these cases it is not strictly true to say that the degree of confidence in a particular referent can be assessed by looking at the activation of the relevant node alone, independent of that of other nodes. One option is to assume instead that activation maps onto relative degree of confidence, such that degree of activation is interpreted relative to that of other nodes. Although strictly inconsistent with Thorpe's desire for direct interpretation, this preserves what is essential about a localist scheme, namely that the entity about which relative confidence is being expressed is identified with a single node. Alternatively, both Thorpe's and Barlow's definitions can be simultaneously maintained if some competitive process is implemented directly (i.e. mechanically), such that it is impossible to sustain simultaneously high activations at two nodes whose interpretations are contradictory. A scheme of this type would, for example, allow two nodes to compete for activation so as to exclusively identify a single person.
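The two options just described can be sketched in a few lines. Both functions are hypothetical illustrations, not a mechanism specified in the text: `relative_confidence` simply normalizes activations, and `compete` iterates a simple lateral-inhibition rule whose parameters (`excitation`, `inhibition`, `steps`) are assumed values.

```python
def relative_confidence(activations):
    """Interpret each activation relative to the summed activation."""
    total = sum(activations)
    if total == 0:
        return [0.0 for _ in activations]
    return [a / total for a in activations]


def compete(activations, excitation=0.1, inhibition=0.2, steps=50):
    """Iterate a simple lateral-inhibition rule (illustrative parameters).

    Each node excites itself slightly and is suppressed in proportion to
    the summed activation of its rivals; activations are clipped to [0, 1].
    """
    acts = list(activations)
    for _ in range(steps):
        total = sum(acts)
        acts = [
            min(1.0, max(0.0, a + excitation * a - inhibition * (total - a)))
            for a in acts
        ]
    return acts


# Two nodes with near-maximal activity and contradictory interpretations.
print(relative_confidence([0.9, 0.9]))  # equal relative confidence
print(compete([0.9, 0.8]))              # the stronger node suppresses its rival
```

With these assumed parameters the initially stronger node ends fully active and its rival is driven to zero, so simultaneously high activation at both nodes cannot be sustained.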
As an aside, note that a simple competitive scheme has some disadvantages. Such a scheme is apparently inadequate for indicating the presence of two entities, say, John and Paul, by strongly activating the two relevant nodes simultaneously. One solution to this apparent conundrum might be to invoke the notion of binding, perhaps implemented by phase relationships in node firing patterns (e.g. Hummel & Biedermann 1992; Roelfsema, Engel, König & Singer 1996; Shastri & Ajjanagadde 1993). (Phase relationships are only one candidate means of perceptual binding and will be assumed here solely for illustrative purposes.) Thus, in the case in which we wish both John and Paul to be simultaneously indicated, both nodes can activate fully but out of phase with each other, thus diminishing the extent to which they compete. This out-of-phase relationship might stem from the fact that the two entities driving the system (John and Paul) must be in two different spatial locations, allowing them to be ``phased'' separately. In the alternative scenario, that is, when only one individual is present, the nodes representing alternative identifications might be in phase with each other, driven as they are by the same stimulus object, and would therefore compete as required.
A similar binding scheme might also be useful if distributed representations are employed. On the face of it, using the representations in Figure 1a, the pattern for John and George will be the same as that for Paul and Ringo. It may be possible to distinguish these summed patterns on the basis of binding relationships as before -- to represent John and George the first and second nodes would be in phase with each other while the third and fourth nodes would both be out of phase with the first two nodes and in phase with each other. But complications arise when we wish to represent, say, John and Paul: would the second node be in phase with the first node or the third? It is possible that in this case the second node would fire in both the phases associated with nodes one and three (though this would potentially affect its firing rate as well as its phase relationships). Mechanisms for binding are the focus of a good deal of ongoing research, so I shall not develop these ideas further here.
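The ambiguity of summed distributed patterns can be checked directly. As before, the binary patterns are assumptions consistent with the overlaps described in the text, and `superimpose` is a hypothetical helper that takes the node-wise maximum of the patterns for entities present together.

```python
# Assumed distributed code over four nodes (consistent with the text's overlaps).
DISTRIBUTED = {
    "John":   (1, 1, 0, 0),
    "Paul":   (0, 1, 1, 0),
    "George": (0, 0, 1, 1),
    "Ringo":  (1, 0, 0, 1),
}


def superimpose(code, names):
    """Node-wise maximum of the patterns for simultaneously present entities."""
    return tuple(max(bits) for bits in zip(*(code[name] for name in names)))


john_george = superimpose(DISTRIBUTED, ["John", "George"])
paul_ringo = superimpose(DISTRIBUTED, ["Paul", "Ringo"])
# Without binding information the two pairs are indistinguishable.
print(john_george, paul_ringo, john_george == paul_ringo)
```

In the John and Paul case the superimposed pattern activates the second node on behalf of both entities, which is where the phase-assignment complication arises.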
In a discussion of localist and distributed representations it is hard to avoid the subject of ``grandmother cells.'' The concept can be traced back to a lecture series delivered by Jerome Lettvin in 1969 (see Lettvin's appendix to Barlow 1995), in which he introduced, to a discussion on neural representation, an allegory, in which a neuroscientist located in the brains of his animal subjects
``some 18,000 neurons...that responded uniquely only to the animal's mother, however displayed, whether animate or stuffed, seen from before or behind, upside down or on a diagonal, or offered by caricature, photograph or abstraction.'' (from appendix to Barlow 1995)
The allegorical neuroscientist ablated the equivalent cells in a human subject, who, postoperatively, could not conceive of ``his mother,'' while maintaining a conception of mothers in general. The neuroscientist, who was intent on showing that ``ideas are contained in specific cells'', considered his position to be vulnerable to philosophical attack, and rued not having searched for grandmother cells instead, grandmothers being ``notoriously ambiguous and often formless''.
The term ``grandmother cell'' has since been used extensively in discussions of neural representation, though not always in ways consistent with Lettvin's original conception. It seems that (grand)mother cells are considered by some to be the necessary extrapolation of the localist approach and thereby to demonstrate its intrinsic folly. I believe this conclusion to be entirely unjustified. Whatever the relevance of Lettvin's allegory, it certainly does not demonstrate the necessary absurdity of (grand)mother cells and, even if it did, this would not warrant a similar conclusion regarding localist representations in general. Given the definitions so far advanced, it is clear that, while (grand)mother cells are localist representations, not all localist representations necessarily have the characteristics attributed by Lettvin to (grand)mother cells. This depends on how one interprets Lettvin's words ``responded uniquely'' (above). A localist representation of one's grandmother might respond partially, but subthreshold, to a similar entity (e.g. one's great aunt), thus violating one interpretation of the ``unique response'' criterion that forms part of the grandmother-cell definition.
A related point concerns the ``yellow Volkswagen cells'' referred to by Harris (1980). Harris' original point, which dates back to a talk given in 1968, illustrated a concern regarding a potential proliferation in the types of selective cells hypothesized to be devoted to low-level visual coding. Such a proliferation had been suggested by experiments into, for instance, the ``McCollough Effect'' (McCollough 1965) which had led to the positing of detectors sensitive to particular combinations of orientation and colour. The message that has been extrapolated from Harris' observation is one concerning representational capacity: that while ``yellowness'' cells and ``Volkswagen cells'' may be reasonable, surely specific cells devoted to ``yellow Volkswagens'' are not. The fear is that if yellow VWs are to be locally represented then so must the combinatorially explosive number of equivalent combinations (e.g. lime-green Minis). There is something odd about this argument. In accepting the possibility of Volkswagen cells, it raises the question as to why the fear of combinatorial explosion is not already invoked at this level. Volkswagens themselves must presumably be definable as a constellation of a large number of adjective-noun properties (curved roof, air-cooled engine etc.), and yet accepting the existence of Volkswagen cells does not presume a vast number of other cells, one for each distinct combination of feature-values in whatever feature-space VWs inhabit. On a related point, on occasions when the (extrapolated) yellow-VW argument is invoked, it is not always clear whether the supposed combinatorial explosion refers to the number of possible percepts, which is indeed unimaginably large, or to the vanishingly smaller number of percepts which are witnessed and, in some sense, worth remembering. Since the latter number is likely to grow only approximately linearly with lifespan, fears of combinatorial explosion are unwarranted.
It is perfectly consistent with the localist position that different aspects of a stimulus (e.g. colour, brand-name etc.) can be represented separately, and various schemes have been suggested for binding such aspects together so as to correctly represent, in the short term, a given scene (e.g. Hummel & Biedermann 1992; Roelfsema, Engel, König & Singer 1996; see earlier). This systematicity (cf. Fodor & Pylyshyn 1988) in the perceptual machinery addresses the problem of combinatorial explosion regarding the number of possible percepts. It in no way implies, however, that in a localist model each possible percept must be allocated its own permanent representation, that is, its own node. A similar point was made by Hummel and Holyoak (1997) who noted that
``...it is not necessary to postulate the preexistence of all possible conjunctive units. Rather a novel binding can first be represented dynamically (in active memory), with a conjunctive unit created only when it is necessary to store the binding in LTM.'' (p.434)
It is entirely consistent with the localist position to postulate that cells encoding specific combinations will be allocated only when needed: perhaps in an experiment in which pictures of yellow VWs and red bikes require one response, while red VWs and yellow bikes require another (cf. XOR); or, more prosaically, in establishing the memory that one's first car was a yellow VW. When one restricts the number of localist representations to those sufficient to describe actual percepts of behavioural significance (i.e. those which require long-term memorial representation) the threat of combinatorial explosion dissipates. Later I shall show how new localist nodes can be recruited, as needed, for the permanent representation of previously unlearned configurations (cf. the constructivist learning of Quartz & Sejnowski 1997, and the accompanying commentary by Grossberg 1997; Valiant 1994).
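The difference between the number of possible combinations and the number of nodes actually recruited can be illustrated with a toy sketch. The class `ConjunctionStore`, its `store` method and the small feature lists are hypothetical; the sketch shows only the bookkeeping of on-demand allocation, not a learning mechanism.

```python
from itertools import product


class ConjunctionStore:
    """Hypothetical store that recruits one node per stored feature combination."""

    def __init__(self):
        self.nodes = {}  # maps a feature combination to its node index

    def store(self, *features):
        """Return the node for this combination, recruiting one if needed."""
        key = frozenset(features)
        if key not in self.nodes:
            self.nodes[key] = len(self.nodes)  # recruit a new node on demand
        return self.nodes[key]


colours = ["yellow", "red", "lime-green", "blue"]
objects = ["VW", "bike", "Mini"]
possible = len(list(product(colours, objects)))  # every colour-object pairing

memory = ConjunctionStore()
memory.store("yellow", "VW")  # e.g. one's first car
memory.store("red", "bike")
memory.store("yellow", "VW")  # already stored: no new node is recruited

print(possible, len(memory.nodes))
```

The count of possible pairings grows multiplicatively with each added feature dimension, whereas the count of recruited nodes grows only with the number of distinct combinations that have actually been stored.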
The above discussion of yellow VWs illustrates the issue of featural representation. A featural representation will be defined here as a representation comprising an array of localist nodes in appropriate states. Figure 2 shows the featural representations of Tony Blair, Glenda Jackson, Anthony Hopkins and Queen Elizabeth II, where the relevant features are ``is-a-woman,'' ``is-a-politician'' and ``is/was-a-film-actor.'' Clearly, the representations of these four entities are distributed, in the sense that the identity of the currently present entity cannot be discerned by examining the activity of any individual node. Nonetheless, the features themselves are locally represented (cf. ``is-yellow,'' ``is-a-Volkswagen''). Whether or not a politician is currently present can be decided by examining the activity of a single node, independent of the activation of any other node.
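[Figure 2: featural representations of Tony Blair, Glenda Jackson, Anthony Hopkins and Queen Elizabeth II over the features ``is-a-woman,'' ``is-a-politician'' and ``is/was-a-film-actor.'']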
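A featural code of this kind can be sketched as follows. The feature values are assumptions filled in from general knowledge of the four people rather than read from Figure 2, and the helper names `has_feature` and `people_with` are illustrative.

```python
FEATURES = ("is-a-woman", "is-a-politician", "is/was-a-film-actor")

# Assumed feature values; each entry is a pattern over the three feature nodes.
PEOPLE = {
    "Tony Blair":         (0, 1, 0),
    "Glenda Jackson":     (1, 1, 1),
    "Anthony Hopkins":    (0, 0, 1),
    "Queen Elizabeth II": (1, 0, 0),
}


def has_feature(person, feature):
    """Read a single feature node, independent of every other node."""
    return PEOPLE[person][FEATURES.index(feature)] == 1


def people_with(feature):
    """Everyone whose pattern has the given feature node active."""
    return sorted(name for name in PEOPLE if has_feature(name, feature))


# A single node settles whether a politician is present ...
print(has_feature("Tony Blair", "is-a-politician"))
# ... but does not settle which person is present.
print(people_with("is-a-politician"))
```

Each feature is locally represented, yet identity is distributed: every active feature node is shared by two of the four people.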
It is curious that researchers otherwise committed to the thoroughgoing use of distributed representations have been happy to use such featural representations. For instance, Farah et al. (1993), whose commitment to distributed representations was quoted earlier, used a distributed representation for semantic information relating to particular people. To continue the earlier quotation:
``The information encoded by a given unit will be some `microfeature'...that may or may not correspond to an easily labeled feature (such as eye color in the case of faces). The only units for which we have assigned an interpretation are the `occupation units' within the semantic pool. One of them represents the semantic microfeature `actor' and the other represents the semantic microfeature `politician'.'' (Farah et al. 1993, p.577)
It would be odd to be comfortable with the idea of nodes representing ``is-an-actor,'' and yet hostile to the idea of nodes representing ``is-Tony-Blair'' or ``is-my-grandmother''. If ``is-an-actor'' is a legitimate microfeature (though one wonders what is micro about it), then why is ``is-Tony-Blair'' not? Is there any independent rationale for what can and cannot be a microfeature? Moreover, to anticipate a later discussion, by what learning mechanism are the localist (micro)featural representations (e.g. ``is-an-actor'') themselves deemed to be established? The most natural assumption is that, at some level, local unsupervised featural learning is carried out. But a commitment to fully distributed representation of identity, if not profession, would therefore require that at some arbitrary stage just before the level at which identity features (e.g. ``is-Tony-Blair'') might emerge, a different, supervised learning mechanism cuts in.
Whether or not we choose to define featural representations as a subclass of distributed representations has little to do with the core of the localist/distributed debate. No localist has ever denied the existence of distributed representations, especially, but not exclusively, if these are taken to include featural representations. To do so would have entailed a belief that percepts ``go local'' in a single step, from retina directly to grandmother cell, for instance. The key tenet of the localist position is that, on occasion, localist representations of meaningful entities in the world (e.g. words, names, people, etc.) emerge and allow, among other things, distributed/featural patterns to be reliably classified and enduringly associated.
I should make clear that restricting the definition in the preceding paragraph to ``meaningful entities in the world'' is simply a rather clumsy way of avoiding potentially sterile discussions of how far localist representation extends down the perceptual hierarchy. To take a concrete example, one might ask whether an orientation column (OC) in the visual cortex should be considered a localist representation of line segments in a particular part of the visual field and at a particular angular orientation. An opponent of such a localist description might argue that in most everyday circumstances nothing of cognitive significance (nothing of meaning, if you like) will depend on the activation state of an individual OC and that later stages in the perceptual path will best be driven by a distributed pattern of activation across a number of OCs so as to preserve the information available in the stimulus. I am sympathetic to this argument -- there seems little point in describing a representation as localist if it is never interpreted in a localist manner. Nonetheless, to temper this conclusion somewhat, imagine an experiment in which a response is learned that depends on which of two small line segments, differing only in orientation, is presented. Assuming that such a discrimination is learnable, it does not seem impossible a priori that a connectionist model of the decision task would depend rather directly on the activations of specific OCs. (The issue is related to the decoding of population vectors, discussed briefly in section 4.3.1 and in the accompanying footnote.) I have not modelled performance in this rather contrived task and hence cannot say what should be concluded from such a model. One can simply note that certain models might lead to a more charitable view towards an interpretation that treated single OCs as localist representations. 
The general point is that a representation might be labelled localist or not depending on the particulars of the modelled task in which the corresponding nodes are taken to be involved. Whether one chooses to reserve the term localist for representations that are habitually involved in processes/tasks that highlight their localist character or, alternatively, whether one allows the term to apply to any representational unit that can at some time (albeit in unusual or contrived circumstances) usefully be treated as localist, is probably a matter of taxonomic taste. For fear of getting unnecessarily involved in such matters, I will retreat to using the term localist to refer, as above, to a form of representation of meaningful entities in the world whose localist character is habitually displayed. I do so in the hope and belief that, at least in the modelling of most types of cognitive-psychological task, it will be clear what the relevant meaningful entities are.
Given the definitions of localist and distributed representations discussed so far,
what are we to understand by the term ``a localist model''? The first and most crucial
point, alluded to above, is that a localist model is not well defined as one that uses
localist rather than distributed representations: localist models almost always use both
localist and distributed representations. More explicitly, any entity that is locally
represented at layer $n$ of a hierarchy is sure to be represented in a
distributed fashion at layer $n-1$. To illustrate, take as an example the
interactive activation (IA) model of visual word recognition (McClelland & Rumelhart
1981; Rumelhart and McClelland 1982) which is generally agreed to be localist. The model
employs successive processing layers: In the ``lowest'' of these are visual-feature
detectors, which respond selectively to line segments in various orientations; in the next
layer are nodes which respond selectively to letters in various positions in a word; in
the third, nodes which respond maximally to individual familiar words. Thus, a given word
is represented locally in the upper layer and in a distributed fashion at the two previous
layers. Letters-in-position are likewise represented locally in the second layer but in a
distributed manner in the first layer. It accordingly makes no sense to define a localist
model as one that precludes distributed representation. A better definition relies only on
whether or not there are localist representations of the relevant entities.
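The coexistence of local and distributed codes across layers can be sketched with a toy example. The three-word lexicon and the function names `letter_layer` and `word_layer` are assumptions for illustration; this is not an implementation of the IA model, and it omits the feature layer and all activation dynamics.

```python
LEXICON = ["CAT", "CAR", "BAT"]  # toy vocabulary, assumed for illustration


def letter_layer(word):
    """Active letter-in-position nodes, e.g. ('C', 0) for C in first position."""
    return {(letter, position) for position, letter in enumerate(word)}


def word_layer(word):
    """One node per familiar word; only the presented word's node is active."""
    return [1 if entry == word else 0 for entry in LEXICON]


# At the word layer each word has its own node (a local code) ...
print(word_layer("CAT"), word_layer("CAR"))
# ... while at the letter layer the same words are overlapping patterns.
print(sorted(letter_layer("CAT") & letter_layer("CAR")))
```

Each letter-in-position is itself locally coded at the letter layer, while a word is a pattern across several such nodes, some of which it shares with other words.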
It so happens that, in the IA example, the distributed representations at lower layers are of the featural variety, as discussed above. This, however, is not a crucial factor in the IA model's being labelled localist: The lower layers might have used distributed representations unamenable to a featural characterization without nullifying the fact that in the upper layer a localist code is used. The difference between localist and distributed models is most often not in the nature or status of the representation of the input patterns, which depends ultimately (in vivo) on the structure and function of the relevant sense organ(s), but in the nature of representation at the later stages of processing that input. As stated above, localists posit that certain cognitively meaningful entities will be represented in a local fashion at some, probably late, level of processing, and it is at this level that decisions about which entities are identifiable in any given input can best be made.
So can the term ``localist model'' be universally applied to models using localist representations? Not without care. Consider the model of reading proposed by Plaut, McClelland, Seidenberg and Patterson (1996). This was developed from the seminal model of Seidenberg and McClelland (1989), in which neither letters at the input nor phonemes at the output were represented in a local fashion. According to Plaut et al. it was this aspect of the model, among others, which manifested itself in its relatively poor nonword reading. Plaut et al. referred to this as the ``dispersion problem''. Perhaps, as Jacobs and Grainger (1994) rather archly suggest, it might better have been termed the distribution problem, given that Plaut et al.'s solution entailed a move to an entirely local scheme for both input orthography (letters and letter clusters) and output phonemes. And yet, even with this modification, it would be very misleading to call Plaut et al.'s a localist model: The most powerful and theoretically bold contribution of that model was to show that the mapping between orthographic representations of both words and nonwords and their pronunciations could be carried out in a distributed fashion, that is, without any recourse to either a locally represented mental lexicon or an explicit system of grapheme-to-phoneme correspondence rules. So whereas the Plaut et al. model was certainly localist at the letter and phoneme levels, it was undeniably distributed at the lexical level. It is for this reason that calling that model localist would be thoroughly misleading. I conclude that the term ``localist model'' should be used with care. In most cases, it will be better to be explicit about the entities for which localist coding is used (if any), and to identify the theoretical significance of this choice.
A further point should be made regarding localist models, again taking the IA model as our example. When a word is presented to the IA model, a large number of nodes will be maximally active -- those representing certain visual features, letters-in-position and the word itself. A number of other nodes will be partially active. On presentation of a nonword, no word-node will attain maximal activation but otherwise the situation will be much the same. The simple point is this: The fact that activity is distributed widely around the network should not lead to the incautious suggestion that the IA model is a distributed rather than a localist model. As noted earlier, it is important to distinguish between distributed processing and distributed representation. Having made this distinction we can better interpret labels that have been applied to other models in the literature, labels which might otherwise have the potential to confuse.
As an illustration, consider the Learning and Inference with Schemas and Analogies (LISA) model of Hummel and Holyoak (1997), as applied to the processing of analogy. The title of the paper (``Distributed Representations of Structure: A Theory of Analogical Access and Mapping'') might suggest that LISA is a fully distributed model, but a closer look reveals that it uses localist representation. For instance, in its representation of the proposition ``John loves Mary,'' there is a node corresponding to the proposition itself, and to each of the constituents ``John'', ``Mary'' and ``loves''; these nodes project in turn onto a layer of semantic units which are crucially involved in the analogy processing task. The whole network is hierarchically structured, with activity being distributed widely for any given proposition and, in this case, organized in time so as to reflect various bindings of, for example, subjects with predicates. (Most, if not all, models that use phase binding do so in the context of localist representation.) LISA thus constitutes a clear example of the interaction between localist representations of entities and a distributed or featural representation of semantics. As in the IA model, there is no contradiction between distributed processing and localist representation. At the risk of overstating the case, we can see exactly the same coexistence of local representation, distributed representation and distributed processing in what is often considered a quintessentially localist model namely Quillian's (1968) model of semantics. Quillian's justly influential model did indeed represent each familiar word with a localist ``type'' unit. But a word's meaning was represented by an intricate web of structured connections between numerous tokens of the appropriate types, resulting, on activation of a given word-type, in a whole plane of intricately structured spreading activation through which semantic associative relationships could become apparent.
To summarize, a localist model of a particular type of entity (e.g. words) is characterized by the presence of (at least) one node which responds maximally to a given familiar (i.e. learned) example of that type (e.g. a given familiar word), all familiar examples of that type (e.g. all familiar words) being so represented. This does not preclude some redundancy in coding. For example, in the word example used here, it may be that various versions of the same word (e.g. different pronunciations) are each represented locally, though in many applications these various versions would be linked at some subsequent stage so as to reflect their lexical equivalence.
It is hoped that this definition of what constitutes a localist model will help to clarify issues of model taxonomy. Under this taxonomy, the term ``semilocalist'' would be as meaningless as the term ``semipregnant''. But what are we to make of representations that are described as ``sparse distributed'' or ``semidistributed''? It is rather difficult to answer this question in general because there is often no precise definition of what is meant by these terms. Sparse distributed representational schemes are frequently taken to be those for which few nodes activate for a given stimulus with few active nodes shared between stimuli, but this definition begs a lot of questions. For example, how does the definition apply to cases in which nodes have continuous rather than binary activations? To qualify as a sparse distributed representational scheme, are nodes required to activate to identical degrees for several different stimuli (cf. Kanerva's binary Sparse Distributed Memory, Kanerva 1988; Keeler 1988)? Or are nodes simply required to activate (i.e. significantly above baseline) for more than one stimulus? Certainly in areas in which the term ``sparse distributed'' is often employed, such as in the interpretation of the results of single-cell recording studies, the latter formulation is more consistent with what is actually observed. As will be pointed out later, however, it is not really clear what distinction can be made between a sparse distributed scheme defined in this way and the localist schemes discussed above -- after all, the localist IA model would be classified as sparse distributed under this looser but more plausible definition. If the class of sparse distributed networks is defined so as to include both localist and nonlocalist networks as subclasses (as is often the case), then statements advocating the use of sparse distributed representation cannot be interpreted as a rejection of localist models.
A similar problem exists with the term ``semidistributed.'' French (1992) discusses two systems he describes as semidistributed. The first is Kanerva's sparse distributed memory (Kanerva 1988), a network of binary neurons inspired more by a digital computer metaphor than by a biological metaphor, but which nonetheless shows good tolerance to interference (principally due to the similarities it shares with the localist models described here). The second is Kruschke's (1990) ALCOVE model, which (in its implemented version at least) would be classified under the present definition as localist. French developed a third type of semidistributed network, using an algorithm which sought to ``sharpen'' hidden unit activations during learning. Unfortunately, this semidistributed network only semisolved the interference problem to which it was addressed, in that even small amounts of later learning could interfere drastically with the ability to perform mappings learned earlier. What advantage there was to be gained from using a semidistributed network was primarily to be found in a measure of time to relearn the original associations -- some compensation but hardly a satisfactory solution to the interference problem itself.
It is informative to note that French's (1992) motivation for using a semidistributed rather than a localist network was based on his assumption that localist models acquire their well-known resistance to interference by sacrificing their ability to generalize. In what follows I will question this common assumption, and others regarding localist models, thus weakening the motivation to seek semidistributed solutions to problems which localist networks already solve.
Before launching into the detail of my remaining argument, I will first signpost what can be expected of the remainder of this target article. This is necessary because, as will be seen, on the way to my conclusion I make some moderately lengthy, but I hope interesting, digressions. These digressions may seem especially lengthy to those for whom mathematical modelling is of little interest. Nonetheless, I hope that the end justifies the means, particularly since the approach adopted here results in a model which is practically equivalent to several mathematical models, but with most of the mathematics taken out.
It seems essential, in writing a paper such as this, to emphasize the positive qualities of localist models as much as to note the shortcomings of their fully distributed counterparts. In the next part of the paper, accordingly, I develop a generalized localist model, which despite its simplicity, is able to exhibit generalization and attractor behaviour, abilities more commonly associated with fully distributed models. This is important because the absence of these abilities is often cited as a reason for rejecting localist models. The generalized localist model is also able to perform stable supervised and unsupervised learning and qualitatively to model effects of age of acquisition, both of which appear difficult for fully distributed models. The model is further shown to exhibit properties compatible with some mathematical formulations of great breadth, such as the Luce choice rule and the ``power law of practice'', thus extending the potential scope of its application.
In later sections I consider why, given the power of localist models, some psychological modellers have been reluctant to use them. These parts of the paper identify what I believe to be common misconceptions in the literature, in particular, those based on conclusions drawn from the domain of neuroscience. Finally, I address some of the problems of a fully distributed approach and identify certain inadequacies in some of the measures that have been proposed to overcome these problems.
In this section I shall describe, in general terms, a localist approach to both the unsupervised learning of representations and the supervised learning of pattern associations. In characterizing such a localist approach I have sought to generalize from a number of different models (e.g. Burton 1994; Carpenter & Grossberg 1987a; 1987b; Foldiak 1991; Kohonen 1984; Murre 1992; Murre, Phaf & Wolters 1992; Nigrin 1993; Rumelhart & Zipser 1986). These models differ in their details but are similar in structure and I shall attempt to draw together the best features of each. The resulting generalized model will not necessarily be immediately applicable to any particular research project but it will, I hope, have sufficient flexibility to be adapted to many modelling situations.
As a first step in building a localist system, I will identify a very simple module
capable of unsupervised, self-organized learning of individual patterns and/or pattern
classes. This work draws heavily on the work of Carpenter and Grossberg and colleagues
(e.g. Carpenter & Grossberg 1987a; 1987b; a debt that is happily acknowledged), with a
number of simplifications. The module (see Figure 3) comprises two layers of nodes, $L_1$
and $L_2$, fully connected to each other by modifiable, unidirectional ($L_1$-to-$L_2$)
connections, which, prior to learning, have small, random weights, $w_{ij}$. (Throughout
the paper, $w_{ij}$ will refer to the weight of the connection from the $i$th node in the
originating layer to the $j$th node in the receiving layer.) For simplicity of exposition,
the nodes in the lower layer will be deemed to be binary, that is, to have activations
(denoted $a_i$) either equal to zero or to one. The extension to continuous activations
will usually be necessary and is easily achieved. The input to the nodes in the upper layer
will simply be the sum of the activations at the lower layer weighted by the appropriate
connection weights. In fact, for illustrative purposes, I shall assume here that this input
to a given node is divided by a value equal to the sum of the incoming weights to that node
plus a small constant (see e.g. Marshall 1990) -- this is just one of the many so-called
``normalization'' schemes typically used with such networks. Thus the input, $I_j$, to an
upper-layer node is given by
$$I_j = \frac{\sum_i w_{ij} a_i}{\left(\sum_i w_{ij}\right) + a} \tag{1}$$
where $a$ (unsubscripted) is the small constant. Learning of patterns of activation at the
lower layer, $L_1$, is simply achieved as follows. When a pattern of activation is
presented at $L_1$, the inputs, $I_j$, to nodes in the upper layer, $L_2$, can be
calculated. Any $L_2$ node whose vector of incoming weights is parallel to (i.e. a constant
multiple of) the vector of activations at $L_1$ will have input, $I_j$, equal to
$\sum_i w_{ij} / \left(\sum_i w_{ij} + a\right)$. Any $L_2$ node whose vector of incoming
weights is orthogonal to the current input vector (that is, nodes for which $w_{ij} = 0$
where $a_i = 1$) will have zero input. Nodes with weight vectors between these two
extremes, whose weight vectors ``match'' the current activation vector to some nonmaximal
extent, will have intermediate values of $I_j$. Let us suppose that, on presentation of a
given $L_1$ pattern, no $L_2$ node achieves an input, $I_j$, greater than a threshold
$\theta$. (With $\theta$ set appropriately, this supposition will hold when no learning has
yet been carried out in the $L_1$-to-$L_2$ connections.) In this case, learning of the
current input pattern will proceed. Learning will comprise setting the weights incoming to
a single currently ``uncommitted'' $L_2$ node (i.e. a node with small, random incoming
weights) to equal the corresponding activations at $L_1$ -- a possible mechanism is
discussed later. The learning rule might thus be stated
$$\frac{dw_{iJ}}{dt} = \lambda \left(a_i - w_{iJ}\right) \tag{2}$$
where $J$ indexes the single $L_2$ node at which learning is being performed, $\lambda$ is
the learning rate and, in the case of so-called ``fast learning'', the weight values reach
their equilibrium values in one learning episode, such that $w_{iJ} = a_i$. The $L_2$ node
indexed by $J$ is thereafter labelled as being ``committed'' to the activation pattern at
$L_1$, and will receive its maximal input on subsequent presentation of that pattern.
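
The module just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in any of the models cited: the function names (`make_module`, `upper_layer_inputs`, `fast_learn`), the range of the initial random weights and the value of the small constant are all hypothetical choices, and learning is taken to be ``fast'' (weights set equal to the activations in a single episode).

```python
import random

def make_module(n_lower, n_upper, seed=0):
    """Create weights[j][i] (lower node i to upper node j) and commitment flags.
    The 0..0.001 range for the initial random weights is an assumption."""
    rng = random.Random(seed)
    weights = [[rng.uniform(0.0, 0.001) for _ in range(n_lower)]
               for _ in range(n_upper)]
    return weights, [False] * n_upper

def upper_layer_inputs(weights, activations, small_constant=0.01):
    """Weighted sum of activations divided by (sum of incoming weights + constant)."""
    inputs = []
    for incoming in weights:
        numerator = sum(w * a for w, a in zip(incoming, activations))
        inputs.append(numerator / (sum(incoming) + small_constant))
    return inputs

def fast_learn(weights, committed, activations, threshold, small_constant=0.01):
    """If no upper node's input exceeds the threshold, commit one uncommitted node.
    Returns the index of the node that learned, or None if no learning occurred."""
    inputs = upper_layer_inputs(weights, activations, small_constant)
    if max(inputs) > threshold:
        return None  # pattern is already adequately represented
    for j, is_committed in enumerate(committed):
        if not is_committed:
            weights[j] = list(activations)  # fast learning: weights copy the pattern
            committed[j] = True
            return j
    return None  # no uncommitted nodes remain

weights, committed = make_module(n_lower=4, n_upper=3)
first = fast_learn(weights, committed, [1, 1, 0, 0], threshold=0.9)
repeat = fast_learn(weights, committed, [1, 1, 0, 0], threshold=0.9)
novel = fast_learn(weights, committed, [0, 0, 1, 1], threshold=0.9)
```

In this sketch the second presentation of the same pattern provokes no learning, because the committed node's input already exceeds the threshold, whereas the orthogonal pattern recruits a fresh node and leaves the first node's weights untouched.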
[Figure 3: image not reproduced in this copy.]
The course of learning is largely determined by the setting of the threshold, $\theta$.
This is closely analogous to the vigilance parameter in the adaptive resonance theory (ART)
networks of Carpenter and Grossberg (1987a; 1987b), one version of which (ART2a, Carpenter,
Grossberg & Rosen 1991) is very similar to the network described here. More particularly,
if the threshold is set very high, for example $\theta = 1$, then each activation pattern
presented will lead to a learning episode involving commitment of a previously uncommitted
$L_2$ node, even if the same pattern has already been learned previously. If the threshold
is set slightly lower, then only activation patterns sufficiently different from previously
presented patterns will provoke learning. Thus novel patterns will come to be represented
by a newly assigned $L_2$ node, without interfering with any learning that has previously
been accomplished. This is a crucial point in the debate between localist and distributed
modellers and concerns ``catastrophic interference'', a topic to which I shall return in
greater depth later.
The value of the threshold, $\theta$, need not necessarily remain constant over time. This
is where the concept of vigilance is useful. At times when the vigilance of a network is
low, new patterns will be unlikely to be learned and responses (see below) will be based on
previously acquired information. In situations where vigilance, $\theta$, is set high,
learning of the current $L_1$ pattern is likely to occur. Thus learning can, to some
extent, be influenced by the state of vigilance in which the network finds itself at the
time of pattern presentation.
In order to make these notions more concrete, it is necessary to describe what constitutes
a response for the network described above. On presentation of a pattern of activation at
$L_1$, the inputs to the $L_2$ nodes can be calculated as above. These inputs are then
thresholded so that the net input, $I_j^{net}$, to the upper-layer nodes is given by
$$I_j^{net} = \max\left(0,\; I_j - \theta\right) \tag{3}$$
so that nodes with $I_j$ less than $\theta$ will receive no net input, other nodes
receiving net input equal to the degree by which $I_j$ exceeds $\theta$. Given these net
inputs, there are several options as to how we might proceed. One option is to allow the
$L_2$ activations to equilibrate via the differential equation
$$\frac{da_j}{dt} = -a_j + f\left(I_j^{net}\right) \tag{4}$$
which reaches equilibrium when $da_j/dt = 0$, that is when $a_j$ is some function,
$f(I_j^{net})$, of the net input. A common choice is to assume that $f$ is the identity
function, so that the activations equal the net inputs. Another option, which will be
useful when some sort of decision process is required, is to allow the $L_2$ nodes to
compete in some way. This option will be developed in some detail in the next section
because it proves to have some very attractive properties. For the moment it is sufficient
to note that a competitive process can lead to the selection of one of the $L_2$ nodes.
This might be achieved by some winner-takes-all mechanism, by which the $L_2$ nodes compete
for activation until one of them quenches the activation of its competitors and activates
strongly itself. Or it may take the form of a ``horse-race'', by which the $L_2$ nodes
race, under the influence of the bottom-up inputs, $I_j^{net}$, to reach a criterion level
of activation, $x$. Either way, in the absence of noise, we will expect the $L_2$ node with
the greatest net input, $I_j^{net}$, to be selected. In the case where the $k$th $L_2$ node
is selected, the pattern at $L_1$ will be deemed to have fallen into the $k$th pattern
class. (Note that in a high-$\theta$ regime there may be as many classes as patterns
presented.) On presentation of a given input pattern, the selection (in the absence of
noise) of a given node indicates that the current pattern is most similar to the pattern
learned by the selected node, and that the similarity is greater than some threshold value.
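
The thresholding step and the noise-free outcome of the competition can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: the function names are hypothetical, the threshold is passed in as a plain number, and the competition is replaced by simply picking the node with the greatest net input, which is the result the text expects in the absence of noise.

```python
def net_inputs(inputs, threshold):
    """Net input is the amount by which each input exceeds the threshold (else zero)."""
    return [max(0.0, value - threshold) for value in inputs]

def select_node(inputs, threshold):
    """Noise-free selection: index of the node with the greatest net input.
    Returns None when no node's input exceeds the threshold."""
    net = net_inputs(inputs, threshold)
    best = max(range(len(net)), key=net.__getitem__)
    return best if net[best] > 0.0 else None

# Three upper-layer nodes; only the second and third exceed a threshold of 0.5.
chosen = select_node([0.2, 0.9, 0.6], threshold=0.5)
nothing = select_node([0.2, 0.3], threshold=0.5)
```

Returning `None` when every net input is zero corresponds to the case in which the input pattern matches no learned pattern closely enough to be classified.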
To summarize, in its noncompetitive form, the network will respond such that the
activations of $L_2$ nodes, in response to a given input ($L_1$) pattern, will equilibrate
to values equal to some function of the degree of similarity between their learned pattern
and the input pattern. In its competitive form the network performs a classification of the
$L_1$ activation pattern, where the classes correspond to the previously learned patterns.
This results in sustained or super-criterion activation ($a_j > x$) of the node that has
previously learned the activation pattern most similar to that currently presented. In both
cases, the network is self-organizing and unsupervised. ``Self-organizing'' refers to the
fact that the network can proceed autonomously, there being, for instance, no separate
phases for learning and for performance. ``Unsupervised'' is used here to mean that the
network does not receive any external ``teaching'' signal informing it how it should
classify the current pattern. As will be seen later, similar networks will be used when
supervised learning is required. In the meantime, I shall simply assume that the selection
of an $L_2$ node will be sufficient to elicit a response associated with that node (cf.
Usher & McClelland 1995).
In this section I will give details of one way in which competition can be introduced into the simple module described above. Although the network itself is simple, I will show that it has some extremely interesting properties relating to choice accuracy and choice reaction-time. I make no claim to be the first to note each of these properties; nonetheless, I believe the power which in combination they afford has either gone unnoticed or has been widely underappreciated.
Competition in the layer $L_2$ is simulated using a standard ``leaky integrator'' model
which describes how several nodes, each driven by its own input signal, activate in the
face of decay and competition (i.e. inhibition) from each of the other nodes:
$$\frac{da_j}{dt} = -A a_j + \left(I_j + N_1\right) + f(a_j) - C \sum_{k \neq j} a_k + N_2 \tag{5}$$
where $A$ is a decay constant, $I_j$ is the excitatory input to the $j$th node which is
perturbed by zero-mean Gaussian noise, $N_1$, with variance $\sigma_1^2$, $f(a_j)$ is a
self-excitatory term, $C \sum_{k \neq j} a_k$ represents lateral inhibition from other
nodes in $L_2$, and $N_2$ represents zero-mean Gaussian noise with variance $\sigma_2^2$.
The value of the noise term $N_1$ remains constant over the time course of a single
competition since it is intended to represent inaccuracies in ``measurement'' of $I_j$. By
contrast, the value of $N_2$ varies with each time step, representing moment-by-moment
noise in the calculation of the derivative. Such an equation has a long history in neural
modelling, featuring strongly in the work of Grossberg from the mid 1960s onwards and later
in, for instance, the cascade equation of McClelland (1979).
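
A competition of this kind can be simulated by simple Euler integration. The sketch below is illustrative only and rests on several assumptions that are not taken from the text: self-excitation is linear in the node's own activation, inhibition is a fixed multiple of the summed activation of the other nodes, activations are floored at zero, a response is made when any activation reaches a fixed criterion, and all parameter names and values are hypothetical. As described above, the input noise is sampled once per competition while the fast-varying noise is resampled at every time step.

```python
import random

def compete(inputs, decay=1.0, self_excitation=0.2, inhibition=1.0,
            input_noise_sd=0.0, step_noise_sd=0.0, criterion=0.5,
            dt=0.01, max_steps=10000, rng=None):
    """Race leaky-integrator nodes to a criterion; return (winner, decision_time)."""
    rng = rng or random.Random(0)
    # Input noise: one sample per node, held constant for the whole competition.
    noisy_inputs = [value + rng.gauss(0.0, input_noise_sd) for value in inputs]
    acts = [0.0] * len(inputs)
    for step in range(1, max_steps + 1):
        total = sum(acts)
        updated = []
        for act, drive in zip(acts, noisy_inputs):
            derivative = (-decay * act                  # passive decay
                          + drive                       # noisy bottom-up input
                          + self_excitation * act       # assumed linear self-excitation
                          - inhibition * (total - act)  # inhibition from other nodes
                          + rng.gauss(0.0, step_noise_sd))  # fast-varying noise
            updated.append(max(0.0, act + dt * derivative))  # assumed floor at zero
        acts = updated
        winner = max(range(len(acts)), key=acts.__getitem__)
        if acts[winner] >= criterion:
            return winner, step * dt
    return None, max_steps * dt

easy_winner, easy_time = compete([1.0, 0.2])
hard_winner, hard_time = compete([1.0, 0.9])
```

With the noise switched off, the node with the larger input wins in both runs, and the decision takes longer when the competing inputs are closer together, which is the qualitative link between choice difficulty and reaction time that such a model is meant to capture.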
Recently, Usher and McClelland (1995) have used such an equation to model the time-course
of perceptual choice. They show that, in simulating various two-alternative forced choice
experiments, the above equation subsumes optimal classical diffusion processes (e.g.
Ratcliff 1978) when a response criterion is placed on the difference between the
activations, $a_j$, of two competing nodes. Moreover, they show that near optimal
performance is exhibited when a response criterion is placed on the absolute value of the
activations (as opposed to the difference between them) in cases where, as here, mutual
inhibition is assumed. The latter case is easily extensible to multiway choices. This
lateral inhibitory model therefore simulates the process by which multiple nodes, those in
$L_2$, can activate in response to noisy, bottom-up, excitatory signals, $I_j$, and compete
until one of the nodes reaches a response criterion based on its activation alone. Usher
and McClelland (1995) have thus shown that a localist model can give a good account of the
time course of multiway choices.
Another interesting feature of the lateral inhibitory equation concerns the accuracy with
which it is able to select the appropriate node (preferably that with the largest bottom-up
input, $I_j$) in the presence of input noise, $N_1$, and fast-varying noise, $N_2$.
Simulations show that in the case where the variances of the two noise terms are
approximately equal, the effects of the input noise, $N_1$, dominate -- this is simply
because the leaky integrator tends to act so as to filter out the effects of the
fast-varying noise, $N_2$. As a result, the competitive process tends to ``select'' that
node which receives the maximal noisy-input, $(I_j + N_1)$. This process, by which a node
is chosen by adding zero-mean Gaussian noise to its input term and picking the node with
the largest resultant input, is known as a Thurstonian process (Thurstone 1927, Case V).
Implementing a Thurstonian process with a lateral-inhibitory network of leaky integrators,
as above, rather than by simple computation, allows the dynamics as well as the accuracy of
the decision process to be simulated.
The fact that the competitive process is equivalent to a classical Thurstonian
(noisy-pick-the-biggest) process performed on the inputs, $I_j$, is extremely useful
because it allows us to make a link with the Luce choice rule (Luce 1959), ubiquitous in
models of choice behaviour. This rule states that, given a stimulus indexed by $i$, a set
of possible responses indexed by $j$, and some set of similarities, $\eta_{ij}$, between
the stimulus and those stimuli associated with each member of the response set, then the
probability of choosing any particular response $J$ when presented with a stimulus $i$ is
$$P(J \mid i) = \frac{\eta_{iJ}}{\sum_k \eta_{ik}} \tag{6}$$
Naturally this ensures that the probabilities add to one across the whole response set.
Shepard (1958; 1987) has proposed a law of generalization that states, in this context,
that the similarities of two stimuli are an exponential function of the distance between
those two stimuli in a multidimensional space. (The character of the multidimensional
space that the stimuli inhabit can be revealed by multidimensional scaling applied to the
stimulus-response confusion matrices.) Thus
$$\eta_{ij} = e^{-d_{ij}}$$
where
$$d_{ij} = \left( \sum_{m=1}^{M} c_m \left| x_{im} - x_{jm} \right|^{r} \right)^{1/r} \tag{7}$$
Here the distance is measured in $M$-dimensional space, $x_{im}$ represents the coordinate
of stimulus $i$ along the $m$th dimension and $c_m$ represents a scaling parameter for
distances measured along dimension $m$. The scaling parameters simply weight the
contributions of different dimensions to the overall distance measure, much as one might
weight differently various factors such as reliability and colour when choosing a new car.
Equation 7 is known as the Minkowski power model formula and reduces to the ``city-block''
distance for $r = 1$ and the Euclidean distance for $r = 2$, these two measures being those
most commonly used.
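
As a concrete illustration, the Luce choice rule over exponentiated Minkowski distances can be computed as below. This is a sketch under assumptions: the function names are hypothetical, and the scaling parameters are taken to multiply each dimension's contribution inside the sum before the root is taken, which is only one of the ways such parameters are placed in the literature.

```python
import math

def minkowski_distance(x, y, scales, r=2.0):
    """Scaled Minkowski distance; r=1 gives city-block, r=2 gives Euclidean."""
    total = sum(c * abs(xi - yi) ** r for c, xi, yi in zip(scales, x, y))
    return total ** (1.0 / r)

def luce_probabilities(stimulus, exemplars, scales, r=2.0):
    """Choice probabilities from exponential similarity and the Luce ratio rule."""
    similarities = [math.exp(-minkowski_distance(stimulus, e, scales, r))
                    for e in exemplars]
    total = sum(similarities)
    return [s / total for s in similarities]

euclidean = minkowski_distance((0.0, 0.0), (3.0, 4.0), scales=(1.0, 1.0), r=2.0)
city_block = minkowski_distance((0.0, 0.0), (3.0, 4.0), scales=(1.0, 1.0), r=1.0)
probs = luce_probabilities((0.0, 0.0), [(1.0, 0.0), (0.0, 2.0)], scales=(1.0, 1.0))
```

In the last line the test stimulus is closer to the first exemplar than to the second, so the first response receives the larger probability, and the probabilities sum to one by construction.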
So how does the Luce choice rule acting over exponentiated distances relate to the
Thurstonian (noisy-choice) process described above? The illustration is easiest to perform
for the case of two-alternative choice, and is the same as that found in McClelland (1991).
Suppose we have a categorization experiment in which a subject sees one exemplar of a
category A and one exemplar of a category B. We then present the subject with a test
exemplar, $T$, and ask them to decide whether it should be categorized as being from
category A or from category B. Suppose further that each of the three exemplars can be
represented by a point in an appropriate multidimensional space such that the test exemplar
lies at a distance $d_A$ from the A exemplar and $d_B$ from the B exemplar. This situation
is illustrated for a two-dimensional space in Figure 4. Note that coordinates on any given
dimension are given in terms of the relevant scaling parameters, with increase of a given
scaling parameter resulting in an increase in magnitude of distances measured along that
dimension and contributed to the overall distance measures $d_A$ and $d_B$. (It is for this
reason that an increase of scaling parameter along a given dimension is often described as
a stretching of space along that dimension.) The Luce choice rule with exponential
generalization implies that the probability of placing the test exemplar in category A is
$$p(A \mid T) = \frac{e^{-d_A}}{e^{-d_A} + e^{-d_B}} \tag{8}$$
Dividing through by $e^{-d_A}$ gives
$$p(A \mid T) = \frac{1}{1 + e^{d_A - d_B}} \tag{9}$$
which equals 0.5 when $d_A = d_B$. This function is called the logistic function and it is
extremely similar to the function describing the (scaled) cumulative area under a Normal
(i.e. Gaussian) curve. This means that there is a close correspondence between the two
following procedures for probabilistically picking one of two responses at distances $d_A$
and $d_B$ from the current stimulus: one can either add Gaussian noise to $d_A$ and $d_B$
and pick the category corresponding to, in this case, the smallest resulting value (a
Thurstonian process); or one can exponentiate the negative distances and pick using the
Luce choice rule. The two procedures will not give identical results, but in most
experimental situations will be indistinguishable (e.g. Kornbrot 1978; Luce 1959; Nosofsky
1985; van Santen & Bamber 1981; Yellott 1977). (In fact, if the noise is double exponential
rather than Gaussian the correspondence is exact, see Yellott 1977.)
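
The closeness of the two procedures can be checked numerically for the two-alternative case. The sketch below compares the logistic probability given by the Luce rule with the probability that a Gaussian noisy-choice process picks the nearer exemplar. The function names are hypothetical, and the noise standard deviation is an assumption: it is chosen so that the difference of the two noise terms has standard deviation 1.702, a scaling constant commonly used when approximating the logistic function by a Normal cumulative distribution, and not a value taken from the text.

```python
import math

def luce_p_a(d_a, d_b):
    """Luce rule with exponential similarity: a logistic function of d_b - d_a."""
    return 1.0 / (1.0 + math.exp(d_a - d_b))

def thurstone_p_a(d_a, d_b, noise_sd):
    """P(d_a + n1 < d_b + n2) for independent zero-mean Gaussian noise n1, n2."""
    z = (d_b - d_a) / (noise_sd * math.sqrt(2.0))  # difference noise has sd * sqrt(2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard Normal CDF at z

NOISE_SD = 1.702 / math.sqrt(2.0)  # assumed value; see the lead-in above

# Largest disagreement between the two rules over a range of distance differences.
max_gap = max(abs(luce_p_a(0.0, diff / 10.0) - thurstone_p_a(0.0, diff / 10.0, NOISE_SD))
              for diff in range(-60, 61))
```

Under these assumptions the two choice probabilities never differ by as much as 0.01 over the range tested, which gives a sense of why the two accounts are hard to separate experimentally.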
[Figure 4: image not reproduced in this copy.]
The consequences for localist modelling of this correspondence, which extends to
multichoice situations, are profound. Two things should be borne in mind. First, that a
point in multidimensional space can be represented by a vector of activations across a
layer of nodes, say the $L_1$ layer of the module discussed earlier, and/or by a vector of
weights, perhaps those weights connecting the set of $L_1$ nodes to a given $L_2$ node.
Second, that taking two nodes with activations $d_A$ and $d_B$, adding zero-mean Gaussian
noise and picking the node with the smallest resulting activation is equivalent, in terms
of the node chosen, to taking two nodes with activations $(k - d_A)$ and $(k - d_B)$ (where
$k$ is a constant), adding the same zero-mean Gaussian noise and this time picking the node
with the largest activation. Consequently, suppose that we have two $L_2$ nodes, such that
each has a vector of incoming weights corresponding to the multidimensional vector
representing one of the training patterns, one node representing pattern $A$, the other
node representing pattern $B$. Further, suppose that on presentation of the pattern of
activations corresponding to the test pattern, $T$, at layer $L_1$, the inputs, $I_j$, to
the two nodes are equal to $(k - d_A)$ and $(k - d_B)$ respectively (this is easily
achieved). Under these circumstances, a Thurstonian process, like that described above,
which noisily picks the $L_2$ node with the biggest input, will give a pattern of
probabilities of classifying the test pattern either as an $A$ or, alternatively, as a $B$,
which is in close correspondence to the pattern of probabilities that would be obtained by
application of the Luce choice rule to the exponentiated distances, $e^{-d_A}$ and
$e^{-d_B}$ (Equation 9).
This correspondence means that mathematical models, such as Nosofsky's generalized context
model (Nosofsky 1986), can be condensed, doing away with the stage in which distances are
exponentiated, and the stage in which these exponentiated values are manipulated, in the
Luce formulation, to produce probabilities (a stage for which, to my knowledge, no simple
``neural'' mechanism has been suggested$^1$), leaving a basic Thurstonian noisy-choice
process, like that above, acting on (a constant minus) the distances themselves. Since the
generalized context model (Nosofsky 1986) is, under certain conditions, mathematically
equivalent to the models of Estes (1986), Medin and Schaffer (1978), Oden and Massaro
(1978), Fried and Holyoak (1984) and others (see Nosofsky 1990), all these models can,
under those conditions, similarly be approximated to a close degree by the simple localist
connectionist model described here. A generalized exemplar model is obtained under
high-$\theta$ (i.e. high threshold or high vigilance) conditions, when all training
patterns are stored as weight vectors abutting a distinct $L_2$ node (one per node).
We can now raise the question of what happens when, in the simple two-choice categorization
experiment discussed above, the subject sees multiple presentations of each of the example
stimuli before classifying the test stimulus. To avoid any ambiguity I will describe the
two training stimuli as ``exemplars'' and each presentation of a given stimulus as an
``instance'' of the relevant exemplar. For example, in a given experiment a subject might
see ten instances of a single A-category exemplar and ten instances of a single B-category
exemplar. Let us now assume that a high-$\theta$ classification network assigns a distinct
$L_2$ node to each of the ten instances of the category A exemplar and to each of the ten
instances of the category B exemplar. There will thus be twenty nodes racing to classify
any given test stimulus. For simplicity, we can assume that the learned, bottom-up weight
vectors to $L_2$ nodes representing instances of the same exemplar are identical. On
presentation of the test stimulus, which lies at distance $d_A$ (in multidimensional space)
from instances of the category A exemplar and distance $d_B$ from instances of the category
B exemplar, the inputs to the $L_2$ nodes representing category A instances will be
$(k - d_A) + N_{1j}$, where $N_{1j}$ is, as before, the input-noise term and the subscript
$j$ ($1 \le j \le 10$) indicates that the zero-mean Gaussian noise term will have a
different value for each of the ten nodes representing different instances. The $L_2$ nodes
representing instances of the category B exemplar will likewise have inputs equal to
$(k - d_B) + N_{1j}$ for $11 \le j \le 20$. So what is the effect on performance of adding
these extra instances of each exemplar? The competitive network will once again select the
node with the largest noisy input. It turns out that, as more and more instances of each
exemplar are added, two things happen: firstly, the noisy-pick-the-biggest process becomes
an increasingly better approximation to the Luce formulation, until for an asymptotically
large number of instances the correspondence is exact; secondly, performance, as measured
by the probability of picking the category whose exemplar falls closest to the test
stimulus, improves, a change which is equivalent to stretching the multidimensional space
in the Luce formulation by increasing by a common multiplier the values of all the scaling
parameters $c_m$ in the distance calculation given in Equation 7. For the mathematically
inclined, I note that both these effects come about due to the fact that the maximum value
out of $n$ samples from a Gaussian distribution is itself cumulatively distributed as a
double exponential, $\exp(-e^{-\alpha x})$, where $\alpha$ is a constant. The distribution
of differences between two values drawn from the corresponding density function is a
logistic function, comparable with that implicit in the Luce choice rule (for further
details see e.g. Yellott 1977; Page & Nimmo-Smith in preparation).
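
The improvement in performance with additional instances can be illustrated by a small Monte Carlo simulation of the noisy-pick-the-biggest process. This sketch is not a fit to any data: the function name, the distances, the noise standard deviation, the constant offset and the number of trials are all hypothetical, and the competition is replaced by a direct comparison of the largest noisy inputs in each category.

```python
import random

def p_choose_a(d_a, d_b, n_instances, noise_sd=1.0, trials=20000,
               seed=0, offset=10.0):
    """Estimate P(category A wins) when each category has n_instances nodes.
    Each node's input is (offset - distance) plus its own Gaussian noise sample."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        best_a = max(offset - d_a + rng.gauss(0.0, noise_sd)
                     for _ in range(n_instances))
        best_b = max(offset - d_b + rng.gauss(0.0, noise_sd)
                     for _ in range(n_instances))
        if best_a > best_b:  # the node with the largest noisy input is selected
            wins += 1
    return wins / trials

# The test stimulus is closer to the A exemplar (1.0) than to the B exemplar (1.5).
p_single = p_choose_a(1.0, 1.5, n_instances=1)
p_ten = p_choose_a(1.0, 1.5, n_instances=10)
```

Under these assumed settings the nearer category is chosen more often with ten instances per exemplar than with one, in line with the claim that adding instances acts like a uniform stretching of the stimulus space.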
To summarize, for single instances of each exemplar in a categorization experiment,
performance of the Thurstonian process is a good enough approximation to that produced by
the Luce choice rule so as to make the two difficult to distinguish by experiment. As more
instances of the training exemplars are added to the network, the Thurstonian process
makes a rapid approach towards asymptotic performance that is precisely equivalent to that
produced by application of the Luce choice rule to a multidimensional space that has been
uniformly ``stretched'' (by increasing by a common multiplier the values of all the
scaling parameters
in Equation 7)
relative to that space (i.e. that set of scaling parameters) that might have been inferred
from application of the same choice rule to the pattern of responses found after only a single
instance of each exemplar had been presented. It should be noted that putting multiple
noiseless instances into the Luce choice rule will not produce an improvement in
performance relative to that choice rule applied to single instances -- in mathematical
terminology, the Luce choice rule is insensitive to uniform expansion of the set (Yellott
1977).
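The insensitivity of the Luce choice rule to uniform expansion of the set can be checked directly. The sketch below assumes the summed-similarity form of the rule used in exemplar models, with similarity given by `exp(-scale * distance)`; the function name, the `scale` argument (standing in for a common multiplier on the scaling parameters) and the distances are hypothetical.

```python
import math

def luce_choice(distances_by_category, scale=1.0):
    """Luce choice probabilities from exemplar-to-stimulus distances.

    distances_by_category: dict mapping a category label to a list of distances,
                           one per stored instance (hypothetical values).
    scale: assumed common multiplier applied to every distance.
    """
    strengths = {
        label: sum(math.exp(-scale * d) for d in distances)
        for label, distances in distances_by_category.items()
    }
    total = sum(strengths.values())
    return {label: s / total for label, s in strengths.items()}

single = luce_choice({"A": [1.0], "B": [2.0]})
tenfold = luce_choice({"A": [1.0] * 10, "B": [2.0] * 10})
stretched = luce_choice({"A": [1.0], "B": [2.0]}, scale=2.0)
print(single["A"], tenfold["A"], stretched["A"])
```

Duplicating every noiseless instance leaves the probabilities unchanged, whereas increasing the common multiplier raises the probability of choosing the nearer category.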
Simulations using the Luce formulation (e.g. Nosofsky 1987) have typically used uniform
multiplicative increases in the values of the dimensional scaling parameters (those
in Equation 7) to account for performance
improvement over training blocks. The Thurstonian process described here, therefore, has
the potential advantage that this progressive space-stretching is a natural feature of the
model as more instances are learned. Of course, until the model has been formally fitted
to experimental data the suggestion of such an advantage must remain tentative -- indeed
early simulations of data from Nosofsky (1987) suggest that some parametric stretching of
stimulus space is still required to maintain the excellent model-to-data fits that
Nosofsky achieved (Page & Nimmo-Smith in preparation). Nonetheless, the present
Thurstonian analysis potentially unites a good deal of data, as well as raising a
fundamental question regarding Shepard's ``universal law of generalization'': could it be
that the widespread success encountered when a linear response rule (Luce) is applied to
representations with exponential generalization-gradients in multidimensional
stimulus-space, is really a consequence of a Thurstonian decision process acting on an
exemplar model, in which each of the exemplars actually responds with a linear
generalization gradient? It is impossible in principle to choose experimentally between
these two characterizations for experiments containing a reasonable number of instances of
each exemplar; the Thurstonian (noisy-pick-the-biggest) approach has the advantage that
its ``neural'' implementation is, it appears, almost embarrassingly simple.
On the basis of the work described above, we can conclude that a simple localist,
competitive model is capable of modelling data relating to both choice reaction-time and
choice probability. In this section I will make a further link with the so-called
``power-law of practice''. There is a large amount of data which support the conclusion
that reaction-time varies as a power function of practice, that is
$RT = a + bN^{-c}$, where $N$ is the number of previous practice
trials, and $a$, $b$ and $c$ are positive constants
(see Logan 1992 for a review). In a number of papers, Logan (1988; 1992) has proposed that
this result can be modelled by making the following assumptions. First, each experience
with a stimulus pattern is obligatorily stored in memory and associated with the
appropriate response. Second, on later presentation of that stimulus pattern, all
representations of that stimulus race with each other to reach a response criterion, the
response time being the time at which the first of the representations reaches that
criterion. The distribution of times-to-criterion for a given representation is assumed to
be the same for all representations of a given stimulus, and independent of the number of
such representations. Logan has shown that if the time-to-criterion for a given stimulus
representation is distributed as a Weibull function (which, within the limits of
experimental error, it is -- see Logan 1992), then, using the minimum-value theorem, the
distribution of first-arrival times for a number of such representations is a related
Weibull function, giving a power function of practice. He has collected a variety of data
which indicate that this model gives a good fit to data obtained from practiced tasks, and
thus a good fit to the power law of practice (e.g. Logan 1988; 1992). At several points,
Logan makes analogies with the exemplar models of categorization and identification
discussed above but does not develop the point further.
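The race among stored instances can be sketched in a few lines. The code below is a hedged illustration rather than Logan's own implementation: it assumes Weibull-distributed times-to-criterion with a hypothetical shape of 2 and unit scale, and it covers only the decreasing part of the practice function (no asymptote term).

```python
import random

def mean_first_arrival(n_instances, shape=2.0, scale=1.0, trials=5000, seed=1):
    """Mean time of the fastest of n_instances independent Weibull races."""
    rng = random.Random(seed)
    total = 0.0
    for _trial in range(trials):
        # The response time is the first arrival among all stored instances.
        total += min(rng.weibullvariate(scale, shape) for _instance in range(n_instances))
    return total / trials

for n in (1, 4, 16):
    print(n, mean_first_arrival(n))
```

For a Weibull distribution with shape `k`, the minimum of `N` samples is again Weibull with its scale multiplied by `N ** (-1 / k)`, so under these assumptions the mean first-arrival time falls as a power function of `N` with exponent `1 / k`.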
This link, between Logan's instance model and exemplar models of categorization, has, however, been most fully developed in two recent papers by Robert Nosofsky and Thomas Palmeri (Nosofsky & Palmeri 1997; Palmeri 1997). While they draw on Logan's instance theory of reaction-time (RT) speed-up, they identify a flaw in its behaviour. Their criticism involves observing that Logan's theory does not take stimulus similarity into account in its predictions regarding RT. To illustrate their point, suppose we have trained our subject, as above, with ten instances of a category-A exemplar and ten instances of a category-B exemplar. Consistent with Logan's instance theory, had we tested performance on the test exemplar at various points during training we would have observed some RT speed-up -- the RT is decided by the time at which the first of the multiple stored instances crosses the winning line in a horse-race to criterion. The time taken for this first-crossing decreases as a power-law of the number of instances. But what happens if we now train with one or more instances of an exemplar which is very similar to the category-A exemplar, but is indicated as belonging to category-B? In Logan's model, the addition of this third exemplar can only speed up responses when the category-A exemplar is itself presented as the test stimulus. But intuitively, and actually in the data, such a manipulation has the opposite effect, that is, it slows responses.
Nosofsky and Palmeri (1997) solve this problem by introducing the exemplar-based random walk model (EBRW). In this model, each training instance is represented in memory, just as in Logan's model. Unlike in Logan's model, however, the response to a given stimulus is not given by a single race-to-criterion. For exemplar similarities to have the appropriate effect, two changes are made: First, each exemplar races with a speed proportional to its similarity to the test stimulus. Second, the result of a given race-to-criterion does not have an immediate effect on the response but instead drives a random-walk category-decision process similar to that found in Ratcliff's (1978) diffusion model -- multiple races-to-criterion are held consecutively, with their results accumulating until a time when the cumulative number of results indicating one category exceeds the cumulative number indicating the other category by a given margin. Briefly, as the number of instances increases, each of the races-to-criterion takes a shorter time, as in the Logan model; to the extent that the test stimulus falls clearly into one category rather than another, the results of consecutive races will be consistent in indicating the same category and the overall response criterion will be more quickly reached.
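A simplified random-walk sketch shows why a similar exemplar from the opposing category slows responses. It assumes exponentially distributed race times with rate equal to similarity, so that the fastest finisher in each race arrives after an exponential time with the summed rate and belongs to category A with probability proportional to A's summed similarity. The similarity values, the margin and all names are hypothetical, and no per-step constant is included.

```python
import random

def ebrw_trial(similarities, labels, margin, rng):
    """One random-walk decision; returns (chosen category, elapsed time)."""
    total_rate = sum(similarities)
    rate_a = sum(s for s, label in zip(similarities, labels) if label == "A")
    counter, elapsed = 0, 0.0
    while abs(counter) < margin:
        # Duration of one race: the minimum of independent exponential race times.
        elapsed += rng.expovariate(total_rate)
        # The race winner's category moves the counter one step.
        counter += 1 if rng.random() < rate_a / total_rate else -1
    return ("A" if counter > 0 else "B"), elapsed

def mean_rt(similarities, labels, margin=3, trials=2000, seed=2):
    rng = random.Random(seed)
    return sum(ebrw_trial(similarities, labels, margin, rng)[1] for _ in range(trials)) / trials

# Test stimulus is the category-A exemplar: ten A instances, ten distant B instances.
base_sims = [1.0] * 10 + [0.05] * 10
base_labels = ["A"] * 10 + ["B"] * 10
# Add five instances of a B-labelled exemplar that is very similar to the test stimulus.
extra_sims = base_sims + [0.9] * 5
extra_labels = base_labels + ["B"] * 5
print(mean_rt(base_sims, base_labels), mean_rt(extra_sims, extra_labels))
```

With these assumed values the mean decision time is longer after the similar category-B instances are added, because consecutive races are less consistent and more steps are needed to reach the margin.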
I now suggest an extension of the Thurstonian model developed earlier which accounts,
qualitatively at least, for the data discussed by Nosofsky and Palmeri (1997) and
addresses the problems inherent in Logan's instance model. The network is depicted in
Figure 5. The activation pattern across the lower layer
is equivalent to the vector in multidimensional space representing the current test
stimulus. Each node in the middle layer represents a previously learned
instance and activates to a degree inversely and linearly related to the distance between
that learned instance and the current test stimulus (as before), plus zero-mean Gaussian
noise. The third layer of nodes contains a node for each of the
possible category responses; middle-layer nodes are linked by a connection of unit
strength to the appropriate third-layer response node and to no other. The effective
input to each of the third-layer nodes is given by the maximum activation value
across the connected middle-layer nodes. Thus the input driving a category-A
response will be the maximum activation across those middle-layer nodes associated with
a category-A response. The response nodes in the third layer compete using lateral-inhibitory
leaky-integrator dynamics, as before. This competition, in the absence of large amounts of
fast-varying noise, will essentially pick the response with the largest input, thus
selecting, as before, the category associated with the middle-layer instance node with the
largest noise-perturbed activation. The selection process therefore gives identical
results, in terms of accuracy of response, to the simple Thurstonian model developed
earlier and hence maintains its asymptotic equivalence with the Luce choice rule. The
reaction time that elapses before a decision is made depends on two things: the number of
instance representations available in the middle layer; and the strength of competition
from alternative responses. As more instances become represented in the middle layer, the
maximum value of the noisy activations creeps up, according to the maximum-value theorem,
thus speeding highly practiced responses. To the extent that a given test stimulus falls
near to instances of two different categories, the lateral inhibitory signals experienced
by the response node that is eventually selected will be higher, thus delaying the
response for difficult-to-make decisions, as required by the data.
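The qualitative behaviour of this network can be sketched as follows. The code is an assumption-laden illustration, not the simulation code behind the figures: the particular leaky-integrator equation, the Euler step, the response criterion and every parameter value are hypothetical choices made so that the example runs.

```python
import random

def simulate_trial(dist_by_category, n_instances, rng, noise_sd=0.1,
                   leak=1.0, inhibition=1.0, criterion=0.6,
                   dt=0.01, max_time=20.0):
    """One trial; returns (winning category or None, decision time)."""
    # Instance nodes: activation falls linearly with distance, plus Gaussian noise.
    # Each response node is driven by the maximum over its own instance nodes.
    inputs = {
        label: max(1.0 - dist + rng.gauss(0.0, noise_sd) for _ in range(n_instances))
        for label, dist in dist_by_category.items()
    }
    acts = {label: 0.0 for label in inputs}
    t = 0.0
    while t < max_time:
        total = sum(acts.values())
        # Leaky integration with lateral inhibition from rival response nodes,
        # with activations clipped at zero.
        acts = {
            label: max(0.0, x + dt * (-leak * x + inputs[label] - inhibition * (total - x)))
            for label, x in acts.items()
        }
        t += dt
        winner = max(acts, key=acts.get)
        if acts[winner] >= criterion:
            return winner, t
    return None, max_time

def mean_rt(dist_by_category, n_instances, trials=300, seed=0):
    rng = random.Random(seed)
    return sum(simulate_trial(dist_by_category, n_instances, rng)[1] for _ in range(trials)) / trials

print(mean_rt({"A": 0.0, "B": 0.8}, 1), mean_rt({"A": 0.0, "B": 0.8}, 20))
print(mean_rt({"A": 0.0, "B": 0.8}, 5), mean_rt({"A": 0.0, "B": 0.2}, 5))
```

Under these assumptions, mean decision time shortens as instances are added and lengthens when the competing category lies close to the test stimulus, which are the two effects described above.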
Does the reaction-time speed-up with practice exhibited by this model fit the observed
power law? I have done many simulations, under a variety of conditions, all of which
produced the pattern of results shown in the graph in Figure 6.
The graph plots the mean reaction time, taken over 1000 trials, against number of
instances of each response category. As can be seen the speed-up in reaction time with
practice is fitted very well by a power function of practice. The fact
that the time axis can be arbitrarily scaled, and the exponent of the power curve can be
fitted by varying the signal-to-noise ratio on the input signals, bodes well for the
chances of fitting this Thurstonian model to the practice data -- this is the subject of
current work. We know already that this signal-to-noise ratio also determines the accuracy
with which the network responds, and the speed with which this accuracy itself improves
with practice. While it is clear that it will be possible to fit either the accuracy
performance or the RT performance with a given set of parameters, it remains to be seen
whether a single set of parameters will suffice for fitting both simultaneously.
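One simple way to check whether a set of mean reaction times follows a power function is a straight-line fit in log-log coordinates. The sketch below assumes no asymptote term (that is, RT proportional to a power of the number of instances); the function name and the data are hypothetical, and this is not necessarily the fitting procedure used for Figure 6.

```python
import math

def fit_power_law(ns, rts):
    """Least-squares fit of rt = b * n ** (-c) on log-log axes; returns (b, c)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(rt) for rt in rts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

ns = [1, 2, 4, 8, 16]
rts = [2.0 * n ** -0.5 for n in ns]  # synthetic data with known b and c
print(fit_power_law(ns, rts))
```

A fit with a non-zero asymptote would need a nonlinear method; the log-log version is shown only because it makes the role of the exponent easy to see.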
To summarize this section of the paper, I have illustrated how a simple, localist, lateral inhibitory network, fed by appropriate noisy bottom-up inputs delivered by some, presumably hierarchical, feature-extraction process, inherits considerable power from its close relationship to a number of classic mathematical models of behaviour. The network implements a Thurstonian choice-process which gives accuracy results which are indistinguishable (at asymptote) from those produced by application of the Luce choice rule to representations with exponential generalization gradients. It shows accuracy which improves with an increase in the number of training instances, equivalent in the Shepard-Luce-Nosofsky formulation to the uniform stretching of multidimensional space. With regard to reaction time, the network's RT distributions are very similar to those produced by Ratcliff's (1978) diffusion model (see Usher & McClelland 1995) and to those found in experimental data. The RTs reflect category similarities and are speeded as a power-law with practice. This simple localist model thus provides a qualitative synthesis of a large range of data and offers considerable hope that this breadth of coverage will be maintained when quantitative fitting is attempted.
To this point I have not, at times, made a clear distinction between supervised and unsupervised learning. Behind this conflation is the belief that the mechanisms underlying the two types of behaviour largely overlap -- more particularly, that unsupervised learning is a necessary component of supervised, or association, learning. This will, I hope, become clearer later. First I shall describe qualitatively how variants of the localist model discussed above can exhibit a number of behaviours more commonly associated with distributed models, and at least one behaviour which has proved difficult to model.
It is often stated as one of the advantages of networks using distributed representations that they permit generalization, which means that they are able to deal appropriately with patterns of information they have not previously experienced by extrapolating from those patterns which they have experienced and learned. In a similar vein, such networks have been said to be robust, their performance worsening only gradually in the presence of noisy or incomplete input. Generalization and robustness are essentially the same thing: both refer to the networks' ability to deal with inputs which only partially match previous experience. One very common inference has been that networks using localist representations do not share these abilities. In this section I show that this inference is unjustified.
First, the previous sections have illustrated how one of the most resilient
mathematical models of the stimulus-response generalization process can be cast in the
form of a simple localist network. Put simply, in the face of a novel, or noisy, stimulus,
the input signals to a layer of nodes, whose weights encode patterns of activation
encountered in previous experiences, will reflect, in a graded fashion, the degree of
similarity that the current input shares with each of those learned patterns. If the
current pattern is not sufficiently similar to any learned pattern to evoke
super-threshold input, then no generalization will be possible, but to the extent that
similarities exist, the network can choose between competing classifications/responses on
the basis developed above. It will be possible to vary the breadth of generalization that
can be tolerated by varying the input threshold. Thus if no
node receives super-threshold input, yet generalization is required, the threshold can
simply be dropped until input, upon which a response can be based, is forthcoming.
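The threshold-lowering idea can be sketched as follows. The similarity measure (one minus the mean absolute feature difference), the step size and the stored patterns are all hypothetical choices for illustration.

```python
def similarity(pattern, stored):
    """Graded match between an input and a stored pattern (features in [0, 1])."""
    return 1.0 - sum(abs(p - s) for p, s in zip(pattern, stored)) / len(stored)

def respond(pattern, memory, threshold):
    """Return the best-matching stored item, or None if no input exceeds threshold."""
    scores = {name: similarity(pattern, stored) for name, stored in memory.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None

def respond_with_relaxation(pattern, memory, threshold, step=0.05, floor=0.0):
    """Lower the threshold stepwise until some node receives super-threshold input."""
    while threshold >= floor:
        choice = respond(pattern, memory, threshold)
        if choice is not None:
            return choice, threshold
        threshold -= step
    return None, floor

memory = {"cat": [1, 0, 1, 0], "dog": [0, 1, 0, 1]}
novel = [0.6, 0.4, 0.6, 0.4]
print(respond(novel, memory, threshold=0.9))
print(respond_with_relaxation(novel, memory, threshold=0.9))
```

With a strict threshold the novel pattern evokes no response; once the threshold has been lowered far enough, the best-matching stock response is produced.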
Of course, the stimulus-response model described above only generalizes in the sense that it generates the most appropriate of its stock of previously learned responses on presentation of an unfamiliar stimulus. This type of generalization will not always be appropriate: imagine a localist system for mapping orthography to phonology, in which each familiar word is represented by a node which alone activates sufficiently in the presence of that word to drive a representation of that word's phonology. Would this system exhibit generalization on presentation of a novel orthographic string (i.e. a nonword)? Only in the sense that it would output the phonology of the word that best matched the unfamiliar orthographic input. This is not the sort of generalization that human readers perform in these circumstances; they are content to generate novel phonemic output in response to novel orthographic input. The localist approach to simulating this latter ability relies on the fact that in quasiregular mappings, like that between orthography and phonology, in which both the input pattern (i.e. the letter string) and the output pattern (i.e. the phonemic string) are decomposable into parts, and in which each orthographic part has a corresponding phonemic part with which it normatively corresponds, the localist model can perform generalization by input decomposition and output assembly. Specifically, although the unfamiliar orthographic string cannot activate a localist representation of the complete nonword (since by definition no such representation exists), it can lead to activation in localist units representing orthographic subparts, such as onset cluster, rime, vowel, coda etc., and each of these can in turn activate that portion of the phonological output pattern with which it is most usually associated. 
This idea of generalization by input decomposition and output assembly for nonwords, supplemented by a dominant, but not exclusive, direct route for known words is, of course, the strategy used by many localist modellers of single-word reading (Coltheart, Curtis, Atkins & Haller 1993; Norris 1994a; Zorzi, Houghton & Butterworth 1998).
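Generalization by input decomposition and output assembly can be illustrated with a toy two-route reader. Everything here is hypothetical: the tiny lexicon, the onset and rime tables, the ASCII phoneme codes and the rule for splitting a string at its first vowel. The direct route is also made exclusive for known words, which is a simplification of the models cited.

```python
# Whole-word (direct) route: localist word nodes mapped to pronunciations.
LEXICON = {"pint": ["p", "aI", "n", "t"]}
# Sublexical route: localist units for orthographic subparts.
ONSETS = {"p": ["p"], "m": ["m"], "z": ["z"], "": []}
RIMES = {"int": ["I", "n", "t"], "ave": ["eI", "v"]}
VOWELS = set("aeiou")

def split_onset_rime(letters):
    """Split a letter string at its first vowel letter."""
    for i, ch in enumerate(letters):
        if ch in VOWELS:
            return letters[:i], letters[i:]
    return letters, ""

def read_aloud(letters):
    """Known words use the direct route; other strings are assembled from parts."""
    if letters in LEXICON:
        return LEXICON[letters]
    onset, rime = split_onset_rime(letters)
    if onset in ONSETS and rime in RIMES:
        # Each subpart contributes the phonology with which it is most usually associated.
        return ONSETS[onset] + RIMES[rime]
    return None

print(read_aloud("pint"))  # irregular known word, direct route
print(read_aloud("zint"))  # nonword, assembled from onset and rime
```

The nonword receives a novel, regularized pronunciation assembled from its parts, while the known irregular word keeps its stored pronunciation.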
In nonregular domains, where generalization by decomposition and assembly is not possible, the tendency of localist models either to fail to generalize or, when appropriate, to perform generalization to the best matching stock response, might be seen as a distinct advantage. (Many of the points that follow can also be found in Forster 1994, but I believe they bear repetition.) Take the mapping from orthography to semantics, or the mapping from faces to proper names: Is it appropriate to generalize when asked to name an unfamiliar face? Or when asked to give the meaning of a nonword? In a localist model of the general type developed above, the threshold for activating the localist representation of a known face or a known word can be set high enough such that no stock response is generated for such unfamiliar stimuli. When a stock response is required, such as to the question ``Which familiar person does this unfamiliar person most resemble?'', the input threshold might still be dropped, as described above, until a response is forthcoming. It is unclear whether distributed models of, for example, face naming or orthography-to-meaning mapping, particularly those with attractor networks employed to ``clean up'' their output, exhibit this sort of flexibility, rather than attaching spurious names to unfamiliar faces, spurious meanings to nonwords or spurious pronunciations to unpronounceable letter strings.
Are networks that automatically show generalization the most appropriate choice for implementing irregular mappings such as that between orthography and semantics? Forster (1994) suggests not, while McRae, de Sa and Seidenberg (1997), in rejecting Forster's pessimism, note that
``feedforward networks...can learn arbitrary mappings if provided with sufficient numbers of hidden units. Networks that are allowed to memorize a set of patterns sacrifice the ability to generalize, but this is irrelevant when the mapping between domains is arbitrary'' (p. 101).
McRae et al., however, do not show that ``sufficient numbers of hidden units'' would be significantly less than one for each word (i.e., an easily learned localist solution); and, even so, it is not clear what advantages such a distributed mapping would exhibit when compared with a localist lexical route, given that generalization is specifically not required. With regard to Forster's questions about the spurious activation of meaning by nonwords, McRae et al.'s simulations used a Hopfield-type model, with restrictions on data collection allowing them to model the learning of only 84 orthography-to-semantic mappings. A test of the resulting network, using just 10 nonwords, led to ``few or no [semantic] features'' being activated to criterion -- whether Forster would be persuaded by this rather qualified result is doubtful. Plaut et al. (1996) identify essentially the same problem in discussing their semantic route to reading. Their unimplemented solution involves semantic representations that are
``relatively sparse, meaning each word activates relatively few of the possible semantic features and each semantic feature participates in the meanings of a very small percentage of words'' (p. 105)
and they add
``...this means that semantic features would be almost completely inactive without specific evidence from the orthographic input that they should be active. Notice that the nature of this input must be very specific in order to prevent the semantic features of a word like CARE from being activated by the presentation of orthographically similar words like ARE, SCARE, CAR, and so forth'' (p. 105).
Since the mapping between orthography and semantics clearly requires an intermediate layer of mapping nodes, it might seem easiest to ensure this exquisite sensitivity to orthographic input by making these mapping nodes localist lexical representations. Of course this would mean that the second route to reading was the type of localist lexical route the authors explicitly deny. It remains to be demonstrated that a genuinely distributed mapping could exhibit the requisite properties and, again, what advantages such a scheme would enjoy over the rather straightforward localist solution.
Finally, another type of generalization is possible with localist networks, namely, generalization by weighted interpolation. In such a scheme, localist representations of various familiar items activate to a level equal to some function of the degree to which they match an unfamiliar input pattern, the combined output being an activation-weighted blend of the individual output patterns associated with each familiar item. This type of generalization is most appropriate in domains in which mappings are largely regular. A similar arrangement has been postulated, using evidence derived from extensive cell recording, for the mapping between activation of motor cortical neurons and arm movements in primates (Georgopoulos, Kettner & Schwartz 1988). Classifying this so-called population coding as a type of localist representation is perhaps stretching the notion farther than necessary (cf. our earlier comments regarding orientation columns), although it really amounts to no more than acknowledging that each cell in a population will respond optimally to some (presumably familiar) direction, albeit one located in a space with continuously varying dimensions. In some cases it might even be difficult to distinguish between this weighted-output decoding of the pattern of activation across what I'll call the coding layer and an alternative decoding strategy which imagines the cells of the coding layer as a set of localist direction-nodes racing noisily to a criterion, with the winner alone driving the associated arm movement.² A similar distinction has been explored experimentally by Salzman and Newsome (1994), who located a group of cells in rhesus monkey MT cortex, each of which responded preferentially to a given direction of motion manifested by a proportion of dots in an otherwise randomly moving dot pattern.
The monkeys were trained on a task which required them to detect the coherent motion within such dot patterns and to indicate the direction of motion by performing an eight-alternative forced choice task. Once trained, the monkeys were presented with a pattern containing, for example, northerly movement while a group of cells with a preference for easterly movement was electrically stimulated to appropriate levels of activation. The responses of the monkeys indicated a tendency to respond with either a choice indicating north or one indicating east, rather than modally responding with a choice indicating the average direction north-east. The authors interpreted these results as being consistent with a winner-takes-all rather than a weighted-output decoding strategy. Implicit in this interpretation is the monkeys' use of a localist coding of movement direction. It is likely that both decoding strategies are used in different parts of the brain or, indeed, in different brains: The opposite result, implying a weighted output decoding strategy, has been found for those neurons in the leech-brain that are sensitive to location of touch (Lewis & Kristan 1998). More germanely, generalization by weighted output can be seen in several localist models of human and animal cognition (e.g. Kruschke 1992; Pearce 1994).
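The contrast between the two decoding strategies can be made concrete with a small sketch. It assumes four hypothetical direction nodes with preferred directions expressed in degrees anticlockwise from east; the activation values are invented for the example and are not taken from the recordings discussed.

```python
import math

# Hypothetical preferred directions, in degrees anticlockwise from east.
PREFERRED = {"east": 0.0, "north": 90.0, "west": 180.0, "south": 270.0}

def weighted_output(activations):
    """Activation-weighted blend: direction of the summed preferred-direction vectors."""
    x = sum(a * math.cos(math.radians(PREFERRED[name])) for name, a in activations.items())
    y = sum(a * math.sin(math.radians(PREFERRED[name])) for name, a in activations.items())
    return math.degrees(math.atan2(y, x)) % 360.0

def winner_takes_all(activations):
    """Only the most active node drives the output."""
    winner = max(activations, key=activations.get)
    return PREFERRED[winner]

# Visual input favours north while cells preferring east are stimulated.
activations = {"north": 1.0, "east": 0.9}
print(weighted_output(activations))   # an intermediate, roughly north-easterly direction
print(winner_takes_all(activations))  # 90.0, i.e. north
```

A weighted-output decoder reports an intermediate direction, whereas a winner-takes-all decoder reports one of the two preferred directions, which is the pattern the monkeys' responses were interpreted as showing.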
To summarize, contrary to an often repeated but seldom justified assumption, there are (at least) three ways in which localist models can generalize: by output of the most appropriate stock response; by input decomposition and output assembly; or by activation-weighted output.
Another much-discussed feature of networks employing distributed representations is their ability to exhibit ``attractor'' behaviour. In its most general sense (the one I shall adopt here) attractor behaviour refers to the ability of a dynamic network to relax (i.e. be ``attracted'') into one of several stable states following initial perturbation. In many content addressable memory networks, such as that popularized by Hopfield (1982; 1984), the stable states of the network correspond to previously learned patterns. Such attractor networks are often used to ``clean up'' noisy or incomplete patterns (cf. generalization). In mathematical terms, a learning algorithm ensures that learned patterns lie at the minima of some function (the Lyapunov function) of the activations and weights of the network. The activation-update rule ensures that, from any starting point, the trajectory that the network takes in activation space always involves a decrease in the value of the Lyapunov function (the network's ``energy''), thus ensuring that eventually a stable (but perhaps local) minimum point will be reached. Cohen and Grossberg (1983) describe a general Lyapunov function for content-addressable memories of a given form, of which the Hopfield network is a special case.
To see how attractor behaviour (again in its general sense) can be exhibited by a
variant of the localist network described above, assume that we have a two-layer network,
as before, in which the upper layer acts as a dynamic, competitive,
winner-takes-all layer, classifying patterns at the lower layer. Let us
further assume that upper-layer nodes project activation to a third layer,
the same size as the lower layer, via connections whose weights are the same as
those of the corresponding lower-to-upper
connections (see Figure 7). For simplicity, let us assume that the input threshold
is zero. On presentation of an input pattern at the lower layer, the
inputs to the upper-layer nodes will reflect the similarities (e.g. dot
products) of each of the stored weight vectors to this input pattern. If we track the
trajectory of the activation pattern at the third layer as the competition for activation
at the upper layer proceeds, we will find that it starts as a low-magnitude amalgam of the
learned weight vectors, each weighted by its similarity to the current input pattern, and
ends by being colinear with one of the learned weight vectors, with arbitrary magnitude
set by the activation of the winning node. In the nondeterministic case, the third-layer
pattern will finish colinear with the weight vector associated with the upper-layer
node receiving the largest noisy input. Thus the third-layer
activation vector is attracted to one of several stable points in weight-space, each of
which represents one of the learned input patterns. In the noisy case, given the results
presented earlier, the probability of falling into any given attractor will be describable
in terms of a Luce choice rule. This is precisely the sort of attractor behaviour we
require.
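The trajectory described in this paragraph can be traced with a short deterministic sketch. The competitive dynamics (a clipped leaky integrator with lateral inhibition, integrated by Euler steps), the stored weight vectors and the input pattern are all hypothetical; the input threshold is taken to be zero, as in the text.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def output_trajectory(pattern, weights, leak=1.0, inhibition=1.0, dt=0.05, steps=400):
    """Pattern projected by the competing nodes onto the output layer at each step."""
    inputs = [dot(pattern, w) for w in weights]  # similarity of input to each stored vector
    acts = [0.0] * len(weights)
    trajectory = []
    for _step in range(steps):
        total = sum(acts)
        # Winner-takes-all competition among the localist nodes.
        acts = [max(0.0, x + dt * (-leak * x + i - inhibition * (total - x)))
                for x, i in zip(acts, inputs)]
        # Output pattern: stored weight vectors weighted by current node activations.
        trajectory.append([sum(a * w[k] for a, w in zip(acts, weights))
                           for k in range(len(pattern))])
    return trajectory

weights = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
pattern = [0.9, 0.7, 0.2, 0.1]
path = output_trajectory(pattern, weights)
print(cosine(path[0], weights[0]), cosine(path[-1], weights[0]))
```

In this example the output pattern begins as a blend of the stored vectors and ends aligned with the stored vector that best matches the input, which is the attractor behaviour in the general sense used here.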
In certain cases we might allow the upper-layer nodes to project back down to the
nodes in the lower layer rather than to a third layer. In this case (reminiscent of the
ART networks referred to earlier), the activation pattern at the lower layer is attracted towards
one of the stable, learned patterns. This network is essentially an autoassociation
network with attractor dynamics. Such an implementation has some advantages over those
autoassociative attractor networks used by Hopfield and others. For instance, it should be
fairly clear that the capacity of the network, as extended by competing localist
representations, is, in the deterministic case, equal to the maximum number of nodes
available in the upper layer to learn an input pattern. In contrast with the
Hopfield network, the performance of the localist network is not hindered by the existence
of mixture states, or false minima, that is, minima of the energy function that do not
correspond to any learned pattern. Thus localist attractor networks are not necessarily
the same as their fully distributed cousins, but they are attractor networks nonetheless:
whether or not a network is an attractor network is independent of whether or not it is
localist.
Since one can view the lateral inhibitory module as performing a categorization of the
activation pattern, the category being signalled by the identity of the winning
node, the network can naturally model so-called categorical perception effects
(see e.g. Harnad 1987). Figure 8 illustrates the
characteristic sharp category-response boundary that is produced when two representations,
with linear generalization gradients, compete to classify a stimulus that moves between
ideal examples of each category. In essence, the treatment is similar to that of
Massaro (1987), who makes a distinction between categorical perception, and
``categorical partitioning'', whereby a decision process acts on a continuous (i.e.
noncategorical) percept. This distinction mirrors the one between a linear choice rule
acting on representations with exponential generalization gradients and a Thurstonian
choice-process acting on representations with linear generalization gradients, as seen
above. The fact that Massaro describes this partitioning process in Thurstonian terms, yet
models it using the Fuzzy Logic Model of Perception (Oden & Massaro 1978), serves
to emphasize the strong mathematical similarities between the two approaches.
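The sharp category boundary produced by two competing representations with linear generalization gradients can be computed in closed form under simple assumptions. The sketch below supposes a stimulus continuum from 0 (ideal category A) to 1 (ideal category B), one node per category, and independent zero-mean Gaussian noise of a hypothetical standard deviation on each node; the function name is illustrative.

```python
import math

def prob_choose_a(position, noise_sd=0.1):
    """Probability that the category-A node has the larger noisy activation."""
    act_a = 1.0 - position  # linear generalization gradient for A
    act_b = position        # linear generalization gradient for B
    # The difference of two independent Gaussian noises has sd = noise_sd * sqrt(2).
    z = (act_a - act_b) / (noise_sd * math.sqrt(2.0))
    # Standard normal cumulative distribution evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for position in (0.0, 0.3, 0.45, 0.5, 0.55, 0.7, 1.0):
    print(position, round(prob_choose_a(position), 3))
```

Although each node's activation changes linearly along the continuum, the response probability changes steeply around the midpoint, giving a boundary of the categorical kind described above.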