Keywords: cell assemblies; cerebral cortex; coordination; context; dynamic binding; epistemology; functional specialization; learning; neural coding; neural computation; neuropsychology; reading; object recognition; perception; self-organization; synaptic plasticity; synchronization.
This research concerns forms of coding, processing and learning that are common to many different cortical regions and cognitive functions. Local cortical processors may coordinate their activity by maximizing the transmission of information that is coherently related to the context in which it occurs, thereby forming synchronized population codes. In this coordination, contextual field (CF) connections link processors within and between cortical regions. The effects of CF connections are distinct from those mediating receptive field (RF) input. CFs can guide both learning and processing without becoming confused with RF information. Simulations explore the capabilities of networks built from local processors with both RF and CF connections. Physiological evidence for CFs, synchronization, and plasticity in RF and CF connections is described. Coordination via CFs is related to perceptual grouping, the effects of context on contrast sensitivity, amblyopia, implicit influences of color in achromatopsia, object and word perception, and the discovery of distal environmental variables and their interactions through self-organization. In cortical computation there may occur a flexible evaluation of relations between input signals by locally specialized but adaptive processors whose activity is dynamically associated and coordinated within and between regions through specialized contextual connections.
1. Introduction
2. Arguments for and against common foundations for cortical computation
3. Computational studies of the contextual guidance of learning and processing
4. Physiological evidence for contextual integration and synchronized population codes
5. Psychological implications and evidence
6. Issues arising
Notes
References
Figure Legends
The possibility of common foundations for cortical computation was first discussed by the authors (Phillips a psychologist, and Singer a neurophysiologist) in 1980. We had collaborated in the early 1970s, comparing single unit activity in cat lateral geniculate nucleus with the ability of humans to detect the appearances and disappearances of elements in random dot patterns (Phillips & Singer 1974; Singer & Phillips 1974). This background was important because it had helped convince us that psychophysics and neurophysiology could combine fruitfully and in detail. Since then we had not met for some years and the neurophysiologist asked what the psychologist's current interests were. The ensuing conversation went roughly as follows:
Psychologist: Well, my main interest is in the fundamental differences between different cognitive domains. For example, what are the basic differences between sensori-motor systems and the higher conceptual systems? Then, within the conceptual systems what are the basic differences between visuo-spatial processing and verbal processing?
Neurophysiologist: But why are you emphasizing differences? The cortical algorithm is everywhere the same.
Psychologist: Well if that is so it is very interesting, but from the psychological point of view there certainly seem to be some major differences. Consider learning and memory, for example. Information storage within the sensory systems is of very short duration, less than a second in visual sensory storage, whereas once it is put into a schematic conceptual form information can be voluntarily maintained for many seconds in STM, and can be learned and stored indefinitely in LTM.
Neurophysiologist: But there is also long-term plasticity in sensory systems, both during development and later. The receptive fields of cells in primary sensory cortex depend upon the stimulation that they get during development, and these use-dependent modifications of synaptic transmission can also occur in adults.
Psychologist: Yes, of course, such effects are well established, but that is a quite different kind of learning.
Neurophysiologist: Well is it? Why do you suppose that learning and processing in the sensory cortex are fundamentally different from learning and processing elsewhere in cortex? Perhaps they are very similar, and from the neurophysiological point of view that's how it seems.
We are still searching for answers to questions raised by this discussion. Are there information processing operations that are common to different cortical regions and different cognitive sub-systems, and if so, what are these operations, why are they useful, and how are they implemented by cortical processes? Different cognitive functions are of course performed by different cortical regions and at different levels of organization, but all regions of the neocortex share a common basic internal organization, and because of this predominant homogeneity it is also called isocortex. Computational capabilities of general utility may therefore arise from this common design. This paper is concerned with what those capabilities might be and with how they arise from cortical structures and processes.
The organization of cognition into distinct sub-systems is even more firmly established now than it was twenty years ago. This does not imply differences in the information processing operations that they perform, however, because sub-systems may differ in the information upon which they operate, but not in the operations that they perform upon that information. Many cognitive sub-systems are distinguished from each other just in terms of the information on which they operate, but it is also likely that some cognitive functions require special information processing capabilities. These include: episodic memory and working memory; intentional representation, i.e. processes that distinguish between representation and referent; and the creative aspects of language and long-range strategic planning. Higher cognitive functions such as these are central to human mental life, and they depend to a large extent upon cortical activity. These functions may not arise in any simple way from basic capabilities that are common to cortex in general, however, because (i) intentional representation and language are not characteristic of mammals in general but are restricted to just one or at most a few; (ii) in contrast to skills, episodic memories cannot be acquired in the absence of the hippocampus (Squire 1992), and may require special computational capabilities (McClelland et al. 1995); and (iii) the ability to dynamically create more than one level of grouping within the same set of units, such as ((AB)(CD)), may involve special computational problems (Fodor & Pylyshyn 1988; Hummel & Holyoak 1993). Thus our working assumption is that some cognitive functions require special capabilities in addition to those that are common to cortex in general. Furthermore, although we take the abilities that are provided by the common foundations for granted, they are crucial to the sensory, perceptual, and motor skills on which our daily lives depend.
The notion of functional specialization summarizes the vast body of findings showing that different cortical regions and different cells within regions transmit information about different things. Discussions of how the activity of these distinct processors can be coordinated have an equally long history, but this aspect remains much less well understood. The particular form of integration with which this paper is concerned is that which arises from a myriad of local coordinating interactions between pyramidal cells within and between cortical regions. This does not deny that the music of the hemispheres might be guided by some kind of conductor, but it does imply that integration can be achieved, at least in part, through local interactions between the players themselves. Musicians have two different sources of information which they normally use in two different ways. They have the score to tell them what to play, but they also watch and listen to each other to determine exactly when and how loudly to play it. The local processors that we postulate also have two classes of input. One is the receptive field input which tells them what features to signal, and the other is contextual input from the concurrent activity of other processors which is used to determine exactly when and how confidently to signal the features for which they have evidence.
A simple, general, and precise framework for describing functional specialization in neural systems is provided by the adaptive filter formulation (Carpenter, 1989). The basic idea is that the strengths of the synapses that mediate receptive field input perform a selective filtering operation that can be adapted through experience to better meet the environment and tasks to which the system is exposed. Filtering is necessary for at least two reasons: i) the amount of sensory data to be processed is so great that predictive relationships can only be found after dimensionality reduction; and ii) different information is relevant to different purposes.
We now need to add contextual integration to such formulations. Filtering is useful because it contributes to the more general goal of making good predictions. Predictive relationships of varying degrees of complexity are richly embedded within the input to the cortex, across both space and time, and the discovery and use of these relationships is a major goal of cortical computation at all stages and levels of processing. The integrative interactions that we hypothesize can be thought of as using these predictive relationships to produce patterns of activity that are coherent both within and between various streams and levels of processing. There is evidence that this involves synchronizing the activity of dynamically specified sub-sets of cells, using special synchronizing contextual connections that influence the probability that the target cells fire at any moment (e.g. Singer 1990, 1993, 1994, 1995; Singer & Gray 1995; Engel et al. 1992). A crucial aspect of this form of integration is that context affects activity without corrupting the information that is transmitted by that activity about the cell's receptive field input. Here we summarize the evidence for synchronization and for contextual connections, and we study computational capabilities that arise when cortical processors receive local contextual inputs that can be used to guide both learning and processing.
In the remainder of Section 1 we outline the issues and hypotheses to be discussed. Section 1.2 gives an informal outline of the possibilities that arise when local cortical processors coordinate their activities by using specialized contextual inputs to form synchronized population codes and to guide learning. Section 1.3 relates the codes and processes that we propose to other aspects of cortical function. Section 1.4 reviews prior proposals using synchronized population codes and contextual guidance. Section 1.5 summarizes the main hypotheses that we expect to be controversial. Section 2 outlines arguments for and against the hypothesis of common foundations for cortical computation. Section 3 specifies the goals of contextual guidance formally using concepts provided by information theory and multivariate statistics, and outlines computational studies showing the basic capabilities of simple networks built from local processors with contextual guidance. Section 4 outlines evidence from neurobiology for contextual guidance and synchronized population codes. Section 5 discusses the relevance of these hypotheses to psychological issues, outlining evidence that is already available from behavioral studies as well as ways in which our suggestions can be further tested and developed by such studies. Finally, Section 6 discusses a few of the issues that arise from these hypotheses.
A starting point for our approach is the hypothesis that, although cortical circuits are constrained to operate just upon the information that is locally available to them, coordination of their activity with what is going on elsewhere is central to their computational role. This is possible because they receive locally specific contextual input from other processors (but directly only from a tiny fraction of the other processors in the cortex as a whole). The contextual input is used to selectively enhance the transmission of that information in the processor's receptive field (RF) input that is coherently related to the context. Networks of such processors therefore tend to transmit sets of signals that as far as possible maximize their mutual coherence. As a signal only transmits information if it varies, we call this capability the maximization of coherent variation. Useful consequences that follow from it are discussed below, and in Section 1.2.3 in particular.
The usefulness of the ability to organize distributed patterns of activity into coherent groups is widely acknowledged in discussions of the "binding problem". This problem would be solved if cells currently forming a coherent group synchronized their spike trains to within a few msec. This possibility was proposed by Milner (1974) and has long been advocated on both theoretical and biophysical grounds (von der Malsburg 1981). Neurophysiological evidence to be outlined in Section 4 now suggests that the spiking activity of cortical neurons can be synchronized to within a few milliseconds in a way that is appropriate to the prevailing context, and which includes synchronization between neurons in different streams of processing and between neurons at different stages of processing.
Synchronization would be an effective signal for grouping because inputs to pyramidal neurons are summed much more effectively if they are synchronized (Abeles 1982, 1991; Bernander et al. 1991). It is an inherently relational signal because it depends upon temporal relations between inputs from separate sources. Thus, unlike the more commonly studied rate and place codes, it is not defined upon the signals produced by individual cells, and will not be revealed by studies of single cell activity.
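The summation argument can be illustrated with a minimal sketch, assuming a passive leaky integrator in which each input adds a unit step that then decays exponentially. The function name, the 10 ms time constant, the unit input size, and the arrival times are all illustrative assumptions and are not taken from the studies cited above.

```python
import math

def peak_depolarization(arrival_times_ms, tau_ms=10.0, epsp=1.0, t_end_ms=60):
    """Peak summed potential of a passive leaky integrator (toy model).

    Each input adds a step of size `epsp` at its arrival time, which then
    decays exponentially with time constant `tau_ms`. Time is sampled on an
    integer millisecond grid. All parameter values are assumptions.
    """
    peak = 0.0
    for t in range(t_end_ms + 1):
        v = sum(epsp * math.exp(-(t - t0) / tau_ms)
                for t0 in arrival_times_ms if t >= t0)
        peak = max(peak, v)
    return peak

# Ten inputs arriving at the same instant versus the same ten inputs
# spread over 36 ms (one every 4 ms).
synchronized = [10] * 10
dispersed = [10 + 4 * i for i in range(10)]

peak_sync = peak_depolarization(synchronized)
peak_disp = peak_depolarization(dispersed)
print(f"synchronized peak: {peak_sync:.2f}, dispersed peak: {peak_disp:.2f}")
```

Under these assumptions the synchronized volley reaches a peak more than three times that of the dispersed one, so a firing threshold placed between the two would be crossed only when the inputs arrive together.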
A major feature of the work on synchronization is that it suggests the existence of specialized cortico-cortical synchronizing connections that modulate post-synaptic activity without corrupting the information that is transmitted about the receptive field features to which the cell is selectively sensitive (Engel et al. 1991b; Munk et al. 1992; Lowel & Singer 1992; Konig et al. 1993). That is, they help determine exactly when a cell fires, but they do not change the feature that is signaled by that activity. To explain how this is possible, it is often suggested that the synchronizing connections influence the phase but not the amplitude of the oscillatory outputs produced by the local processors (e.g. Hummel & Biederman 1992; Shastri & Ajjanagadde 1993; Schillen & Konig 1994). Though useful, this may not be a general solution to the problem of combining both feature and grouping information within the same signal, however (e.g. see Nelson 1995). First, although oscillations are likely to play a major role in synchronization, they are not necessary because even single impulses can be synchronized. Second, there is evidence that synchronization can occur without oscillations (Konig et al. 1995). Third, there is doubt as to the generality of oscillations in the normal functioning of cortex (e.g. Tovee & Rolls 1992; Young et al. 1992; Bair et al. 1994). The computational studies described in Section 3 show how contextual inputs can guide processing without corrupting the transmission of RF information, and they do this in a way that does not require oscillations even though it is compatible with them.
Another major focus of this paper is on the possibilities that arise for learning when processors receive local contextual inputs. The computational studies outlined in Section 3 show that idealized local processors with contextual inputs can discover those receptive field features that are predictably related to the context within which they occur together with discovering the predictive relations between them. That is, in addition to learning the associations between features, the local processors can preferentially discover those features that are associated.
Local processors with contextual guidance receive a set of receptive field (RF) inputs and a set of contextual field (CF) inputs (Figure 1a). These processors are intended to be loosely analogous to local cortical circuits. In relation to the RF input they act as filters that transmit information about the RF features to which they are selectively sensitive. This selective sensitivity is specified by the strengths, W, of the synaptic connections that mediate the RF input. In addition, the probability that they transmit information about any RF feature at any moment is increased if that feature is as predicted by the context, and is decreased if it is incompatible with that prediction. The predictions are specified by the CF inputs as mediated by the strengths, V, of their synaptic connections. A crucial aspect of this form of processing is that the predictions are not confused with the RF evidence. The role of context is not to impose its predictions upon the processor, but to emphasize those outputs for which there is RF evidence and which are coherently related to the context. This can be done by using context to influence the confidence with which decisions are made on the basis of the RF evidence and to synchronize coherent outputs.
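The requirement that context modulates the output without being confused with the RF evidence can be made concrete with a small sketch. W and V are the synaptic strengths named above; everything else is an assumption made for illustration, including the names r and c for the integrated RF and CF inputs, the logistic output probability, and the particular modulatory function A(r, c) = 0.5 r (1 + exp(2rc)), which is one function with the stated properties and is not specified in this section.

```python
import math

def integrated_fields(rf_inputs, cf_inputs, W, V):
    """Weighted sums of the RF inputs (weights W) and CF inputs (weights V)."""
    r = sum(w * x for w, x in zip(W, rf_inputs))
    c = sum(v * y for v, y in zip(V, cf_inputs))
    return r, c

def activation(r, c):
    """Assumed modulatory form: context scales the RF drive r.

    If r is zero the result is zero whatever the context, and the sign of
    the result always equals the sign of r, so context can neither create
    nor reverse the RF evidence. Agreement (r and c of the same sign)
    amplifies the drive; disagreement attenuates it.
    """
    return 0.5 * r * (1.0 + math.exp(2.0 * r * c))

def output_probability(r, c):
    """Logistic probability that the processor signals its RF feature."""
    return 1.0 / (1.0 + math.exp(-activation(r, c)))

# Hypothetical weights and inputs, chosen only to exercise the functions.
W, V = [0.5, 0.5], [1.0]
r, c = integrated_fields([1.0, 1.0], [1.0], W, V)   # r = 1.0, c = 1.0

p_agree = output_probability(r, c)        # context agrees with RF evidence
p_neutral = output_probability(r, 0.0)    # no contextual prediction
p_disagree = output_probability(r, -c)    # context disagrees
p_no_evidence = output_probability(0.0, 5.0)  # strong context, no RF evidence
```

In this sketch the output probability rises when the context agrees with the RF evidence and falls when it disagrees, but it stays above 0.5 whenever r is positive and is exactly 0.5 when there is no RF evidence, however strong the context.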
The strengths of the synapses, W and V, are not permanently fixed, but can change so as to better adapt the detailed operations performed by the processors to the statistical structure of the inputs that they receive. We hypothesize that providing local processors with contextual input enhances the learning of which they are capable. Major issues to be discussed are therefore what the goals of this learning could be, and by what synaptic modifications they may be achieved.
As illustrated by the width of the arrows in Figure 1, processors are assumed to have fewer outputs than RF inputs. This reflects the long-standing hypothesis that a major goal of sensory and perceptual processes is recoding to reduce redundancy (Attneave 1954; Barlow 1959, 1961, 1972, 1989; Linsker 1988; Barlow & Foldiak 1989; Foldiak 1990; Baddeley & Hancock 1991; Atick 1992; Atick & Redlich 1990, 1993; Redlich 1993; Li & Atick 1994). The underlying idea is that the flood of data to be processed can be reduced to more manageable amounts by using the statistical structure within the data to recode the information that it contains, with frequent input patterns being translated into codes that contain much less data than the patterns themselves. If the computational goals can be clearly specified, such as by using information theory, then rules for learning can be derived from those goals (Intrator & Cooper 1995a). An important limitation of recoding as a goal for cortical computation is that it is ultimately sub-ordinate to the goal of associative learning. There would be no point in recoding information about variables that have no relation to anything else known to the system. Proponents of recoding to reduce redundancy therefore usually see it as being preparatory to associative learning (e.g. Barlow 1993). A distinctive advantage of the processors that we propose here is that they are not forced to transmit just whatever RF variables carry the most information, but can selectively discover those that are associatively related to the context within which the processor operates.
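The idea that frequent patterns can be translated into shorter codes can be illustrated with a toy calculation. The four patterns and their probabilities below are hypothetical, and the ideal code length of -log2(p) bits is the standard information-theoretic quantity rather than anything specific to the models discussed here.

```python
import math

# Hypothetical input patterns and their probabilities of occurrence.
pattern_probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# A fixed-length code ignores the statistics: log2(4) = 2 bits per pattern.
fixed_bits = math.log2(len(pattern_probs))

# An ideal variable-length code assigns -log2(p) bits to a pattern of
# probability p, so frequent patterns receive short codes.
ideal_bits = {name: -math.log2(p) for name, p in pattern_probs.items()}

# The average length of the recoded signal equals the entropy of the source.
expected_bits = sum(p * ideal_bits[name] for name, p in pattern_probs.items())

print(ideal_bits)
print(f"fixed: {fixed_bits} bits, recoded: {expected_bits} bits on average")
```

Recoding of this kind shortens the signal without losing information, but it is indifferent to whether the patterns are related to anything else, which is the limitation noted above.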
Many different network architectures can be built from such processors. Felleman and Van Essen (1991) review a great deal of evidence concerning the overall system architecture within which local cortical processors operate. They distinguish three broad classes of connections between cortical regions: ascending feedforward connections, descending feedback connections, and lateral connections between regions that are at approximately the same stage of processing. The ascending projections from one stage to the next are localized such that neurons receive their primary ascending inputs from a small sub-set of neurons at the preceding stage, with different local groups of neurons receiving from different sub-sets. Thus, many distinct streams of processing project through a few stages in converging and diverging ways with primary feedforward connections being distinguishable from lateral and descending connections.
The local processors hypothesized here are broadly compatible with such an architecture. The ascending connections could provide much of the RF input, and both the descending and the lateral connections could include CF input. Furthermore, within cortical regions there are many distinct streams of processing and these are linked by long-range horizontal collaterals that could transmit synchronizing contextual information. Mutual contextual guidance of this sort is shown in Figure 1b and it could link distinct streams of processing both within and across cortical regions. Another possibility is that contextual inputs could be received from processors to which the processor concerned contributes RF input. This is shown in Figure 1c, and it would enable activity to be coordinated across different stages of processing, as well as within stages. Finally, Figure 1d shows that one set of RF signals can be transmitted to separate processors with different contextual fields. Each processor will emphasize the RF information that is relevant to its context, thus enabling different processors to extract different aspects of the same RF activity.
The patterns of connectivity shown in Figure 1 are not mutually exclusive, and can be combined in various ways. Furthermore, separate modules with recurrent internal RF connectivity could be linked by CFs that coordinate their activities. We assume that genetic constraints play a major role in determining patterns of RF and CF connectivity, and that the CF inputs are specific to the role of each local processor, just as are the receptive field inputs. For example, at early stages of the analysis of a visual scene other parts of the scene might provide a useful context, whereas at later stages of processing information from other modalities might provide a more appropriate context. If local cortical processors do receive specialized contextual inputs as proposed then principles or heuristics for determining where they should come from will be an important issue. Tononi et al. (1994, 1996) have shown that the overall pattern of cortical connectivity balances functional integration, produced here by the contextual inputs, against functional segregation, produced here by the RF filter functions, so as to produce a system with high complexity, high computing power, and the ability to use context to "go beyond the sensory information given" in an appropriate way.
1. The most important capability that arises in relation to processing is that the effects of context will be such as to increase the probability that mutually coherent sub-sets of units will be active at any moment. That is, they will tend to produce synchronized population codes. As has already been argued in detail elsewhere (Singer 1990; 1993; 1994) such codes have important advantages: they are flexible because they are created dynamically; they are fast because in the limit all that needs to be synchronized are single spikes from each of the cells to be grouped; they can signal many different patterns because each cell can at different times be part of many different groups; they do not compromise the meaning of the signals to be grouped; and finally, they transmit appropriately structured patterns of activity rather than just arbitrary or unstructured labels (Phillips 1996; Phillips et al. 1995a).
A layer built from local processors with contextual guidance therefore produces patterns of activity where the RF filter functions ensure that the individual signals are justified by the RF input and the contextual connections maximize their coherence as a group. This implies that in the case of perceptual grouping, for example, the Gestalt criteria are embodied in the CF connections between the entities that are grouped. This predicts that synchronization of active cells in the visual system should reflect Gestalt principles of grouping, and evidence that this is so will be outlined in Section 4.
One way to see what is implied by dynamic flexibility is to note that if these processes are common to cortex then the inputs to each area will themselves be organized by the grouping processes operating in the areas generating those inputs. The RF filter functions will therefore not operate upon rigidly fixed data bases but upon ones that are already organized so as to emphasize coherent sub-sets of data within the RF. As those coherent subsets can occur within each RF in a very large number of different ways this enables the receiving cells to respond appropriately to many more inputs than they could do without the dynamic grouping. In addition, there is evidence that cortex receives inputs that are dynamically grouped in the thalamus (Sillito et al. 1994) and retina (Neuenschwander et al. 1996). (see Note 1)
2. In addition to grouping related features that are clearly evident in the RF input, local contextual information could improve the perception of features that are weak or ambiguous.
3. Important capabilities that arise in relation to learning are that contextual input enables local processors to become selectively sensitive to those variables within their RF input that are predictably related to that context, and to do so together with learning the predictions. (see Note 2)
The possibility of using relations between separate data-sets as a basis for self-organization is illustrated in Figure 2. As an example of this approach Becker and Hinton (1992) show how stereo-depth can be discovered by using the mutual information between separate streams of processing that receive inputs from neighboring patches of the image that are independent except for having the same disparity. Stone and Bray (1995) and Stone (1996) have also shown how coherence across time can be used to learn invariances and other salient visual parameters. (see Note 3) A general epistemological argument for this approach is that predictive relationships between diverse data-sets must depend in some way upon their distal origins. Discovering those relationships will therefore reveal distal variables and interactions within the proximal data.
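The use of relations between separate data-sets can be illustrated with a small simulation. This is a sketch under stated assumptions and not the method of any of the studies cited: two streams are taken to share one distal variable while each also carries a private variable, all variables are Gaussian, and the mutual information between two outputs is estimated from their correlation rho using the bivariate Gaussian identity I = -0.5 ln(1 - rho^2). The noise level, sample size, and all names are illustrative.

```python
import math
import random

def pearson(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def gaussian_mutual_information(xs, ys):
    """Mutual information in nats, assuming jointly Gaussian variables."""
    rho = pearson(xs, ys)
    return -0.5 * math.log(1.0 - rho * rho)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 2000

# One shared distal variable (e.g. a common disparity) seen by both streams.
distal = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Outputs that track the shared variable, each with independent noise.
shared_a = [d + 0.3 * rng.gauss(0.0, 1.0) for d in distal]
shared_b = [d + 0.3 * rng.gauss(0.0, 1.0) for d in distal]

# Outputs that track variables private to each stream.
private_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
private_b = [rng.gauss(0.0, 1.0) for _ in range(n)]

mi_shared = gaussian_mutual_information(shared_a, shared_b)
mi_private = gaussian_mutual_information(private_a, private_b)
print(f"shared: {mi_shared:.3f} nats, private: {mi_private:.4f} nats")
```

A learning rule that increases the mutual information between the outputs of the two streams would therefore be driven toward the shared distal variable and away from the private ones, which is the sense in which predictive relations between data-sets reveal distal variables.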
4. Processors with contextual inputs from different sources, as shown in Figure 1d, will become selectively sensitive to just those aspects of their inputs that are relevant to those contexts. Thus this will help create appropriate functional specializations, and it will entail appropriate generalizations because the outputs of each processor will generalize across irrelevant dimensions of RF variation. To recognize facial expressions we must do so despite variations in personal identity; to recognize individual faces we must do so despite variations in facial expressions. Variables that are crucial to one goal may be irrelevant sources of noise to another. This problem would be solved if different cortical regions have a selective sensitivity to just those variables that are relevant to their role, and the evidence suggests that this is how face perception is organized (Bruce 1988); but how can local processors know what is relevant? Genetic specification cannot be the whole answer because some functional specialization is established through learning. Context could contribute to such learning by guiding RF selectivity to the relevant variables. For example, regions receiving RF input from visually perceived faces and CF input from regions concerned with evaluating emotional states could then learn to become selectively sensitive to just those variables in face images that are predictably related to emotional expression.
The CF inputs that we postulate are not equivalent to inputs from "beyond the classical receptive field" in general. Many investigations show that the response of cortical cells to their preferred stimulus is suppressed by the presence of similar stimuli in the surround (e.g. Blakemore & Tobin 1972; Nelson & Frost 1978; Allman et al. 1985; Knierim & Van Essen 1992). These effects are not due to the CF inputs that we propose because they are not concerned with producing coherent patterns of activity across multiple feature detectors, but with using information about the surround to suppress responses that do not differ from that surround. This is quite different from the use of contextual predictions to increase the probability of signals that agree with those predictions. It is more appropriate to view the subtraction of activity that is summed over some surrounding region as being included in the mechanisms that determine RF selectivity, rather than as being part of the mechanism for coordinating the activity of many simultaneously active feature detectors.
Receptive fields that emphasize contrast with the surround show how the maximization of coherence is compatible with evidence for processes that emphasize the unexpected. If a single element differs from the others in an array on some simple variable, such as color or orientation, then the odd one out is very noticeable. This is evidence for processes that emphasize what is not predicted by the surround, and we account for that evidence by noting that RFs usually develop so as to detect such differences. Enhanced transmission of the unexpected is also proposed by some theories to be a major role for the descending feedback projections from higher to lower stages of processing (e.g. Mumford 1992; Pece 1992). In contrast to these theories, Sillito et al. (1994) provide evidence that feedback from V1 to the LGN synchronizes the activity of those LGN cells that agree with the interpretation at the higher level. Furthermore, psychological experiments show that context often supports what is consistent with that context (e.g. Biederman 1972; Palmer 1975; McClelland 1978; McClelland, Rumelhart & Hinton 1986).
We also need to distinguish the population codes proposed here from the population vector codes proposed by Georgopoulos (1990). The latter is a proposal about how a single vector could be signaled by the activity of a group of cells. Synchronization specifies which cells are in the same group. Synchronized population codes are compatible with but do not require population vector coding.
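For comparison, population vector coding can be sketched as follows, assuming eight cells with evenly spaced preferred directions and idealized cosine tuning. The cell count, the tuning function, and all names are illustrative assumptions. The sketch shows only how a single vector is read out from a group whose membership is already given; it says nothing about which cells belong to the group, which is the question that synchronization addresses.

```python
import math

def population_vector(preferred_deg, rates):
    """Direction (degrees) of the rate-weighted sum of preferred directions."""
    x = sum(r * math.cos(math.radians(p)) for p, r in zip(preferred_deg, rates))
    y = sum(r * math.sin(math.radians(p)) for p, r in zip(preferred_deg, rates))
    return math.degrees(math.atan2(y, x)) % 360.0

# Eight hypothetical cells with preferred directions every 45 degrees.
preferred = [i * 45.0 for i in range(8)]

# Idealized cosine tuning; each rate is expressed as a deviation from
# baseline, so it can be negative for cells tuned away from the stimulus.
true_direction = 60.0
rates = [math.cos(math.radians(true_direction - p)) for p in preferred]

decoded = population_vector(preferred, rates)
print(f"decoded direction: {decoded:.1f} degrees")
```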
Finally, we need to relate the local contextual guidance that we hypothesize to spatial attention, arousal, and other strategic control processes. Local contextual guidance arises automatically from the interactions of local processors and does not require specialized circuitry such as that hypothesized to be involved in spatial attention (e.g. Van Essen et al. 1994; Posner & Rothbart 1994). Furthermore, the contextual inputs that we postulate are highly specific in relation to both timing and the features that they predict. Spatial attention seems to operate on a longer time scale, and simply to enhance the processing of whatever features are present at the locations and spatial scales attended, rather than to enhance some features and to oppose others on a locally specific basis (Nakayama & Mackeben 1989; Krose & Julesz 1989). Nevertheless, if there are local contextual interactions of the kind that we postulate then attentional control processes will operate upon the synchronized population codes that those interactions produce, possibly themselves using mechanisms that increase the synchronicity of attended items (see Tiitinen et al. 1993 for empirical evidence on this from EEG recordings, and see Goebel 1993 for computational studies).
The idea of population coding has a long history, with Hebb's (1949) notion of the cell-assembly serving as the leading representative. The further possibility that synchronized activity on a fine time scale specifies which sub-set of neurons are grouped to form the population code at any moment also has a long history (e.g. von der Malsburg 1981; Wang et al. 1990; Crick & Koch 1990; Eckhorn et al. 1991a,b; Engel et al. 1992; Tononi et al. 1992a,b; Abeles et al. 1993a,b; Bienenstock 1995; Yamaguchi & Hiroshi 1994; Goebel 1993). It has also been shown how synchronization can play a major role in neural network architectures based upon adaptive resonance (ARTMAPS) and upon the boundary contour system (BCS) (Grossberg & Somers 1991; Grossberg 1993).
Various versions of the distinction between RFs and CFs also occur in prior theories. The coupling connections referred to in previous discussions of the substrate of synchronization (e.g. Singer 1990; 1993; 1994) are an example of what are referred to here as CFs. Further examples are the linking connections proposed by Eckhorn et al. (1991a), and the fast enabling links proposed by Hummel and Biederman (1992). Some aspects of the distinction also occur in theories that do not rely upon synchronization. Ullman (1994), for example, proposes that there is a bi-directional, bottom-up and top-down, flow of information in which activity in one stream "primes" activity in the equivalent units in the reverse stream. This priming is such as to increase the probability of transmission of any RF information with which it agrees. It uses mechanisms that differ from those of CFs but has similar effects in that it modulates the probability of signals being transmitted in very locally specific ways, and it does so without corrupting the transmission of RF information.
One of the earliest theories with a distinction that is analogous to that between RFs and CFs is that of Edelman (1978, 1989). This approach has been developed using highly detailed synthetic modeling (Reeke et al. 1990), which uses large simulations (from about 1 to about 3.5 million connections), with many biological features from intra-cellular processing to the overall principles of connectivity being built-in. The models developed achieve perceptual grouping (Sporns et al. 1991), form recognition that is independent of color and position (Tononi et al. 1992b), and also account for several other perceptual phenomena. Detailed analysis of their fine-grained internal temporal dynamics fits well with that observed in cortex (Sporns et al. 1989; Tononi et al. 1992b). Phasic reentrant signaling is crucial to the success of these simulations. Its functional role and mechanisms are closely analogous to those proposed here for contextual input. Our approach differs from theirs only in emphasis, however. For example, we put more stress upon simplifying computational studies, upon formal descriptions of the underlying computational goals, and upon the possibility that information supplied by the contextual connections could guide RF learning thereby helping to establish some of the functional specialization that studies of reentrance have so far built-in.
The use of predictive relationships between separate streams of processing to guide learning within streams has also been studied previously (e.g. Becker & Hinton 1992; Becker 1996; Schmidhuber & Prelinger 1993; De Sa 1994 a,b; Stone & Bray 1995, Stone 1996). Although there is basic agreement between our work and that of Becker and Hinton (1992), there are some important differences. One difference is that we emphasize the use of context to coordinate ongoing activity and to form synchronized population codes, whereas Becker and Hinton communicate information between streams of processing for the purposes of learning only. The reason they did this was to ensure that distinct streams of processing could not increase the mutual information in their outputs simply by driving each other. Thus in the approach of Becker and Hinton (1992) local processors receive inputs that are used to change synaptic strength without directly affecting post-synaptic activity. We know of no biological evidence for such a process. Furthermore, a second major difference that follows from the first is that in their approach there are no cross-stream predictions to learn, whereas in our approach the CF predictions play a major role because they embody the knowledge that is used to integrate ongoing activity and form synchronized population codes. A third difference is that in our approach a single parameter specifies the balance between maximizing information transmission within streams and maximizing coherence between streams. No such parameter exists within their approach.
The above considerations lead to hypotheses that we expect to be controversial. The first is that there are basic computational capabilities that are common to many different cortical regions and to many different species. Second, in relation to the general functional role of any such capabilities, our hypothesis is that they include processes that gradually adapt those computations to the general statistical structure of the world in which the cortex finds itself, and they do so by maximizing the transmission of information that is predictably related to the context within which it occurs. Third, in relation to coding, we argue for synchronized population codes. Such codes contrast with single-cell codes in that they convey information about internal structure, and they contrast with the more usual form of distributed code in that stored knowledge is used to group the elements into coherent sub-sets. Fourth, in relation to the short-term processing dynamics, we hypothesize that local processors use contextual predictions to guide processing but without confounding those predictions with the information that they transmit about their receptive field inputs. This contrasts with the assumption, common to many connectionist theories of cognition, that local processors treat all of their specific informative inputs in essentially the same kind of way. Fifth, in relation to learning, we propose that RF features that are predictably related to the context within which they occur can be discovered, and that this occurs together with discovery of the predictive relationships between them. This contrasts with the common assumption that feature discovery is independent of associative learning. Finally, in relation to epistemology, we suggest that by discovering latent variables within diverse data-sets and the relations between them the local processors are in effect discovering distal variables and relationships. 
As a consequence they lay foundations for representation and meaning. Nevertheless, we will argue that these foundations do not constitute intentional representation proper because such local processors do not distinguish between the signals that they receive and the distal causes from which those signals arise.
It is well established that cortex contains many specialized regions but our central concern is with their internal organization. In what ways is it common, and in what ways does it vary? Although differences exist there is a widespread belief in commonalities: "It is easy to recognize a histological (e.g. Golgi) preparation as being cortex rather than cerebellum or tectum. It is much more difficult to tell whether it is human or bovine, motor, sensory, or associative cortex.", (Braitenberg 1978, page 444); "Laminations and vertical connections between laminae are hallmarks of all cortical systems, the morphological and physiological characteristics of cortical neurons are equivalent in different species, as are the kinds of synaptic interactions involving cortical neurons. This similarity in the organization of the cerebral cortex extends even to the specific details of cortical circuitry.", (White 1989, page 179); "Despite the many detailed properties that can be used to differentiate among the various cortical areas, the common properties of all the cortical areas are overwhelming. The same cell types, the same types of connections, and the same distributions of cells and connections across the cortical depth are found in all parts of the isocortex. These properties of the cortex are markedly different from those found in the other parts of the brain.", (Abeles 1991, page 33). If there are commonalities then it is crucial to find out what they are. For extensive reviews of this issue see Edelman and Mountcastle (1978), Rakic and Singer (1988), Martin (1988), White (1989), Shepherd (1990), Braitenberg and Schuz (1991), and Abeles (1991). Commonalities may exist at a number of different levels of organization and with respect to various aspects of function. 
Some may arise from small populations of pyramidal cells and their associated local circuit neurons, such as proposed for the "canonical circuit" of Douglas and Martin (1990), or the "basic circuit" of Shepherd and Koch (1990). Others may arise at lower levels such as the morphology and physiology that is common to pyramidal cells. Note, therefore, that a common multi-cellular circuit is not necessarily implied by the hypothesis of common foundations for cortical computation, because some of them may arise at other levels of organization.
The basic homogeneity of the neocortex is widely thought to imply common information processing operations: "The typical wiring of the cortex, which is invariant irrespective of local functional specialization, must be the substrate of a special kind of operation which is typical for the cortical level.", (Braitenberg 1978, page 444); "It is taken as an article of faith that there is an information processing algorithm unique to cortex that is a product of the regularities of its architecture." (Stryker et al. 1988, page 133). "For many anatomists, it seems perverse to regard the visual cortex as an ad hoc collection of specialist circuits, rather than a set of basic circuits adapted to perform many different tasks. ... For the neocortex, an unconventional class of models needs to be developed - models that are neural networks, but based directly on the biology; derived from visual cortex, but not designed to solve a particular problem in visual processing.", (Douglas & Martin 1991, pages 291-292). Such views have a long history (e.g. Lorente de Nó 1949; Edelman & Mountcastle 1978; Rockel et al. 1980).
The belief in commonalities is supported by evidence that cortex contains some generalized learning algorithm that adapts each region to the input that it receives. For example, it has been shown that sensitivity to visual features can be induced in the primary auditory cortex of neonatal ferrets by replacing its normal auditory input with visual projections (Sur et al. 1988). Similarly, it has been shown that visual cortex has the potential to develop an array of functional units that is appropriate to the somatosensory input (Schlaggar & O'Leary 1991).
Although these arguments for commonalities have force, they are not conclusive. Noting similarities will not be convincing until we can clearly see how they provide capabilities that are of common utility. Differences that are critical from a computational point of view may not be obvious from an anatomical or physiological point of view. Furthermore, the suggestion that some form of columnar organization is common to the whole of cortex can be criticized (e.g. Swindale 1990; Purves et al. 1992). It is therefore important to note that although "cortical columns" are not central to the hypotheses developed here, criticism of this idea suggests limitations upon anatomical arguments for commonalities.
Functional specialization is also a major feature of cognitive organization. This has been established by studies of both normal and brain damaged subjects, with cognitive neuropsychology providing a rich source of data and theory (Ellis & Young 1988; Shallice 1988; McCarthy & Warrington 1990). Functional specialization is most firmly established for perceptual and motor functions. Its existence and nature within the highest level functions such as strategic control is less firmly established but there is some evidence for it even there (Shallice 1988; 1991).
The inferences drawn from cognitive and neuropsychological investigations are often shown in diagrams of functional specialization and information flow. Our primary concern here is not with this system level of organization, however, but with the operations that are performed by the various cognitive sub-systems. What are these operations, and which, if any, are common to different sub-systems? Mapping the cognitive architecture is a complex and important task, but adding or deleting sub-systems and routes between them will only be crucial to the search for commonalities to the extent that this changes the set of basic computational capabilities required. What those capabilities are is not obvious, and this issue needs wider discussion.
One simple aspect of cognitive neuropsychological theory that suggests common operations is that sub-systems are often distinguished by the content of the information with which they are concerned. This suggests that they differ primarily in what they operate upon, rather than in the operations that they are required to perform upon that information.
In contrast to our emphasis upon commonalities, studies of basic cognitive processes can give rise to skepticism about the value of a search for general principles. Crick reports the view to which Rama Ramachandran has been led by his elegant and ingenious psychophysical studies as follows: "It may not be too farfetched to suggest that the visual system uses a bewildering array of special-purpose tailor-made tricks and rules-of-thumb to solve its problems. If this pessimistic view of perception is correct, then the task of vision researchers ought to be to uncover these rules rather than to attribute to the system a degree of sophistication that it simply doesn't possess. Seeking overarching principles may be an exercise in futility." (Crick 1988, page 156). Even with respect to Ramachandran's argument, however, Crick then adds: "It is, of course, possible that underlying all the various tricks there are just a few basic learning algorithms that, building on the crude structures produced by genetics, produces this complicated variety of mechanisms." (1988 page 156).
The cerebral neocortex evolved as an add-on to pre-existing neural systems, and has expanded rapidly at various stages of mammalian evolution (Jerison 1973). The speed of this evolution has been used to support the view that it embodies a multi-purpose form of computation: "Neocortex has expanded rapidly in phylogeny by creating multiple new areas. While mammals with very small cortices have behavioral capacities no more impressive than noncorticate animals, the capacity for rapid phylogenetic change may be the most important feature of cortex.", (Stryker 1988, page 133); "There is a separate and important evolutionary function that a generic principle for the development of a perceptual network layer - whether it be infomax or some other principle - can serve. Suppose that an evolutionary mutation produces a modified eye, or merges the auditory signals into the visual pathway at some new point. If there were no generic principle for layer development, we might imagine that mutations would have to occur simultaneously in the processing function of several layers, for those layers to be able to use the novel input properly. But if there is such a generic principle - one that applies to each layer regardless of what type of input reaches it - then the novel input will automatically be processed in accordance with that principle. This suggests that the existence of a generic principle may greatly increase the likelihood of a mutation being adaptive.", (Linsker 1988, page 116 - 117).
Another evolutionary argument for commonalities arises from the comparative study of learning. After an extensive search for basic differences in learning abilities across various species Macphail (1987) concluded that all of the problem solving abilities of non-human animals arise directly from a common basic associative process. Furthermore, the common process that he inferred from these comparisons is one that learns the causal links between events, i.e. one that learns what predicts what.
Evolutionary arguments can also be used to oppose the hypothesis of commonalities, however. Tooby and Cosmides (1995) say that the evolutionary perspective entails the functional analysis of niche-differentiated cognitive and neural machinery that is unique to the species: "the human cognitive architecture is far more likely to resemble a confederation of hundreds or thousands of functionally dedicated computers, designed to solve problems endemic to the Pleistocene, than it is to resemble a single general-purpose computer equipped with a small number of general-purpose procedures such as association formation, categorization, or production-rule formation", (Tooby and Cosmides 1995, page 1189). The denial of a small number of general-purpose learning procedures is a crucial part of this perspective. Gallistel (1995) concludes that the catalog of special-purpose learning procedures, such as the ability of birds to learn the position of the celestial pole, could be enlarged indefinitely. (see Note 4)
These arguments do not settle the issue, however. Neither the presence of highly specific abilities nor the absence of a single all-powerful ability implies that there are no abilities that are common to many different species and to many different cortical regions. To rebut the view that classical and operant conditioning are general purpose procedures Gallistel (1995) proposes that they are specialized for the solution of problems in multivariate, non-stationary time series analysis. This enables them to figure out what predicts what. Such a capability may not be all-purpose but it is far less specialized than an ability that can learn only the position of the celestial pole.
Impressive advances in the theory and technology of neural computation since 1980 have greatly encouraged our search for commonalities. This is because they show that powerful multi-purpose capabilities can be implemented in neural systems (e.g. Rumelhart & McClelland 1986; Gluck & Rumelhart 1990; Amit 1989). They suggest that these capabilities are likely to contrast with those of conventional von Neumann computation, and it has often been proposed that this contrast depends upon the use of distributed representations or population codes: "Distributed representations give rise to some powerful and unexpected emergent properties. ... For example, distributed representations are good for content-addressable memory, automatic generalization, and the selection of the rule that best fits the current situation. ... Thus, the contribution that an analysis of distributed representations can make to these higher-level formalisms is to legitimize certain powerful, primitive operations which would otherwise appear to be an appeal to magic", (Hinton et al. 1986, page 79). This viewpoint is important because it suggests that cognition may be based upon computational primitives that are not obvious a priori, but which are of general utility.
This section is concerned with what Marr (1982) calls computational theory. If we are ever to understand how the cortex works then we must understand the work that it does. What that work might be at the level of local cortical circuits is far from obvious. The hypothesis being examined here is that it includes the maximization of coherent variation, i.e. transmitting as much information as possible while keeping it coherently related to what is going on elsewhere, and thus keeping it "meaningful". These studies are designed as simplifying abstractions, not as detailed models of biological systems. Their goal is to make the underlying computational task and strategy clear (Marr 1982; Sejnowski et al. 1988; Phillips 1996). This will make it easier to build and interpret detailed models that embody that strategy, and to design experimental paradigms to determine whether it is used by real biological systems.
If the role of context is to modulate transmission through local processors so as to emphasize coherent outputs but without corrupting the information that is transmitted about the RF input then we need a transfer function with the following properties. If there is no RF input then output should remain at the neutral level; if there is no CF input then the output should be monotonically related to RF input in some standard nonlinear and biologically plausible way; if RF and CF inputs agree then the gain of the function relating output to RF input should be increased; if RF and CF inputs disagree then the gain of the function relating output to RF input should be decreased; CF input should affect the confidence with which decisions are made but only the RF input should determine what decisions are made. Physiological studies to be outlined in Section 4.2. show that neurons do indeed receive two classes of input that differ in approximately the way that this suggests. In addition to the classical forms of excitatory and inhibitory input they also receive inputs, such as those mediated by NMDA receptors, whose effects depend upon the prevailing state of activation and which could therefore fulfill the gain-controlling role of CFs (e.g. Fox et al. 1990). A function, A(r,c), giving the internal activation of probabilistic bipolar (-1,1) units has been derived from these computational and physiological considerations (Kay & Phillips 1994,1996; Phillips et al. 1995b), such that
A(r,c) = 0.5 r (1 + exp(2rc))
where r = summed weighted RF inputs including any bias input, c = summed weighted CF inputs including any bias input. An equivalent activation function can be given for processors that produce binary (0,1) outputs. To compute the output probability in the simulations we apply the standard logistic squashing function to the internal activation, so the transfer function as a whole is composed of the activation function followed by the squashing function. The neutral level is given by an output probability of 0.5. The continuous value transmitted between units is the expected value of outputs with this probability, which ranges from -1 to 1. As Figure 3 shows, this transfer function has the properties required. It is not unique, but it is a clear and simple representative of the limited class of functions with these properties (Kay & Phillips 1994,1996). A natural interpretation of the output given by this transfer function is that it gives the probability of a discrete event such as an action potential.
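The properties required of the transfer function can be checked numerically. The following is a minimal sketch for bipolar units, assuming that the "standard logistic squashing function" takes the form 1/(1 + exp(-A)) and that the transmitted continuous value is 2p - 1; the function names are illustrative only and are not taken from the original simulations.

```python
import math


def activation(r: float, c: float) -> float:
    """Internal activation A(r, c) = 0.5 * r * (1 + exp(2 * r * c))."""
    return 0.5 * r * (1.0 + math.exp(2.0 * r * c))


def output_probability(r: float, c: float) -> float:
    """Logistic squashing of the activation (assumed form 1 / (1 + exp(-A)))."""
    return 1.0 / (1.0 + math.exp(-activation(r, c)))


def expected_output(r: float, c: float) -> float:
    """Expected value of a bipolar (-1, 1) output emitted with probability p."""
    return 2.0 * output_probability(r, c) - 1.0


if __name__ == "__main__":
    cases = [
        ("no RF input", 0.0, 2.0),        # output stays at the neutral level
        ("no CF input", 1.0, 0.0),        # activation reduces to r
        ("RF and CF agree", 1.0, 1.0),    # gain is increased
        ("RF and CF disagree", 1.0, -1.0),  # gain is decreased, sign kept
        ("negative RF, positive CF", -1.0, 1.0),
    ]
    for label, r, c in cases:
        print(f"{label:26s} A={activation(r, c):+.3f} "
              f"p={output_probability(r, c):.3f} "
              f"E[x]={expected_output(r, c):+.3f}")
```

In this sketch the sign of the activation always follows the sign of r, so the CF input changes how confidently a decision is signaled but not which decision is signaled.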
Our computational and empirical studies both emphasize three closely inter-related but distinct forms of neural signaling: relative timing, place, and firing rate. The possibility of using synchronization, or relative timing, to signal grouping using the transfer function just defined follows from the way in which the CFs influence output probability. They increase the probability that outputs from different processors will be produced at the same time if they are mutually predictive, and they reduce this probability if the outputs are opposed. Section 3.4.3 shows that this produces coherent groupings.
Place coding is the transmission of information about different features or variables by different cells. It allows for the possibility that a number of different cells could all transmit information about the same feature. This form of signaling is preserved in the computational studies through the use of outputs from different units to signal different variables. Our emphasis upon self-organization implies that this coding is not fully pre-specified, but may change as the system adapts to its inputs through learning.
The classical form assumed for rate coding is the transmission of information through the firing rate of single cells measured over a time period that is long relative to the duration of individual action potentials. This is one way to transmit information about continuous variables such as the output probabilities that are generated by the above transfer function. It is not the only way, however. Imagine a set of cells that produces action potentials with a probability that is essentially the same for all cells, e.g. some form of neuronal group as proposed by Edelman (1978, 1989). The crudest estimate of that probability is given by sampling a single cell for a single brief interval that is long enough for just one binary output, e.g. 1 or 2 msec. A better estimate can be obtained by sampling many of the cells for this brief interval. If the output probability remains approximately constant for a time of more than 1 or 2 msec then an even better estimate can be obtained by sampling many of the cells over that longer time. Thus in this simple case these measures give different estimates, with varying amounts of precision and bias, of the same underlying quantity. This suggests that much of classical single-unit neurophysiology has been developed so as to exploit situations in which the relevant underlying quantity remains constant for long enough to allow adequate estimates to be obtained by sampling a single cell over longer intervals and by averaging across trials. The success of this enterprise does not imply the absence of sets of cells signaling the same underlying quantity in those situations, nor does it imply the absence of situations where that underlying quantity changes rapidly. The only way to obtain an accurate estimate in the latter case would be by averaging the outputs of a set of cells over a brief interval.
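The argument about sampling can be illustrated with a small simulation. This sketch assumes that cells fire independently with a shared, constant probability in each brief interval (a simple Bernoulli model that is not part of the original studies), and compares the error of estimates based on one cell in one interval, many cells in one interval, and many cells over several intervals.

```python
import random


def estimate_probability(p, n_cells, n_intervals, rng):
    """Fraction of (cell, interval) samples in which a spike occurred."""
    total = n_cells * n_intervals
    spikes = sum(1 for _ in range(total) if rng.random() < p)
    return spikes / total


def rms_error(p, n_cells, n_intervals, n_repeats=500, seed=0):
    """Root-mean-square error of the estimate over repeated samples."""
    rng = random.Random(seed)
    squared = 0.0
    for _ in range(n_repeats):
        estimate = estimate_probability(p, n_cells, n_intervals, rng)
        squared += (estimate - p) ** 2
    return (squared / n_repeats) ** 0.5


if __name__ == "__main__":
    p = 0.3  # hypothetical shared output probability
    for n_cells, n_intervals in [(1, 1), (50, 1), (50, 10)]:
        err = rms_error(p, n_cells, n_intervals)
        print(f"cells={n_cells:3d} intervals={n_intervals:2d} rms error={err:.3f}")
```

Under these assumptions the error falls as more cells, and then more intervals, are pooled, which is the sense in which the different measures estimate the same underlying quantity with different precision.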
Continuous values were transmitted between processors in most of the simulations summarized below. A few simulations have been run in which only binary values were transmitted, however, i.e. single units were used for each output probability and at each iteration of the computation of the short-term dynamics each unit transmitted just a single binary output with that probability. Performance did not seem to be very sensitive to this change, so our working assumption is that high precision in transmitting the output probabilities is not a necessary requirement of the computational approach being developed here.
The goal of maximizing the transmission of coherent information can be specified in a precise but general way by using the concepts of Shannon Entropy, mutual information, and conditional information (Kay & Phillips 1994,1996; Phillips et al. 1995a; b). The Shannon Entropy, H(X), is the average amount of information in any variable X with a given probability distribution. Mutual information, I(X;Y), is a measure of the average amount of information that is shared by the probability distributions of two variables, e.g. X and Y. It is a measure of the extent to which uncertainty about one variable is reduced by observing the other, and it is a commonly used measure of information transmission. If two variables are independent then their mutual information will be zero. Conditional information, H(X|Y), is a measure of how much uncertainty is left about one variable, e.g. X, given that we already know another variable, e.g. Y. From these definitions it follows that H(X) = I(X;Y) + H(X|Y). For a lucid introduction to these concepts see Hamming (1980).
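These definitions can be made concrete for a small discrete example. The sketch below computes the three quantities in bits from a joint distribution over two binary variables and checks the identity H(X) = I(X;Y) + H(X|Y); the joint probabilities are arbitrary illustrative values.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginal distribution of one variable from a joint {(x, y): prob}."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(marginal(joint, 0)) + entropy(marginal(joint, 1)) - entropy(joint)


def conditional_entropy(joint):
    """H(X|Y) = H(X,Y) - H(Y)."""
    return entropy(joint) - entropy(marginal(joint, 1))


if __name__ == "__main__":
    # Illustrative joint distribution: X and Y agree on 80% of occasions.
    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    h_x = entropy(marginal(joint, 0))
    i_xy = mutual_information(joint)
    h_x_given_y = conditional_entropy(joint)
    print(f"H(X)={h_x:.4f}  I(X;Y)={i_xy:.4f}  H(X|Y)={h_x_given_y:.4f}")
    print("identity holds:", abs(h_x - (i_xy + h_x_given_y)) < 1e-12)
```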
Consider a local processor to have input vectors R and C constituting the RF and CF inputs respectively, and to produce an output vector X. The Shannon entropy in X can be decomposed as follows
H(X) = I(X;R;C) + I(X;R|C) + I(X;C|R) + H(X|R,C)
where the first term on the right is a measure of the information that is common to X, R and C; the second is that common to X and R but not to C; the third is that common to X and C but not to R; and the fourth is information in X that is in neither R nor C. Figure 4 illustrates this decomposition in the case where all components are positive.
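The decomposition can be verified numerically for any discrete joint distribution over X, R and C. The following sketch uses a hypothetical toy distribution in which R and C are noisy copies of a common binary source and X is a noisy copy of R; the flip probabilities and function names are assumptions made purely for illustration.

```python
import math
from itertools import product

X, R, C = 0, 1, 2  # positions of the variables in each joint outcome (x, r, c)


def entropy(joint, axes):
    """Entropy in bits of the marginal of `joint` over the given axes."""
    marg = {}
    for key, p in joint.items():
        sub = tuple(key[a] for a in axes)
        marg[sub] = marg.get(sub, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)


def decompose(joint):
    """Return the four components of H(X) and H(X) itself."""
    def h(*axes):
        return entropy(joint, axes)

    i_xr = h(X) + h(R) - h(X, R)
    i_xr_given_c = h(X, C) + h(R, C) - h(X, R, C) - h(C)
    i_xc_given_r = h(X, R) + h(R, C) - h(X, R, C) - h(R)
    h_x_given_rc = h(X, R, C) - h(R, C)
    return {
        "I(X;R;C)": i_xr - i_xr_given_c,
        "I(X;R|C)": i_xr_given_c,
        "I(X;C|R)": i_xc_given_r,
        "H(X|R,C)": h_x_given_rc,
        "H(X)": h(X),
    }


def toy_joint(flip_r=0.1, flip_c=0.1, flip_x=0.05):
    """Hypothetical example: R and C are noisy copies of a source s; X copies R."""
    joint = {}
    for s, r, c, x in product((0, 1), repeat=4):
        p = 0.5
        p *= flip_r if r != s else 1.0 - flip_r
        p *= flip_c if c != s else 1.0 - flip_c
        p *= flip_x if x != r else 1.0 - flip_x
        joint[(x, r, c)] = joint.get((x, r, c), 0.0) + p
    return joint


if __name__ == "__main__":
    parts = decompose(toy_joint())
    for name, value in parts.items():
        print(f"{name:9s} = {value:.4f} bits")
```

Because X depends only on R in this toy case, the component I(X;C|R) is zero and the remaining three components sum to H(X).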
A goal for any local processor, X, can now be specified in terms of these four components. Each processor must adapt on the basis of just the information that is locally accessible to it. We therefore specify how X should adapt taking R and C as givens, but allowing for the possibility that any connections upon which R and C depend may themselves be adapting in the same way. We require X to convey information about major sources of variation in R, and in particular those that are predictably related to C. Data compression, as argued for in Sections 1.1 and 1.2.1, will be ensured by constraining processors to have fewer outputs than inputs.
Discovering major sources of variation in R requires the maximization of the mutual information between the output and the RF input, I(X; R), which consists of two components, i.e. I(X; R; C) + I(X; R | C). Consider first the transmitted information that is common to the RFs and CFs, i.e. I(X; R; C). This is the RF information that is coherently related to the context, and we require the local processor to transmit as much of it as possible. If the RFs and CFs arise from separate data-sets then any information that they share must reflect some common distal influences upon those data-sets, and the more diverse the data-sets the more distal those common influences are likely to be. Maximizing this component is therefore likely to transmit information about variables with relevance to the environment within which the system operates. Variables in the RF input that are unrelated to the context, i.e. I(X; R | C), may also be useful at some later layer of processing or stage of learning, however, so this component could also be increased, though with a lower priority than the information that is meaningfully related to its local context. Information that is shared by X and C but not by R, i.e. I(X; C | R), should be decreased because the role of X is to transmit information about R, but not about C. Finally, H(X | R, C) denotes variation in X that is due to neither R nor C, i.e. intrinsic noise that is added by the processor to its output. We would normally wish this component to be reduced.
We can now formulate the general class of objective functions
F = a0 I(X;R;C) + a1 I(X;R|C) + a2 I(X;C|R) + a3 H(X|R,C)
where F is the objective to be maximized, and a0 to a3 are parameters in the range -1 to 1. These parameters weight the various components of the transmitted information, H(X), with positive values for the components that we wish to increase and negative values for components that we wish to actively decrease. Different objectives can therefore be given as different values of these parameters. The goal of maximizing information transmission within streams requires an objective function F = I(X;R), which is given by setting a0 = a1 = 1 and a2 = a3 = 0. This is the goal studied by Linsker (1988) and many others, and he calls it Infomax. The goal of maximizing the transmission of information that is predictably related to the context requires an objective function F = I(X;R;C). This is given by setting a0 = 1 and a1 = a2 = a3 = 0. We call this goal Coherent Infomax. It is equivalent to that intended by Becker and Hinton (1992) but has a different form because we make explicit the requirement to maximize the transmission of RF information. (see Note 5)
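The two special cases can be written out explicitly. The following worked equations simply substitute the stated parameter values into F and use the decomposition I(X;R) = I(X;R;C) + I(X;R|C) given earlier; the final equality for Coherent Infomax follows from the symmetry of the three-way term in R and C.

```latex
\begin{align*}
F &= a_0\, I(X;R;C) + a_1\, I(X;R \mid C) + a_2\, I(X;C \mid R) + a_3\, H(X \mid R,C)
  && \text{(general form)} \\
F &= I(X;R;C) + I(X;R \mid C) = I(X;R)
  && \text{(Infomax: } a_0 = a_1 = 1,\ a_2 = a_3 = 0\text{)} \\
F &= I(X;R;C) = I(X;R) - I(X;R \mid C) = I(X;C) - I(X;C \mid R)
  && \text{(Coherent Infomax: } a_0 = 1,\ a_1 = a_2 = a_3 = 0\text{)}
\end{align*}
```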
Learning rules can be derived by performing gradient ascent on the objective function, F, relative to the strengths of the RF and CF connections (Kay & Phillips 1994,1996; Phillips et al. 1995b). The dependence of change in connection strength upon post-synaptic activity as specified by these rules is shown in Figure 5 (from Smyth, 1994). The learning rules have this same general form for the RFs and for the CFs. The change in synaptic strength is proportional to pre-synaptic activity, but it is non-monotonically related to post-synaptic activity. The non-monotonicity required is similar to the computationally powerful BCM learning rule proposed by Bienenstock et al. (1982), and to a simpler version, the ABS rule, that has been shown to have biological plausibility (Artola et al. 1990; Hancock et al. 1991a;b). Other learning rules have also been developed within this general approach, including one that maximizes the covariance between the integrated RF and CF inputs (Smyth & Der, 1995; Der & Smyth 1996). This latter rule is easier to implement than that shown in Figure 5, but has the same overall form of dependence on the level of post-synaptic activity, which it shares with the BCM and ABS rules.
A major feature of the learning rule shown in Figure 5 is the threshold of post-synaptic activity below which connection strengths are decreased and above which they are increased. In the rules derived by Kay and Phillips this depends upon three specific dynamic conditional averages of prior activity, i.e. the average prior output probability of the unit taken over all RF and CF inputs; the average output probability for the current RF input taken over all CF inputs; and the average output probability for the current CF input taken over all RF inputs. (see Note 6) Only the first of these three is used by the BCM rule, and its role is to make it harder to increase the strengths of connections to units that have already been too frequently active, and vice versa.
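The role of a threshold that depends on prior activity can be illustrated with a minimal Python sketch of a BCM-style update. This is not the rule derived by Kay and Phillips, whose threshold uses three conditional averages. Here the threshold tracks only a running average of prior output, and the function name, learning rate and decay constant are hypothetical choices.

```python
def bcm_style_step(w, pre, post, theta, rate=0.1, decay=0.9):
    """One BCM-style update (illustrative; all constants are assumptions).

    w     : current connection strength
    pre   : pre-synaptic activity
    post  : post-synaptic activity, e.g. an output probability
    theta : modification threshold, a running average of past output
    """
    # The change is proportional to pre-synaptic activity and changes sign
    # where post-synaptic activity crosses the threshold.
    dw = rate * pre * post * (post - theta)
    # The threshold slides towards recent output, so a unit that has been
    # frequently active becomes harder to strengthen further.
    new_theta = decay * theta + (1.0 - decay) * post
    return w + dw, new_theta
```

Calling the function repeatedly with high post-synaptic activity raises theta, so the same level of activity produces progressively smaller increases in strength.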
A simulation performed by Darragh Smyth at Stirling shows how context can guide RF learning. Two streams of single-unit processors were linked by CFs, and as a consequence they were able to discover variables that were correlated across streams (Figure 6). Pairs of vertical or horizontal bars were presented that were both vertical on 70% of occasions and both horizontal on 30% at random. The bars were bright on dark or vice versa at random. The signs of the vertical bars were uncorrelated across streams, but the signs of the horizontal bars were perfectly correlated across streams. The sign of the vertical bar therefore carries more information within streams, but the sign of the horizontal bar is more relevant to the correlation across streams.
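The statistics of this input ensemble can be made explicit with a short Python sketch. It does not reproduce the simulation itself. It only enumerates the stimulus probabilities stated above and computes, for each orientation, how many bits per trial the sign contributes within one stream and how many of those bits are shared across the two streams. The coding of sign as +1 or -1 and all function names are assumptions.

```python
import itertools
import math

P_ORIENTATION = {"vertical": 0.7, "horizontal": 0.3}
SIGNS = (+1, -1)  # assumed coding: +1 bright on dark, -1 dark on bright

def joint_distribution():
    """Exact probabilities of (orientation, sign in stream 1, sign in stream 2)."""
    p = {}
    for orientation, p_o in P_ORIENTATION.items():
        for s1, s2 in itertools.product(SIGNS, repeat=2):
            if orientation == "vertical":
                # Signs of vertical bars are independent across streams.
                p[(orientation, s1, s2)] = p_o * 0.25
            else:
                # Signs of horizontal bars are identical across streams.
                p[(orientation, s1, s2)] = p_o * 0.5 if s1 == s2 else 0.0
    return p

def entropy_bits(probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)

def sign_information(orientation):
    """Return (within-stream, across-stream) sign information in bits per trial."""
    p_o = P_ORIENTATION[orientation]
    cond = {(s1, s2): q / p_o
            for (o, s1, s2), q in joint_distribution().items()
            if o == orientation}
    stream1 = [sum(q for (s1, _), q in cond.items() if s1 == s) for s in SIGNS]
    stream2 = [sum(q for (_, s2), q in cond.items() if s2 == s) for s in SIGNS]
    # Sign entropy in one stream, weighted by how often this orientation occurs.
    within = p_o * entropy_bits(stream1)
    # Mutual information between the two signs, weighted in the same way.
    across = p_o * (entropy_bits(stream1) + entropy_bits(stream2)
                    - entropy_bits(cond.values()))
    return within, across

for name in P_ORIENTATION:
    print(name, sign_information(name))
```

Under these assumptions the sign of the vertical bar contributes more within a stream simply because vertical bars occur more often, whereas only the sign of the horizontal bar contributes anything that is shared across streams.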
The course of learning and the receptive fields found after learning are shown in Figure 7. When the goal of learning was specified to be the maximization of information transmission within streams, both local processors became sensitive to the sign of the vertical bar. When the goal of learning was specified to be the maximization of coherence across streams, both local processors became sensitive to the sign of the horizontal bar. The CF connection strengths were then also learned correctly, thus embodying the cross-stream predictions.
Other simulations show that these networks have a rich array of possible behaviors depending upon the goal specified, the activation function used, the learning rate, the starting weights, and the correlations within and between the RF and CF inputs. When the goal is to maximize the RF input that is predictably related to the context, then this is done irrespective of whether that maximizes transmission of information about the RF. When the goal is to maximize transmission of information about the RF, then that is achieved irrespective of contextual predictability. Transition between these two goals is specified by varying a single parameter, a1, from 0 to 1.
The ability to discover the relevant RF variables and to ignore the irrelevant has been shown i) when the relevant variables are the most informative within streams; ii) when they are not the most informative within streams; and iii) even when there is no evidence within streams as to the existence of these particular variables (Kay & Phillips 1994, 1996; Smyth et al. 1994; Smyth 1994; Phillips et al. 1995b). These abilities have been shown within a variety of network architectures including i) networks with multiple streams and with contextual connections between streams; ii) multi-stream networks with two layers of processing and with contextual connections within streams and from higher to lower layers; iii) multi-stream networks in which the RFs of different streams overlap with each other; and iv) multi-stream networks with contextual connections between neighboring streams only (Kay & Phillips 1994, 1996; Smyth et al. 1994; Smyth 1994; Phillips et al. 1995b). In all cases the correct contextual predictions are learned together with the discovery of the RF features that they relate. The networks learn faster with more streams, and they are sensitive to small correlations between streams (Phillips et al. 1995b).
In the example shown in Figures 6 and 7 the features that were correlated across streams were the same as each other, i.e. the sign of a horizontal bar, and in the case of visual input the different streams are most easily thought of as arising from different places in the image. Neither of these aspects is crucial, however. The mutually predictable signals could arise from different features in different streams, from the same cues to a common underlying variable at different places in the image, or from different cues to a common underlying variable at the same place in the image, and they can also include cross-modal contextual input.
The approach has been developed to include local processors with multiple output units, as shown in Figure 8 (Kay et al. 1996; Floreano et al. 1995). Different units within a processor adapt their RF weights so as to transmit information about different variables, thus increasing the amount of information that each processor can transmit. These networks thus have four levels of organization: units, processors, streams, and layers. Local codes were produced for the relevant variables in the simulation shown in Figures 6 and 7, where there was little room for any other form of coding. When more than one variable is relevant within streams, and when multi-unit processors are used, a greater range of possible codings exists (Kay et al. 1996; Floreano et al. 1995). The codes produced vary but are not reliably related to the input variables in any simple way. Simple input variables are sometimes signaled by single units, and sometimes they are not. In short, single units within multi-unit processors do develop selective tuning functions, but these are not in general related to the input in an intuitively obvious way, and they vary across streams and different instances of the same network architecture and input pattern set. What matters is that the relevant information be transmitted. How that information is distributed across the available output signals is not crucial. In relation to the study of receptive field selectivity in cortical neurons this suggests that the information conveyed by a population of cells may be more important than the exact way in which it is distributed across the individual cells, and this is consistent with the rich variation in detailed selectivity that is often observed in cortical neurons.
To show how contextual connections can produce coherent grouping, Dario Floreano has simulated large arrays of multi-unit processors, and has studied the effects of the CF input on the short-term processing dynamics. In the first simulation 25 streams of four-unit processors were arranged as a 5 x 5 array, with each stream receiving RF input from a 3 x 3 array of units. All four units in each stream received contextual input from all units in their neighboring streams (Figure 8). The training input was collinear horizontal, vertical and diagonal bars displayed upon the 15 x 15 input array. The learning algorithm was set to maximize coherence across streams (i.e. Coherent Infomax, Section 3.2). Learning was found to scale up successfully to this case, with all streams tending to discover the relevant input variables at around the same time (Floreano et al. 1995).
The influence of the CFs after learning can be seen in Figure 9, which shows the effects that are produced on one stream of processing by supportive or opposing activity in other streams. The output probabilities for the four units of the processing stream at the centre of the 5 x 5 array when presented with a horizontal bar on its 3 x 3 receptive field are shown for four cases: i. when the RF input is strong; ii. when the RF input is weak and there is no contextual input; iii. when the RF input is weak and there is supportive contextual input; and iv. when the RF input is weak and there is opposing contextual input. Note that the only way that the context can influence output in this architecture is via the contextual connections. The results show that these contextual inputs increase the probability of outputs that are coherently related to that context and decrease the probability of opposing outputs. These effects are produced rapidly, within just one or two iterations of the computations that update the outputs.
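The four cases can be illustrated for a single output unit with a minimal Python sketch. It assumes an activation function of the form A(r, c) = 0.5 r [1 + exp(2rc)], where r is the integrated RF input and c the integrated CF input, passed through a logistic function to give an output probability. The exact function used in the simulations may differ, and the numerical inputs below are hypothetical values chosen only to show the qualitative pattern.

```python
import math

def activation(r, c):
    """Assumed activation: the CF input c rescales the RF input r, never replaces it."""
    return 0.5 * r * (1.0 + math.exp(2.0 * r * c))

def output_probability(r, c):
    """Logistic function of the activation."""
    return 1.0 / (1.0 + math.exp(-activation(r, c)))

# Hypothetical (r, c) values for the four cases described above.
cases = {
    "i. strong RF, no context": (3.0, 0.0),
    "ii. weak RF, no context": (0.5, 0.0),
    "iii. weak RF, supportive context": (0.5, 2.0),
    "iv. weak RF, opposing context": (0.5, -2.0),
}
for label, (r, c) in cases.items():
    print(f"{label}: {output_probability(r, c):.3f}")
```

With this form, supportive context raises the output probability and opposing context lowers it, but the sign of the activation always follows the RF input, and context alone (r = 0) leaves the output at its neutral value of 0.5.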
The second simulation is analogous to demonstrations such as the Rubin vase and the Necker cube, where two different perceptual organizations are possible but only one occurs at a time. Such phenomena might reflect the effects of mutual contextual guidance in cases where the input provides evidence for both of two internally coherent feature sets that are mutually incompatible. To study the short-term processing dynamics in such a case, a net with 100 single-unit streams was simulated. For the sake of this demonstration, two sub-sets of nine units were specified such that each unit had positive CF input from all other units within the same sub-set and negative CF input from all of the units in the other sub-set. All streams received inputs that varied randomly across iterations with the constraint that inputs to both sub-sets of nine units were positive.
The output probabilities produced at each iteration of the computations updating the activity of the network are shown in Figure 10. As these iterations involve just one synaptic delay, each iteration corresponds to just a few msec. A coherent sub-set of features emerges from the background within 3 or 4 iterations, but only one of the two alternative organizations emerges at any one instant. Within a cooperative sub-set all outputs emerge from the background simultaneously, and they are less affected by random variation in their inputs than are the responses to the background. These effects are similar to the retrieval of memories in recurrent auto-associative attractor networks (e.g. Hopfield 1982; Amit 1989), but with the important difference that contextual connections just organize the input data into coherent sub-sets without adding features for which there is no evidence in the input. The short-term dynamics of nets with contextual guidance and a feedforward RF connectivity is therefore constrained to remain close to the input. (see Note 7)
The simulations described above used very simple nets, but models of large and complex nets that combine contextual integration with many biological details show that they can preserve the capabilities required of the short-term dynamics. (see Note 8) As far as the contextual guidance of learning is concerned, theoretical considerations and our simulations suggest that this may become easier rather than harder in larger systems, because more streams can provide better guidance.
We have so far had only limited success in using the learning rule outlined in Section 3.3 to discover arbitrary non-linear functions, such as the exclusive-or, in nets with two layers of feedforward RF weights. Nets with feedback contextual guidance from higher to lower layers can sometimes discover such functions, and they do so more often with more streams of processing (Phillips et al. 1995b). They do not solve this problem reliably, however, and one reason for this is that when units in higher layers compute non-linear functions then the feedback predictions conflict with what the units in the intermediate layers are able to compute. Others have shown, however, that learning by maximizing coherence across streams can discover useful higher-order functions when applied to real-world problems (de Sa 1994a,b; Stone & Bray 1995; Stone 1996; see Becker 1996 for a review and further applications). These algorithms do seem more limited in what they can learn than supervised algorithms such as error back-propagation, but, as we will argue in Section 6.4, that does not necessarily make them less plausible as analogies to self-organization in the cortex.
Important outcomes of these computational studies are as follows. (i) The goal of maximizing the transmission of contextually relevant information can be specified precisely within the framework of information theory. This can be done despite the antithesis between information and meaning that is described so clearly by Hamming (1980, p. 103), and which has limited the usefulness of information theory to psychology and neurobiology (Horgan 1995). The approach being developed here may therefore help extend the application of information theory to brain function beyond the sensory systems. (ii) Feature discovery and associative learning can cooperate in such a way as to discover variables that are predictably related across diverse data-sets without needing a supervisor that already knows about those variables. (iii) It is possible for the output of a local processor to be affected by contextual input while still transmitting unambiguous information about the RF input. (iv) The form of learning derived analytically from the information-theoretic goals adds further support to the hypothesis that changes in synaptic strength depend non-monotonically upon post-synaptic activity in approximately the way proposed for the BCM and ABS rules (Bienenstock et al. 1982; Artola et al. 1990; Hancock et al. 1991a,b).
Here we outline evidence for context-dependent synchronization of activity in the cortex, for cortico-cortical contextual connections that are involved in this synchronization, and for plasticity of the receptive field and contextual field connections. For more detailed reviews see Singer (1990, 1993, 1994, 1995) and Singer and Gray (1995).
Intracolumnar interactions are shown by simultaneously recording the activity of cells within a small region of cortex. Synchronization of neighboring cells (<200 µm apart) has been observed in many different species and cortical regions of awake and anaesthetized animals, and can be observed in the local field potential (LFP) as well as in the multi-unit and paired single-unit recordings (e.g. Toyama et al. 1981; Michalski et al. 1983; Ts'o et al. 1986; Gray & Singer 1989; Kreiter & Singer 1992; Gray & Viana Di Prisco 1993). Synchronization of neighboring cells with overlapping RFs and feature selectivity sometimes reflects common thalamic input, but is more often characterized by dynamic properties that can only be accounted for by reciprocal interactions via local intracortical connections. Overall, the evidence suggests that the activity of local groups of cells is often closely synchronized.
Intercolumnar interactions are shown by simultaneously recording the activity of cells in different parts of the cortex, and synchronization has been observed between cells that are far apart (e.g. >2 mm). In that case it occurs predominantly between cells with similar receptive field selectivity, and it reduces with distance (e.g. Michalski et al. 1983; Ts'o et al. 1986; Gray et al. 1989; Schwartz & Bolz 1991). Its occurrence within and between visual areas depends upon whether the cells being observed are stimulated by single or separate objects. For example, synchronization is strong when two cells in V1 with non-overlapping RFs but collinear preferred orientations are stimulated by a single long bar moving across their RFs (Gray et al. 1989). It is weaker when they are stimulated by two short bars moving in the same direction, and it is abolished altogether when the two short bars move in opposite directions. These and many other results support the view that the synchronization of distributed activity in the visual system implements the well-established Gestalt principles of perceptual grouping.
The prediction that cells can be part of different groupings at different times depending upon the stimulating conditions has been tested in the primary visual cortex of the cat (Engel et al. 1991) and of the awake behaving monkey (Kreiter & Singer 1994). These experiments show that when two cells with different orientation and direction preferences are stimulated by a single moving bar that is sub-optimal for both then they synchronize, but when they are stimulated by two separate bars, each being optimal for one of the cells, then they do not. Synchronization occurs within the secondary visual area MT of the awake behaving monkey, and depends upon whether the cells are activated by a single common stimulus or by two different stimuli (Kreiter & Singer 1996). Synchronization has also been observed within and between a variety of other cortical regions, including olfactory, somatosensory, and motor regions, as well as across hemispheres (Singer & Gray, 1995).
The specific thalamic afferents to primary visual cortex, V1, provide examples of RF inputs, and the excitatory long-range horizontal collaterals connecting pyramidal cells in V1 with non-overlapping RFs provide an anatomical basis for the CF inputs (Gilbert 1995). Long-range horizontal connections are common in V1 (Rockland & Lund 1983; Gilbert & Wiesel 1989; McGuire et al. 1991) and in other cortical regions (Gilbert 1992), and these connections have a synchronizing action (Lowel & Singer 1992; Konig et al. 1993). It has also been shown that interhemispheric connections have a specific role in synchronizing activity (Engel et al. 1991a). The descending connections from higher stages may also include signals that have a synchronizing role. Such connections are ubiquitous but do not seem to play a primary role in driving the cells to which they project. In accordance with this suggestion it has been found that the activity of cells at different stages of visual processing can be synchronized (Engel et al. 1991b), and that cells at later stages of processing in the visual system can synchronize the activity of relevant sub-sets of cells at earlier stages of processing (Bullier et al. 1992; Sillito et al. 1994).
The anatomical and physiological evidence therefore suggests that the contextual connections within and between regions of the visual cortex are organized as shown in Figure 11. These connections are not distinguished just by their source, but also by the effect that they have on the processors to which they project, because they have a modulatory rather than a primary driving role. One way in which they could fulfill this role is through voltage-gated receptors. Synaptic receptors that are both ligand- and voltage-gated have become known as NMDA receptors, and they are widely distributed on pyramidal cells throughout the cortex. These receptors provide a mechanism for voltage dependence because there is a magnesium block on them that is reduced by depolarizing the cell (e.g. Ascher et al. 1988). These channels therefore contribute more effectively to further depolarization when the cell is already partially depolarized, and so they provide a mechanism for gain control. Fox et al. (1990) show that cells in cat visual cortex have one class of receptor channel that provides the primary drive and summates linearly, and a second class that provides amplifying gain control (see Fox and Daw (1992) for computational studies of possible mechanisms for these effects). If the long-range horizontal collaterals do provide synchronizing contextual input as hypothesized here then their synaptic inputs should be predominantly modulatory rather than driving. The available evidence suggests that this is so (e.g. Hirsch & Gilbert 1991). Furthermore, if these long-range intra-regional connections did contribute to the structure of the receptive field proper, then the receptive fields of cortical neurons would be much larger and more broadly tuned than they actually are.
Note that the hypotheses developed here do not imply that all voltage-dependent channels mediate CF rather than RF inputs. If any such distinction is relevant to cortex it is more likely that RF inputs produce strong activation of both voltage-dependent and non-voltage-dependent channels (Armstrong-James et al. 1993), whereas CF inputs produce strong activation only of voltage-dependent channels. Note also that the absolute division of inputs into either RFs or CFs is a simplifying idealization. In cortex individual inputs may contribute to both roles but to varying degrees. Furthermore, there is evidence that although the long-range horizontal input is usually modulatory it can become more effective in generating spiking activity itself when the primary RF input is removed for many weeks (Das & Gilbert 1995).
It is now well established that the activity-dependent self-organization of synaptic connections could provide a substrate for learning in the cortex (Singer 1987, 1990). This is likely to involve long-term potentiation (LTP) and long-term depression (LTD), as well as control by global gating systems, and it applies to mature as well as to developing cortex (Singer & Artola 1994).
The learning rules formally derived from the information-theoretic objectives in Section 3 require synaptic strength on active inputs to remain unchanged when post-synaptic activity is very low, to decrease when it is at intermediate levels, and to increase when it is high. The plasticity observed in slices of adult rat neocortex by Artola et al. (1990) supports these three specific predictions. Furthermore, it has been shown that much of the data on activity-dependent self-organization in the visual cortex can be explained by the BCM learning rule (Clothiaux et al. 1991), which makes the same three predictions.
Given the contextual input to local processors the possibility arises for this input to affect RF learning. Indeed, it is unlikely to have no effect, and we have argued above that it could have effects with far-reaching computational consequences. We know of no empirical studies explicitly designed to explore these possibilities, but results reported by Gilbert and Wiesel (1990) may be relevant. Their studies were mostly concerned with the effects of concurrent context upon the response to RF stimulation. Such effects could therefore be due to context modulating post-synaptic activity, but without having any effects upon the strengths of the synapses that carry RF input. However, it was also observed that, under some conditions, prior contextual stimulation altered the orientation tuning function that was later obtained in response to RF stimulation alone. This suggests that the prior contextual stimulation played a role in changing the strengths of RF synapses. As the change could also have been due to other adaptation effects, studies using a modification of their paradigm to address this issue more directly would be worthwhile.
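The three regimes that the learning rules require of synaptic change (no change at very low post-synaptic activity, a decrease at intermediate levels, and an increase at high levels) can be written as a minimal piecewise sketch in Python. The two threshold values, the learning rate, and the function name are hypothetical. The step-like form is a simplification that shows only the sign of the change, not the smooth curve of Figure 5.

```python
def abs_style_change(pre, post, theta_minus=0.3, theta_plus=0.6, rate=0.1):
    """Sign-only sketch of an ABS-style rule (all constants are assumptions).

    pre  : pre-synaptic activity (only active inputs are modified)
    post : post-synaptic activity, scaled here to lie between 0 and 1
    """
    if pre <= 0.0 or post < theta_minus:
        return 0.0              # very low post-synaptic activity: unchanged
    if post < theta_plus:
        return -rate * pre      # intermediate activity: strength decreases
    return rate * pre           # high activity: strength increases

for post in (0.1, 0.45, 0.9):
    print(post, abs_style_change(pre=1.0, post=post))
```

In a BCM-style variant the upper threshold would not be fixed but would move with the prior activity of the unit.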
Although the basic organization of the CF connections could be genetically specified, shaping by experience is also necessary because RF feature selectivity depends upon experience. In keeping with this it has been shown that long-range horizontal collaterals undergo activity-dependent changes in synaptic strength (Lowel & Singer 1992; Hirsch & Gilbert 1993). Furthermore, there is also evidence that the selection of these connections follows a correlation rule establishing preferential coupling between cells exhibiting correlated activity (Lowel & Singer 1992), in agreement with the predictive role proposed here for the CF connections. These ensemble-forming connections remain susceptible to use-dependent modifications in the adult (Singer & Artola 1994). Indeed, most of the experiments demonstrating LTP and LTD in neocortical synapses have been performed on cortico-cortical connections terminating on pyramidal cells in layers II/III or V (Artola & Singer 1993; Singer 1995), so they could predominantly reflect the plasticity of CF connections.
This section shows how our hypotheses concerning local contextual integration can be tested and developed by behavioral methods, including cases where these are combined with physiological methods. We will argue: i) that studies of the detection and grouping of simple stimulus elements provide behavioral evidence for contextual integration of the kind that we propose; ii) that it is reasonable to search for such processes at the higher levels of cognition, such as the perception of objects and words, because they are implemented by mechanisms that are widely distributed and of general utility; and iii) that there is already theoretical and empirical support for the view that contextual integration at the level of local processors is relevant to these higher levels of cognition.
Our focus is on the visual perception of simple line element displays and on the perception of words. Theories using contextual integration and synchronization have been applied to a wide variety of other tasks with psychological relevance, however; e.g. the cocktail-party problem (von der Malsburg & Schneider 1986); perceptual grouping within and between multiple visual feature domains (Wang et al. 1990; Eckhorn et al. 1991a; Schillen & Konig 1994; Tononi et al. 1992b); form from motion and motion capture (Tononi et al. 1992b); object recognition (Tononi et al. 1992b; Neven & Aertsen 1992; Hummel & Biederman 1992); selective attention and scene perception (Goebel 1993); the binding of events across widely distributed cortical zones (Damasio 1989); reasoning (Shastri & Ajjanagadde 1993); and consciousness (Crick & Koch 1990). Although these theories differ from each other in detail they all suggest ways in which contextual integration at the level of local circuits can produce useful cognitive capabilities.
If synchronization is relevant to behavior then stimuli should be more perceptible if they produce synchronized activity, and conditions that reduce synchronization should impair perception. These predictions have been tested by comparing the effects of induced strabismus, i.e. squint, in cats on both synchronization and behavior. In strabismic cats neurons driven by different eyes lose the long-range horizontal intracortical connections that initially connect them (Lowel & Singer 1992). As a consequence these neurons do not synchronize, and the cats cannot fuse images from the two eyes (Konig et al. 1993). Furthermore, strabismus often leads to impaired perception in one eye, i.e. amblyopia, but the cortical activity evoked by input to that eye has so far seemed to be normal. The discrimination of gratings by cats using either their normal or their amblyopic eye has now been shown to be closely related to the extent to which the gratings produce synchronized activity (Roelfsema et al. 1994a,b). Both discrimination and synchronization are reduced in the amblyopic eye, and in both cases this reduction is greater at higher spatial frequencies. Recordings from the primary visual cortex of awake strabismic cats show that the amount of synchronization is directly related to perception and motor control. Under conditions of stimulation that lead to binocular rivalry, neurons connected to the eye that dominates perception and oculomotor response show increased synchronization and neurons connected to the suppressed eye show decreased synchronization (Fries et al. submitted). Changes in perceptual dominance are unrelated to changes in firing rate, however. These results show that response selection is closely related to synchronization, and they thus support the view that internal grouping through synchronization on a fine time scale is important for the selection of perceptually or behaviorally relevant signals.
Psychophysical evidence for dynamic grouping through a network of local linking connections between feature detectors is reported by Field et al. (1993). Subjects were shown arrays of 256 oriented band-pass line elements (Gabor patches) and had to detect a path of 12 elements with gradually changing orientation that was embedded within the random background. They found that performance was impaired by increases in the distance between the line elements and with the deviation of their orientations from collinearity, but it did not depend upon their relative "phases" (i.e. black on white or vice versa). They conclude that the ability to detect such paths is due to local "association" fields that link feature detectors in an organized way that depends upon their relative RF selectivities (Figure 12). Their results show that these connections link feature detectors over distances that are large compared with the sizes of their receptive fields, and do so in such a way as to implement Gestalt grouping principles of proximity and continuity.
The dynamic grouping observed by Field et al. (1993) supports the hypothesis of locally specific processes of integration, and the "association" fields that they infer from their findings are much the same as the CFs that we propose. Field et al. (1993) note the similarities between the conditions under which they find good perceptual grouping and the conditions producing synchronization in V1. They also note that these conditions seem well matched to those determining the extent to which pyramidal cells are linked by long-range horizontal collaterals. These similarities are further strengthened by detailed comparisons between perceptual grouping criteria and the anatomy of the tangential intracortical connections which show them to be closely matched (Schmidt et al. 1996).
Finally, Field et al. (1993) argue that detection of the path in their studies must have been mediated by the grouped activity of the set of detectors activated, rather than by the activity of a single high-level detector, because a new path was formed randomly on each trial, and because the bandpass nature of the stimuli precluded their detection by cells with classically defined RFs covering the whole of the path. This argument is of relevance to the issue of the relative roles of local and distributed codes to which we will return below.
The psychophysical studies just discussed provide evidence for local contextual fields in the grouping of easily detected stimuli. If such fields exist then they could also mediate locally specific effects of context on the detection of faint or ambiguous stimuli. Several psychophysical experiments indicate that this is so (Polat & Sagi 1993, 1994a,b). Such studies show that targets are surrounded by a small region within which additional stimuli suppress target detection, and then by a larger region within which they facilitate detection provided that they are coherently related to the target, such as by being collinear or nearly collinear. They thus suggest that within local streams of processing inhibitory mechanisms force a choice between alternative features, whereas between streams contextual interactions facilitate the detection of coherent features.
A detailed comparison of psychophysical and physiological evidence for such facilitatory effects of context on the detection of target elements is reported by Kapadia et al. (1995). They show that human visual contrast sensitivity is improved by a neighboring suprathreshold line element in a way that is reduced by increases in their spatial separation and with the deviation of their orientation selectivities from collinearity. Using equivalent stimuli in electrophysiological studies Kapadia et al. (1995) also show that the response of superficial layer complex cells to low contrast stimuli in V1 of awake attending Rhesus monkeys, as measured by summing spikes over 200 msec, depends upon the local relations between target and context in a way that is very similar to that seen in the psychophysical experiments. The physiological experiments also showed that these effects were not due to the context encroaching within the RF of the recorded cell, but were due to modulatory interactions between cells with non-overlapping RFs.
The contextual conditions producing enhanced detection of low contrast stimuli in both the psychophysical and the electrophysiological studies of Kapadia et al. (1995) are very similar to those producing grouping of high contrast stimuli in the psychophysical studies of Field et al. (1993) and synchronization in electrophysiological studies (e.g. Singer & Gray 1995). All four sets of findings show similar effects of spatial separation, spatial frequency, orientation, and collinearity. They may therefore be of great importance in reflecting common underlying mechanisms for local contextual integration. This view is further strengthened by their close match to the anatomy of long-range horizontal collaterals (e.g. Schmidt et al. 1996). If such mechanisms do indeed exist then they will provide an obvious candidate for explaining many other locally specific effects of context. The hypothesis that grouping and contextual facilitation of element perception involve common processes will be discussed further below in relation to theoretical and experimental studies of contextual integration in word perception.
One of our main hypotheses is that context can modulate the transmission of information about something other than the context. Thus, if some input variable is used only for contextual guidance then no explicit information will be transmitted about that variable in its own right.
A dramatic demonstration of this possibility comes from two neuropsychological patients who no longer see the world in color but only in black and white. The first, HJA, has been studied for many years (Humphreys & Riddoch 1987), but it has now been discovered that color does have implicit influences on his detection of luminance contrasts (Humphreys et al. 1992). In one task he had to say whether the top and bottom halves of a rectangular display differed in the second or the first of two intervals. Sometimes the two halves differed only in luminance, sometimes only in color, and sometimes in both. When the color difference was presented without any luminance difference then performance was at chance. This shows his achromotopsia. When the two halves differed only in luminance, performance improved as luminance contrast increased. When a color difference was added, performance improved more rapidly than it did with the luminance contrasts on their own. The second patient, WM, was also tested on these tasks, and was compared in detail with HJA on a variety of other tasks. He has a different form of achromotopsia, but shows the same kind of implicit modulatory effects of color differences upon the detection of luminance differences (Troscianko et al. 1993; Troscianko et al. 1996).
A simple interpretation of these results is that, in these patients, color streams continue to modulate luminance streams, but with their own feedforward outputs no longer functioning properly. Color differences can therefore still influence the detection of luminance differences, but without themselves being perceived. Color differences may influence the detection of luminance differences because they are highly correlated in natural visual images, and this is reflected in the connectivity between color and luminance channels. The correlations are not used to conflate the two variables, however, but to provide contextual guidance. Further evidence for facilitatory effects of color can also be found in psychophysical studies of normal subjects (Gur & Akri 1992; Troscianko 1994). The above findings therefore illustrate a key difference between RFs and CFs. Outputs are "seen" by later stages of processing not as conveying information about the CF input, in this case color, but as conveying information about the RF input, in this case luminance. The introspective reports of the two patients described above suggest that in this case those later stages involve conscious awareness.
To show how analogous effects can be sought using normal subjects we outline experiments that are currently being run at Stirling by Craven et al. on the interaction of target and contextual cues to texture segregation. A 20 x 20 array of small line elements, divided into two halves differing in mean line length, is displayed for 1 sec, and subjects decide whether the display is divided into long and short elements by a vertical or a horizontal boundary. Contextual input is provided by also dividing the array into two halves that differ in the mean orientation of the elements. This boundary is coincident with the length boundary on 70% of trials and is orthogonal to it on 30%. If modulatory interactions are occurring then: i) the effect of context will increase as target strength increases; ii) this will occur just for a low range of target strengths; and, iii) this range will be at higher values of target strength for weak than for strong contexts (Smyth et al. 1996; see Note 9). We also predict that the perception of the target boundary will be facilitated by a context with which it is coherent. Results obtained so far support these predictions, and include modulatory effects of weak contextual cues that have no direct influence on response themselves.
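The three predictions above can be illustrated with a small numerical sketch. It assumes, purely for illustration, a modulatory activation of the form A(r, c) = 0.5 r (1 + exp(2 r c)) passed through a logistic output function, where r is target (RF) strength and c is the strength of a coherent context (CF). The function names, parameter values, and the choice of this particular activation are assumptions of the sketch, not details given in this section.

```python
import math


def activation(r: float, c: float) -> float:
    """Assumed modulatory activation: r drives the unit, c only rescales r's effect."""
    return 0.5 * r * (1.0 + math.exp(2.0 * r * c))


def p_detect(r: float, c: float) -> float:
    """Logistic output probability for target strength r and context strength c."""
    return 1.0 / (1.0 + math.exp(-activation(r, c)))


def context_effect(r: float, c: float) -> float:
    """Gain in output probability from adding context c to a target of strength r."""
    return p_detect(r, c) - p_detect(r, 0.0)


def peak_target_strength(c: float, r_max: float = 4.0, steps: int = 400) -> float:
    """Target strength (on a grid from 0 to r_max) at which the context effect is largest."""
    grid = [r_max * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda r: context_effect(r, c))


if __name__ == "__main__":
    for c in (1.0, 0.2):  # hypothetical strong and weak contexts
        print(f"context {c}: effect is largest near r = {peak_target_strength(c):.2f}")
```

Under these assumptions the context has no effect when the target is absent (r = 0), its effect grows with target strength over a low range and then shrinks as the output saturates, and the range of greatest effect lies at higher target strengths for the weaker context.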
Until the early 1980s plasticity in sensory systems was widely thought to be restricted to a critical period during development. There is now abundant anatomical, physiological, and psychophysical evidence for such plasticity in adults. Plasticity has been shown in visual, auditory, somesthetic and motor systems. It is particularly dramatic at the cortical level, and has been shown by studies of reorganization following deafferentation, nerve section, and cortical lesions, and from studies of the effects of more subtle changes in the patterns of sensory input received (Kaas 1995; Gilbert 1995). These effects occur on various time scales, and include psychophysical evidence for fast perceptual learning in the visual sensory system of adult humans (e.g. Karni & Sagi 1991; Poggio et al. 1992; Polat & Sagi 1994b). These psychophysical effects are interpreted as being due to changes at sensory stages of processing because they are specific for eye, orientation, and spatial frequency, as well as for spatial position.
Some of these findings can be interpreted as changes in feedforward RF selectivity, such as when major reorganizations of sensory or motor maps occur. In other cases they are more likely to involve changes in the connections mediating intracortical contextual integration (Gilbert 1995). Consider the psychophysical findings reported by Polat and Sagi (1994b). Prior to practice the spatial range within which contextual stimuli facilitate target detection is up to about six times the spatial period of the target (Polat & Sagi 1993). After a few hours of practice, covering the whole range of separations to be spanned, the range of facilitation was increased by at least a factor of three (Polat & Sagi 1994b), probably by strengthening chains of local facilitatory interactions between filters with nearby but non-overlapping RFs.
There are also other ways in which psychophysical experiments could study the learning of contextual predictions. Consider, for example, the techniques for studying local contextual integration described in Sections 5.2, 5.3, and 5.5.2. Each of these could be used in paradigms where the predictive relationships between context and target are manipulated to see whether the effects of local context depend upon experience. There are already some findings suggesting that such studies would be worthwhile; e.g. i) the effects of strabismus outlined above show that experience affects both the CF connections and the probability of synchronization; ii) further studies of the achromotopsic patient WM suggest that the color-luminance interactions adapt to correlations that are experimentally induced between them (Troscianko et al. 1995); and iii) there are large practice effects in the texture segregation task outlined in Section 5.5.2.
An example of the effect of learning on cross-modal integration is provided by Durgin (1995) who presented random dot patterns that had a greater dot density on either the left or the right. A tone was presented simultaneously and its pitch was perfectly correlated with the side of greater density. After 180 such pairings, a staircase procedure was used to measure perceived equivalence between left and right at an intermediate density. Simultaneous presentation of a tone affected matching such that the side that had been more dense when that tone was presented during training was seen as being more dense than it should have been to give an accurate match. The extent to which such cross-modal contextual learning affects discrimination within modalities when later tested in the absence of the cross-modal stimulus is not yet clear, but it is clearly amenable to further psychophysical study. Physiological evidence for effects of auditory stimulation on activity in the visual cortex (e.g. Spinelli et al. 1968; Fishman & Michael 1973) encourages such experiments. Psychophysical studies of learned cross-modal effects upon grouping would be of particular interest given the evidence suggesting that contextual interactions play a major role in grouping.
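As a hedged illustration of how a staircase procedure can locate a point of perceived equivalence, the following sketch runs a generic 1-up/1-down staircase against a simulated observer. It is not a reconstruction of Durgin's (1995) procedure: the observer model, the `bias` parameter standing in for a learned tone-contingent shift, the density units, and all step sizes are hypothetical.

```python
import random


def simulated_observer(test, standard, bias, rng, noise=0.05):
    """Hypothetical observer: reports whether the test side looks denser than the standard.

    `bias` is added to the perceived density of the test side; `noise` is the
    standard deviation of Gaussian judgment noise.
    """
    return (test + bias) - standard + rng.gauss(0.0, noise) > 0.0


def staircase_pse(standard, bias, rng, start=0.8, step=0.02, n_reversals=40):
    """1-up/1-down staircase: lower the test density after 'denser', raise it otherwise.

    Returns the mean test density over the later reversals, an estimate of the
    point of subjective equality (PSE).
    """
    test = start
    last_response = None
    reversals = []
    while len(reversals) < n_reversals:
        says_denser = simulated_observer(test, standard, bias, rng)
        if last_response is not None and says_denser != last_response:
            reversals.append(test)  # response changed direction: record a reversal
        last_response = says_denser
        test += -step if says_denser else step
    tail = reversals[8:]  # discard early reversals from the initial approach
    return sum(tail) / len(tail)


if __name__ == "__main__":
    with_tone = staircase_pse(0.5, 0.05, random.Random(1))
    without_tone = staircase_pse(0.5, 0.0, random.Random(2))
    print(f"PSE with trained tone: {with_tone:.3f}; without: {without_tone:.3f}")
```

In this sketch a side that looks denser than it physically is (positive `bias`) needs a lower physical density to match the standard, so the estimated PSE falls below the standard by roughly the size of the bias.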
In sum, there are large effects of learning in both mature and immature sensory systems. These effects often include changes in RF connectivity, but changes in CF connectivity are also likely, as expected on the grounds that the contextual predictions would otherwise be invalidated by RF changes. Finally, psychophysical experiments designed specifically to study changes in contextual integration due to learning are both possible and worthwhile.
Ways in which binding through synchronization could help produce object recognition performance similar to that of humans have already been discussed in detail elsewhere (e.g. Hummel & Biederman 1992; Mozer et al. 1992) so we restrict ourselves to a brief outline. One central idea is that shape recognition could generalize well across irrelevant dimensions, such as position, if shape descriptors are insensitive to those dimensions. Synchronization is used to bind shapes to positions to show what shapes are where. The Hummel and Biederman (1992) model has seven layers through which image features are combined into parts, and then into structural descriptions of objects in terms of their parts and relationships. Synchronization is used to group image features into volumetric parts, and to bind parts to relationships, and is achieved through a network of fast enabling links that are similar to our CFs. Crucial aspects of human performance displayed by the model include recognition that generalizes well across position, size, left-right reflection, and rotation in depth, but poorly across rotation in the picture plane. Goebel (1993) developed a similar model that has more flexible synchronizing connections and which also incorporates mechanisms that produce performance consistent with psychophysical evidence for selective spatial attention. Like human performance, such systems are highly dependent upon internal grouping processes such that the same image grouped in different ways can give rise to very different outputs. These demonstrations of computational feasibility and similarity to human performance are encouraging, but direct tests of the hypothesis that timing on a fast time scale is important for grouping in object perception are also needed, and this is a major goal for further research.
Word perception is a major focus for studies of contextual integration within cognitive psychology, so here we discuss several ways in which the two areas of research can be related. We note similarities between context effects in word perception and in the perception of simple line elements, and conclude that as the latter can be studied by both psychophysical and physiological methods this may bring a rich new source of evidence to bear on cognitive conceptions of contextual integration. We relate this to dynamic grouping and to synchronized population codes, and outline neuropsychological evidence for such codes in word perception. Finally, we note a possible role for contextual guidance in learning to perceive words.
Various paradigms have shown that both letter and phoneme perception depend upon local context. Cattell (1886) showed that letters are recognized more accurately in the context of a familiar word, and the many subsequent studies of such phenomena include demonstrations that forced-choice discrimination between a pair of prespecified letters is better if the test letter appears within the context of a familiar word or pronounceable non-word (e.g. Reicher 1969; Johnston & McClelland 1973; Rumelhart & McClelland 1982). Strong effects of local context also occur within speech perception. For example, Massaro & Cohen (1983) presented computer generated syllables such as /sli/, /tri/, /sri/, and /tli/, to subjects and asked them to classify the middle phoneme as an /l/ or an /r/. This phoneme was presented at seven different levels on a continuum from being very /l/-like to being very /r/-like by varying the frequency from which the third formant (F3) began at the phoneme's onset. Each was factorially combined with one of four different leading consonants /s/, /t/, /p/, /v/. Perception of the central phoneme was predominantly determined by the direct acoustic cue to that phoneme as given by F3 frequency, but it was also affected by the preceding consonant, particularly when the direct cue was most ambiguous. Similar effects have been shown in studies of reading. For example, when an ambiguous lower-case letter that can be read as either an e or a c is placed in contexts that support one or the other alternative then identification is biased towards the contextually appropriate alternative (Massaro 1979).
These effects are similar to those in the perception of simple line element displays (e.g. Polat & Sagi 1993, 1994a,b; Kapadia et al. 1995) in several ways. Although they occur at a higher level of analysis, context effects in word perception also occur rapidly and automatically. The effects are locally specific in that they depend upon the particular properties of the entities that are interacting. For example, the context /s/?/i/ supports some target letters and not others, just as the context of two collinear oriented line elements supports some intervening oriented line elements and not others. Furthermore, in both cases the interactions are such as to emphasize things that are expected in that context, rather than things that are unexpected. In both domains the effects of context on the perception of individual elements seem to be greatest when the direct stimulus evidence is most ambiguous. Yet another similarity is that contextual interactions in both domains are affected by learning. Finally, there are effects of object-specific knowledge on the detection of line segments (e.g. Weisstein & Harris 1974; McClelland 1978) that may be analogous to the effects of word-specific knowledge on the forced-choice discrimination between letters. At present, therefore, there seems to be enough similarity to justify the search for a common explanation for these various effects of context.
Theories of contextual interaction in word perception have been undergoing vigorous development for many years and can now account for many details of performance with impressive precision (e.g. McClelland & Rumelhart 1981; Rumelhart & McClelland 1982; McClelland & Elman 1986; Richman & Simon 1989; Massaro 1989a). The problem is not that no explanation is available but that each of several diverse theories can fit the data so well that it is difficult to know what inferences to draw. Thus we need to find common elements, or fundamental mechanisms, that may appear in different forms in the different theories, and which are crucial to the cognitive functions with which they are concerned (Richman & Simon 1989). As the theories all aim at generality one way to do this is to widen the range of phenomena to which they are applied, including psychological tasks that can be related to known physiological mechanisms if possible. It is particularly appropriate to relate Interactive Activation and Competition (IAC) theories (e.g. McClelland & Rumelhart 1981) to anatomy and physiology because they explicitly use a neural style of computation, and played a major role in promoting the use of that style in cognitive theory. Basic aspects of these models that are broadly compatible with what is known about cortical physiology include having just a few different levels of analysis; having many replications of processors spanning feature space at different input positions within a level; and having local inhibitory relations to force a choice between incompatible alternatives within levels. For further in-depth discussions of these theories and related issues see McClelland (1991), Massaro and Cohen (1991), and Movellan and McClelland (1995).
We now discuss five major unresolved issues from the perspective of our general approach and of the analogy between context effects in the perception of words and of simple line elements:
1. What is the architecture of information flow, and in particular to what extent do the effects of context depend upon the feedback of information from higher to lower levels of analysis?
2. How does contextual information affect processing, and in particular does it do so in essentially the same way as direct evidence from the target?
3. Do the mechanisms producing the effects of context on the perception of ambiguous or just detectable stimulus elements also play a role in dynamically grouping those elements?
4. To what extent does each level use local as opposed to population codes for those entities with which it is concerned?
5. Is the goal of maximizing coherence between distinct streams of processing relevant to learning within streams? (see Note 10)
1. Where does the contextual information received by local processors come from, and in particular does it include feedback from higher levels of analysis? The detailed properties of context effects in the perception of line elements strongly suggest that they are due in part to long-range horizontal collaterals that directly link distinct entities within the same level of analysis (e.g. Singer 1995; Kapadia et al. 1995). If there are connections that are specialized to mediate contextual interactions within the segmental (i.e. phonemic or letter) levels of word perception, for example, then this could contribute to the superiority of regular over irregular nonwords, and may help explain some of the more rapidly occurring components of contextual interaction. Contextual connections within segmental levels may not be the best way to explain the effects of word-specific knowledge, however, and these are often assumed to involve activity at a higher level of analysis that is specialized to distinguish words. The question that then arises is whether activity at this level influences processing at the segmental levels. This issue is unresolved but an analogous issue can be studied at lower levels of visual processing using physiological techniques. Studies of the role of feedback from V1 to LGN provide evidence that it synchronizes the firing of those LGN cells that combine to form higher level entities in V1 (Sillito et al. 1994). This suggests that higher levels do influence processing at lower levels but in a way that is distinct from the ascending RF input and more like the process of contextual integration within levels. 
As feedback connections are ubiquitous within the cortex, this could be a general feature of cortical processing. It would therefore be worthwhile obtaining further evidence on it by combining psychophysical measures with electrophysiological measures in V1, as in Kapadia et al. (1995), but measuring both rate and relative timing in multiple single unit recordings, and using stimulus elements that either do or do not combine to form a single familiar entity at a higher level. The possibility that a major role of feedback is to group activity at lower levels will be examined further under point 3 below, which discusses grouping processes in word perception.
2. Is contextual information used in essentially the same way as target information in word perception? If so then contextual integration is not a fundamental issue within that domain. Note that as contextual interactions may be reciprocal the question is not whether some parts of a word are processed just as targets and others just as contexts, but whether words are composed of distinct but interacting parts such that the processes by which they interact differ from those by which they are kept distinct.
One of the clearest ways of distinguishing between contextual and direct stimulus effects in word perception is that it is the direct stimulus input and not the context which determines the alternatives between which choice is made; i.e. context influences the choice between those competing alternatives for which there is direct but ambiguous stimulus support. (see Note 11) As evidence for this view Massaro (1989b) notes that context does not by itself produce the phoneme restoration effect (Warren 1970) because some bottom-up support for the presence of a missing phoneme is required, even if only in the form of a brief noise burst. This possible asymmetry in the roles of context and target is not apparent in the experiment of Massaro and Cohen (1983) because target support was always available. It should therefore be possible to make it more apparent by presenting stimuli such as /li/, /ri/, /si/, /ti/, /sli/, and /tri/ with various levels of ambiguity for the /l/ or /r/ phoneme and asking subjects to decide whether an /l/ or /r/ or neither is present. The distinction being proposed here predicts that context will have a large effect in the presence of an ambiguous target while having little or no effect in its absence. In contrast, the direct stimulus cues should have large effects whether the context is present or not. Such a result would not be surprising but it would show how the asymmetry in the roles of RFs and CFs can be reflected in performance. Furthermore, this asymmetry in dependence would contrast with the various ways in which information from different sources can be combined such that they all influence decision in essentially the same way (e.g. Massaro & Friedman 1990), as do different sources of information from within the RF. An important aspect of this latter form of combination is that, although the different sources may be independent prior to combination, their individual contributions are not kept distinct in the output decisions produced. 
Thus, in contrast with the effects of CF inputs, all RF inputs help determine the meaning of the decisions to which they contribute.
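The predicted asymmetry can be made concrete with a small sketch that contrasts two ways of combining target and context. Both rules are assumptions chosen for illustration: `symmetric_integration` is a multiplicative rule in which each source is a support value between 0 and 1 and both are treated alike, and `modulatory_integration` uses an assumed activation 0.5 r (1 + exp(2 r c)) in which signed target evidence r is driving and signed context c only modulates. Neither function name nor any parameter value comes from the studies cited.

```python
import math


def symmetric_integration(t: float, c: float) -> float:
    """P(/l/) when target support t and context support c (each in (0, 1)) are treated alike."""
    return (t * c) / (t * c + (1.0 - t) * (1.0 - c))


def modulatory_integration(r: float, c: float) -> float:
    """P(/l/) when signed target evidence r drives and signed context c only modulates.

    Positive values favor /l/, negative values favor /r/ (an assumed convention).
    """
    a = 0.5 * r * (1.0 + math.exp(2.0 * r * c))
    return 1.0 / (1.0 + math.exp(-a))


if __name__ == "__main__":
    # No target evidence at all: symmetric rule follows the context, modulatory rule does not.
    print(symmetric_integration(0.5, 0.9), modulatory_integration(0.0, 2.0))
    # Weak, ambiguous target evidence: context now shifts the modulatory output.
    print(modulatory_integration(0.3, -2.0),
          modulatory_integration(0.3, 0.0),
          modulatory_integration(0.3, 2.0))
```

With no target evidence the symmetric rule simply returns the context's support, whereas the modulatory rule stays at chance whatever the context. Once some ambiguous target evidence is present, the same context shifts the modulatory output in the contextually appropriate direction.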
The distinction between the roles of RFs and CFs may also be relevant to the long-running debate between theories that emphasize the use of internal knowledge to go beyond the input from external stimuli and theories that emphasize remaining faithful to that input so as to avoid hallucination. The latter danger is often noted in discussions of the effects of context in word perception (e.g. Massaro 1989a; Massaro & Cohen 1991), and can be used as an argument for assuming that context and target do not interact. If context has distinct effects upon processing, then instead of having to choose between avoiding hallucinations and allowing contextual interaction we can have both, including direct contextual interactions within levels, which might otherwise overwhelm stimulus processing with hallucination.
3. Sections 5.2 and 5.3 outlined evidence suggesting that for simple line element displays the grouping of elements into coherent wholes depends upon the same knowledge embodied in the same mechanisms as do the effects of context on the perception of the individual elements. If this is so, and if contextual integration is achieved in the same way in different regions, then it will also apply to word perception. We take it for granted that grouping processes are a crucial part of word perception at both lexical and sub-lexical levels, and this is easily demonstrated. At the lexical level, for example: thismustbegroupeddynamicallyusingknowledgeofspecificwords. Internal grouping processes also occur at sub-lexical levels. PIGHAM, for example, will be pronounced either with or without consonants in the middle depending upon whether or not IGH is grouped to form one grapheme. If grouping is computed dynamically, as it must be if it is signaled by relative timing, then such groupings could change rapidly from moment to moment, thus making various possible alternative groupings successively available. Studies of a neurological patient, who will be discussed further below, illustrate the relevance of such dynamic grouping processes to word perception. She was quite unable to read PIGHAM as a single unfamiliar nonword, but could read it easily when she saw it as two familiar words (Goodall & Phillips 1994). She made it clear in various ways, e.g. by drawing a pencil line between the appropriate letters, that this involved feedback of grouping information from a lexical level to a level containing a precise topographic map of the individual letters.
There is also evidence from normal subjects that the effects of word familiarity on perception involve internal grouping processes. For example, familiarity reduces asymmetrical left-to-right letter position effects that can be explained as being due to processing letters separately in unfamiliar stimuli (Phillips 1971). Many other effects can also be explained as being due to processing familiar items as a single coherent whole, or "chunk", but processing unfamiliar items as a number of separate chunks (Richman & Simon 1989). One implication of these considerations is that in order to understand the role of feedback in word perception it may be worthwhile emphasizing tasks that reveal the effects of grouping processes. For example, if grouping and disambiguating involve a common mechanism then disambiguating interactions between elements should depend upon whether those elements are grouped together or not. Consider the disambiguating effects of local context in the experiments of Massaro and Cohen (1983), for example. In those experiments the phonemes that interacted were always part of a single word. If more than one word were presented then it would be possible to test whether or not the interaction between neighboring phonemes depends upon their being perceived as parts of the same word or phrase.
4. Are familiar words signaled by local or by population codes? A fundamental difference between these two possibilities is that population codes can transmit information about inner structure but local codes cannot. This difference was used to provide evidence on this issue by studies of two neuropsychological patients whose ability to read and write is very largely restricted to words with which they are familiar (Goodall 1994; Goodall & Phillips 1994; Phillips & Goodall 1994). Their reading and writing therefore provides a direct window on the contribution of lexical knowledge when isolated from sub-lexical