Representation, similarity, visual shape recognition, categorization, perception, features, invariance, mental models, affordance, constancy, distal/proximal stimulus, isomorphism.
Advanced perceptual systems are faced with the problem of securing a principled (ideally, veridical) relationship between the world and its internal representation. I propose a unified approach to visual representation, addressing the need for superordinate and basic-level categorization and for the identification of specific instances of familiar categories. According to the proposed theory, a shape is represented internally by the responses of a small number of tuned modules, each broadly selective for some reference shape, whose similarity to the stimulus it measures. This amounts to embedding the stimulus in a low-dimensional proximal shape space spanned by the outputs of the active modules. This shape space supports representations of distal shape similarities that are veridical as Shepard's (1968) second-order isomorphisms (i.e., correspondence between distal and proximal similarities among shapes, rather than between distal shapes and their proximal representations). Representation in terms of similarities to reference shapes supports processing (e.g., discrimination) of shapes that are radically different from the reference ones, without the need for the computationally problematic decomposition into parts required by other theories. Furthermore, a general expression for similarity between two stimuli, based on comparisons to reference shapes, can be used to derive models of perceived similarity ranging from continuous, symmetric, and hierarchical, as in multidimensional scaling [Shepard, 1980], to discrete and non-hierarchical, as in the general contrast models [Tversky, 1977; Shepard and Arabie, 1979].
The traditional answer to this question has long been similarity. According to this view, which originated with Aristotle, an internal entity represents an external object by virtue of resemblance or isomorphism between the two: the representation of a tomato has something of the redness and of the roundness of the real thing.
Echoes of this idea, inherited by Berkeley and Hume from the Scholastics, can be found in present-day sources: ``Representation of something is an image, model, or reproduction of that thing'' [Suppes et al., 1994]. Clearly, no one these days believes that a representation of a cat in an observer's brain is cat-shaped (or striped, or fluffy); rather, it is construed as a set of measurements which collectively encode the geometry and other visual qualities of a cat. Nevertheless, the philosophical foundation of the current theories of shape representation is still isomorphism: typically, it is assumed that structural [Biederman, 1987] or metric [Ullman, 1989] information stored in the brain reflects corresponding properties of shapes in the world, on a one-to-one basis.
Apart from having philosophical problems [Cummins, 1989], this approach also presents a formidable computational challenge if the representation is to be veridical (i.e., if the geometry of each viewed shape is to be faithfully reconstructed from the proximal stimulus [Edelman, 1997]). Given the inherent imperfections and distortions introduced by the sensory channels (as manifested in the plethora of visual illusions), it is perhaps not too surprising that human perception of shape falls short of veridicality in a variety of tasks, such as the estimation of local surface orientation [Koenderink et al., 1996], local curvature [Phillips and Todd, 1996], or even object size [Gregson and Britton, 1990]. Now, it is certainly possible to learn fascinating lessons about the workings of the human visual system from the study of the cases in which it behaves nonveridically, nonlinearly, or downright peculiarly [Gregory, 1978; Gregson, 1988]. Nevertheless, the central goal of this target article --- understanding how representation is possible at all --- is probably better pursued by considering the cases in which the representations used by the visual system do lead to veridical perception. As we shall see, lessons that can be drawn from these cases suggest a philosophically appealing and formally veridical approach to representation that turns out to be computationally feasible.
How do people happen to be better judges of similarities among shapes than perceivers of shape? This state of affairs should be expected if the visual system seeks a second-order isomorphism [Shepard, 1968] between similarities among shapes and similarities among the internal representations they induce, instead of a first-order isomorphism between the shapes and their representations. Quoting Shepard and Chipman (1970, p.2), ``the isomorphism should be sought --- not in the first-order relation between (a) an individual object, and (b) its corresponding internal representation --- but in the second-order relation between (a) the relations among alternative external objects, and (b) the relations among their corresponding internal representations. Thus, although the internal representation for a square need not itself be square, it should (whatever it is) at least have a closer functional relation to the internal representation for a rectangle than to that, say, for a green flash or the taste of a persimmon.'' Essentially, this is a call for the representation of similarity instead of representation by similarity (see Figure 1).[Note 2]
Computationally, the problem of representation can be addressed on several levels (cf. Marr, 1976). On the abstract level, the concern is to come up with an appropriate mathematical formulation, one that would make the representation well-posed and tractable. The idea of second-order isomorphism does in fact lead to a well-defined computational notion of representation: according to this idea, to represent a collection of objects means to reflect in a consistent manner any change an object may undergo.
By and large, this notion of representation is conceptually orthogonal to the reconstructionist approach: the tokens standing for objects need not resemble the objects themselves (see Figure 1). Although representation by second-order isomorphism does reduce to plain reconstruction if the represented quantities correspond to distances among densely spaced points situated on the surface of an object,[Note 3] such a reduction is unwarranted; apart from placing a heavy computational burden on the perceptual system, it serves no useful purpose. As noted by Shepard and Chipman (1970, p.3), ``it only attempts the absurdity of putting off until later the whole process of pattern recognition that must by definition precede the pivotal event in question'' (i.e., the delivery of a representation capable of supporting perceptual judgment and categorization).
On the algorithmic level, representation by second-order isomorphism calls for ensuring that the similarities between (necessarily proximal) perceived entities correspond in some orderly fashion to the distal similarities between objects. Now, a mechanism tuned to a particular shape provides a convenient way to estimate the similarity between the current stimulus and a reference stimulus, if its response falls off monotonically with the extent of the (distal) deviation of the current stimulus from the preferred one. This monotonic relationship between proximal and distal similarities provides the requisite algorithmic basis for veridical representation: as in nonmetric multidimensional scaling [Shepard, 1962; Kruskal, 1964], the rank order of the proximal similarities, being the same as the rank order of the distal similarities, allows the recovery of the distal configuration of the stimuli in some underlying parametric space [Edelman, 1995b].
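The monotonicity argument can be illustrated with a minimal sketch. It assumes a Gaussian tuning profile and a two-dimensional parameter space; both assumptions, along with the names `tuned_response` and `width`, are illustrative and are not taken from the text. The point is only that a response falling off monotonically with distal distance yields the same rank order as the distances themselves.

```python
import math

def tuned_response(stimulus, reference, width=1.0):
    """Response of a hypothetical module tuned to `reference`.

    A Gaussian profile is assumed here; any monotonically decreasing
    function of distance would serve the argument equally well.
    """
    d2 = sum((s - r) ** 2 for s, r in zip(stimulus, reference))
    return math.exp(-d2 / (2.0 * width ** 2))

reference = (0.0, 0.0)
stimuli = [(0.5, 0.0), (1.0, 1.0), (0.1, 0.2), (2.0, -1.0)]

# Indices of the stimuli ordered by distal distance (nearest first).
by_distance = sorted(range(len(stimuli)),
                     key=lambda i: math.dist(stimuli[i], reference))

# Indices of the stimuli ordered by module response (strongest first).
by_response = sorted(range(len(stimuli)),
                     key=lambda i: -tuned_response(stimuli[i], reference))

print(by_distance, by_response)
```

The two orderings coincide, which is all that nonmetric scaling requires of the proximal measurements.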
On the implementational level, the challenge, then, is to identify a mechanism (biological or artificial) capable of responding selectively to certain shapes. A generic connectionist classifier trained on the recognition of a particular class of objects provides the requisite implementational substrate; a particular classification architecture (namely, the regularization networks of Poggio and Girosi, 1990) may be preferred on the grounds of biological plausibility.
An adequate computational solution, spanning all three levels, would exert a decisive influence on the philosophical outlook on the problem of representation. At the very least, familiar dogmas would have to be reassessed, and the relative merit of competing proposals reevaluated. The developments of recent years in the computational, psychophysical, and neurobiological studies of visual representation suggest that the time for such a revision has come. In the remainder of this paper, I survey some of the relevant developments and suggest a way to relate them to some of the current views on the issue of representation in the philosophy of mind.
Similarity between objects can be defined via an embedding of the objects into a metric space where it is then determined by the distance between the points corresponding to each object. Rather than postulating a unique true distal similarity space for shapes (a notion that would appeal only to an extreme Platonist), I propose to consider an arbitrary space of the required kind, and to show later on that the exact choice of the space is not critical.[Note 4] What should be required of such a distal shape space? Under second-order isomorphism, changes of shape, not the shapes themselves, are to be represented. According to this view, changing a shape corresponds to a movement of the point encoding the shape in an appropriate parameter space. To allow metamorphosis within a certain class of objects, all the members of that class must admit a common parametrization.
Although modern computer graphics offer a number of approaches to a common parametrization for a very wide spectrum of possible shape morphing [Pentland and Sclaroff, 1991; Galin and Akkouche, 1996] (see also appendix A), it is unrealistic to expect that a structure of similarities common to extremely disparate shapes will carry over into a cognitive system (the need to judge the similarity between objects from widely disparate categories arises rarely, if ever). Different object classes may, therefore, be encoded by different sets of parameters.
To some extent, the ease with which a common parametrization can be constructed for a set of objects probably depends on the degree of their membership in the same natural kind [Quine, 1969] of shapes (say, quadruped animals), or in the same artificial shape category (office tables). If any shape were equally likely (for a ``medium-sized'' count noun object), the burden of representing the visual world would be, I suspect, much heavier.
A degree of caution is called for when interpreting this state of affairs. First, the applicability of multidimensional scaling is ultimately determined by the relevance of the resulting solution: ``Even though it is always the case that, if we are prepared to tolerate a high enough dimensionality and if we are prepared to tolerate degenerate, clustered, or lumpy configurations, we can get a spatial representation, ultimately, the criterion for accepting a representation is the sense that can be made of it, and the results that can be retrieved or predicted, by rules invariant over the space, from it'' (Gregson 1975, p.134).
Second, one should not assume too lightly that the internal similarity space is metric in the full sense used in, say, differential geometry. In that space, as pointed out by Clark (1993, p.147), ``Distances are monotonically related to similarities, but there is no presumption that sums or ratios of distances are interpretable. There may be no common unit to express distances along different axes.'' Fortunately, in visual shape processing these concerns seem to be largely mitigated; in section 7, we shall see that both the metric space assumption and the applicability of MDS are justified by the human performance data in a variety of shape perception tasks.
The metric-space definition of internal similarity seems to fall short of explaining such prominent phenomena in the perception of similarity as subjectivity, task dependence, and asymmetry [Tversky, 1977; Tversky and Gati, 1978; Nosofsky, 1991; Medin et al., 1993]. These shortcomings are only superficial, however. In particular, while the metric-space model makes it possible to speak about objective distal similarity (a prerequisite for a realist ontology of visual shapes), the perceptual system of the observer can warp the objective similarity space, according to his or her or its idiosyncrasies, and to the dictates of the task [Harnad, 1987; Goldstone, 1994]. Furthermore, similarity need not remain restricted by the symmetry that it inherits from the underlying distance function; the metric-space model can be considered a starting point for a more realistic definition, of the kind proposed, for example, by Krumhansl (1978). Indeed, as I shall argue in section 5, a distance-based definition of similarity does not preclude modeling a considerable variety of similarity-related phenomena in human perception.
The possibility of a principled quantification of both the distal and the proximal shape similarity addresses the first problem faced by the proposed theory of representation: what to represent. The next question --- how to communicate similarity relationships induced by a given distal shape space structure across the gap separating the world from the observer --- is addressed in the following section.
The above hierarchy is clearly not the only possible way to define the fidelity of the representation mapping. If the representation is to be used mainly for classification, one may require points that are separable under some parametric decision surface in the original space to remain so following the mapping (this is in contrast to the distance-based requirements, which are nonparametric). For example, if points in the original shape space tend to form linearly separable clusters, one may require that the clusters remain linearly separable under the mapping. Moreover, one may also require that clusters that are not originally linearly separable become so under the mapping [Cortes and Vapnik, 1995]. These considerations are beyond the main concern of the present section, which is to specify a minimal computational basis for the processes that operate on the representation space. Still, if the original-space configuration of stimuli allows an efficient remapping that makes explicit an underlying structure of linearly separable clusters, this possibility must remain open following the mapping into the representation space. Whereas the lowest-fidelity (distinction-preserving) representation does not necessarily preserve such properties, the highest-fidelity (similarity-preserving) representation clearly does.
Locally, the rank preservation requirement is satisfied by any well-behaved (that is, smooth and invertible) mapping [Cohn, 1967]. Such mappings are conformal, that is, they preserve angles, and, therefore, also the similitude of small triangles (see appendix B). In particular, a scalene triangle formed by a triplet of points in a distal shape space will be mapped into a triangle with the same ranking of side lengths in the proximal representation space (see Figure 1).
Geometry. The function f1 maps the distal parameter-space description of the object into its geometry (e.g., the coordinates of the vertices of a fine mesh, suitable for rendering by a graphics system).
Imaging. The function f2(p;z) maps the object's geometry into the image on the receptor surface of the visual system. Its dependence on the shape parameters p is determined by the prior action of f1 and is written down explicitly for convenience; the dependence on the viewing conditions z is, however, peculiar to f2.
Measurements. The function f3(p;z) corresponds to the set of internal measurements performed on the image. In a typical model of biological vision, each measurement stage consists of a convolution with a number of filters, followed by the application of a nonlinearity.
Dimensionality reduction. The function f4(p) maps the measurement space into a low dimensional representation of the shape space, while removing the dependence on the viewing conditions z. The low dimensionality of the ultimate internal shape space reflects the corresponding characteristic of the distal parameter space; it is also important for reasons of computational tractability [Edelman and Intrator, 1997].
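Read together, the four stages can be written as a single composite mapping. This is a notational sketch only: the intermediate symbols g (geometry), I (image), m (measurements), and r (representation) are introduced here for illustration and do not appear in the original, where f2(p;z) and f3(p;z) are written directly in terms of the shape parameters p.

```latex
\begin{align*}
g &= f_1(p)      && \text{geometry from distal shape parameters} \\
I &= f_2(g; z)   && \text{image under viewing conditions } z \\
m &= f_3(I)      && \text{high-dimensional measurements on the image} \\
r &= f_4(m)      && \text{low-dimensional representation, ideally independent of } z \\
M(p; z) &= f_4\bigl(f_3\bigl(f_2(f_1(p); z)\bigr)\bigr)
\end{align*}
```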
Note that the second component of M, the view mapping f2, introduces a dependence on variables z which are extraneous to the shape parameters that are to be represented. These variables encode the orientation of the object with respect to the observer, to the light sources, and to the other objects in the scene. Their influence must be counteracted by the perceptual system, through the combined action of measurement and dimensionality reduction (the composition f4 ∘ f3), to reduce the likelihood that two nearby parameter-space points (i.e., two similar shapes) are mapped into widely disparate points in the final representation space. Absolute invariance with respect to these variables is not necessary; it is only required that changes in shape space influence the measurements more strongly than view-space changes (Edelman and Duvdevani-Bar, 1997b; more on this in section 4). Furthermore, not all the dimensions of z have to be treated by the same mechanism: image-plane translation can be compensated for by a covert shift of attention [Anderson and Van Essen, 1987] or an overt one (such as a saccadic eye movement); variation in apparent size, by global scaling using a hard-wired mechanism [Schwartz, 1985]; and rotation in depth, by learning an appropriate normalizing mapping specific for each object class [Poggio and Edelman, 1990; Lando and Edelman, 1995].
As pointed out above, the preservation of distance ranks implies that any change in the distal parameter space must be reflected in the final low-dimensional representation (if some of the original dimensions collapse under the representation, distances between points are likely to be distorted). To ensure that as many as possible of the original dimensions of variation among the distal objects are preserved, it is worthwhile to make as many varied measurements as possible. This makes the measurement space (defined by the action of f3) high-dimensional, and necessitates subsequent dimensionality reduction (through the action of f4). In a flexible system, dimensionality reduction would have to involve learning to find informative dimensions, depending on the statistics of the input and (if available) on additional knowledge provided by the environment (for an introduction to this aspect of representation, see, e.g., Intrator, 1993).
The problem of locating X within S is analogous to the problem of determining the exact location of a point on a terrain, which arises in navigation and in the preparation of topographical maps. In topography, this problem can be solved by triangulation: the location of the point is computed from bearings taken to a number of landmarks whose coordinates are known. Likewise, the location of a point in the shape space can be found from its disposition with respect to a number of reference points, known to belong to the same space (``terrain''). This approach leads to a straightforward implementation of representation by second-order isomorphism, as described in the next section.
If the classifier's response also falls off gradually and monotonically with parameter-space distance from its preferred stimulus (the shape on which it has been trained; see Figure 4), it can be used to pinpoint the location of the test stimulus in the shape space, by a process related to triangulation and to nonmetric multidimensional scaling [Edelman, 1995b]. Note that a number of classifiers, each tuned to a different reference point, must be activated (just as in triangulation a number of landmarks must be used for each measurement).
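The landmark analogy can be made concrete with a small sketch of locating a point in a plane from its distances to three known landmarks. It assumes that exact distances are available, which is a simplification: in the scheme described here the modules deliver only monotonic functions of distance. The function name `locate` and the example coordinates are hypothetical.

```python
import math

def locate(landmarks, distances):
    """Recover (x, y) from distances to three non-collinear landmarks."""
    (x0, y0), (x1, y1), (x2, y2) = landmarks
    d0, d1, d2 = distances
    # Subtracting the first circle equation from the other two cancels the
    # quadratic terms and leaves a 2x2 linear system in x and y.
    a11, a12 = 2 * (x1 - x0), 2 * (y1 - y0)
    a21, a22 = 2 * (x2 - x0), 2 * (y2 - y0)
    b1 = (x1 ** 2 + y1 ** 2 - x0 ** 2 - y0 ** 2) - (d1 ** 2 - d0 ** 2)
    b2 = (x2 ** 2 + y2 ** 2 - x0 ** 2 - y0 ** 2) - (d2 ** 2 - d0 ** 2)
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("landmarks are collinear; position is ambiguous")
    # Cramer's rule for the 2x2 system.
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y

landmarks = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
target = (1.0, 2.0)
distances = [math.dist(target, lm) for lm in landmarks]
estimate = locate(landmarks, distances)
print(estimate)
```

With only two landmarks the position would be ambiguous, which mirrors the remark that several reference-tuned classifiers must be active at once.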
An ensemble, or Chorus [Edelman, 1995b], of k classifiers maps the distal shape space to a proximal representation space, R^k. If the response of each classifier degrades gracefully with the dissimilarity between the test stimulus and the preferred shape, the entire ensemble realizes a mapping M which is smooth and regular. Thus, the distal-to-proximal mapping is conformal[Note 8] and can therefore serve as a substrate for veridical representation of the original parameter space, as argued in section 3.2.1.
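A toy version of such an ensemble is sketched below, with k = 3 Gaussian modules over a two-dimensional shape space. The Gaussian profile, the width, the reference positions, and the three test shapes are all assumptions made for illustration. The sketch checks, for this one configuration only, that the ranking of pairwise distances among the test shapes is the same in the distal space and in R^3.

```python
import math
from itertools import combinations

def chorus(stimulus, references, width=2.0):
    """Map a stimulus to the vector of responses of the reference-tuned modules."""
    return tuple(
        math.exp(-math.dist(stimulus, ref) ** 2 / (2.0 * width ** 2))
        for ref in references
    )

def ranked_pairs(points):
    """Pairs of point names, ordered from the closest pair to the farthest."""
    pairs = combinations(sorted(points), 2)
    return sorted(pairs, key=lambda pr: math.dist(points[pr[0]], points[pr[1]]))

# Hypothetical reference shapes and test shapes in a 2-D distal parameter space.
references = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shapes = {"A": (0.4, 0.4), "B": (0.5, 0.4), "C": (0.4, 0.7)}

# Proximal representation: one coordinate per module.
proximal = {name: chorus(p, references) for name, p in shapes.items()}

distal_order = ranked_pairs(shapes)
proximal_order = ranked_pairs(proximal)
print(distal_order, proximal_order)
```

Agreement here is a property of this example, not a proof; how well rank order survives in general depends on the tuning widths and on the placement of the reference shapes.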
The main reason to use a bank of classifiers, rather than raw measurement-space distances to reference points, for pinpointing the current stimulus is that a classifier can be trained to ignore those directions in the measurement space that are irrelevant to the identity of the stimulus (e.g., directions corresponding to changes in the viewpoint parameters z). Connectionist modelers have realized in the past that the response change caused by moving the stimulus away from a stored exemplar should depend on the direction of movement if the space of admissible exemplars is a low-dimensional manifold immersed in the representation space. Specifically, moving along a tangent to that manifold should incur a smaller generalization cost than moving in a direction perpendicular to it. This insight has been incorporated into algorithms that train for invariance by differential reinforcement of stimuli removed in the tangent and the normal directions to the target manifold [Simard et al., 1992]. In Chorus, invariance is not a goal, but rather a precondition that must be fulfilled for the resulting representation to be veridical. Furthermore, absolute invariance is not necessary: it suffices that the structure of categories, as defined by appropriate metrics in the low-dimensional proximal representation space, not be distorted by the irrelevant components of distance, measured along the extraneous dimensions z.
Training classifiers for particular stimuli, as it is done in Chorus, can be interpreted as downplaying the irrelevant dimensions by switching from the measurement-space metrics to representation-space metrics, induced by the class identities [Baxter, 1995]. This property of the space spanned by the outputs of classifiers is important for devising better classification schemes. A typical example is vector quantization --- a representational scheme in which the location of a point in a multidimensional space is coded by the identity of its nearest neighbor, chosen from a small set of points covering the space. In Baxter's (1995) canonical vector quantization, the distances to the covering points are computed according to the classifier metrics, and not the raw vector space metrics.
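The effect of the metric on vector quantization can be seen in a minimal sketch. A per-dimension weighted Euclidean distance stands in here for a classifier-induced metric; this is an assumption made for illustration and is not Baxter's canonical distortion measure. The two-point codebook and the (shape, view) coordinates are hypothetical.

```python
import math

def weighted_distance(x, y, weights):
    """Euclidean distance with per-dimension weights (weight 0 ignores a dimension)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def quantize(point, codebook, weights):
    """Code a point by the index of its nearest codebook entry under the given metric."""
    return min(range(len(codebook)),
               key=lambda i: weighted_distance(point, codebook[i], weights))

# Each point is (shape coordinate, view coordinate).
codebook = [(0.0, 0.0), (1.0, 3.0)]
stimulus = (0.9, 0.2)

# Raw measurement-space metric: the view difference dominates.
raw_code = quantize(stimulus, codebook, weights=(1.0, 1.0))
# Metric that ignores the irrelevant view dimension.
tuned_code = quantize(stimulus, codebook, weights=(1.0, 0.0))
print(raw_code, tuned_code)
```

Under the raw metric the stimulus is assigned to entry 0 because of a view mismatch; once the view dimension is discounted it is assigned to entry 1, whose shape it nearly matches.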
In comparison to the canonical vector quantization, in Chorus, the primary goal is representation, not classification. Accordingly, the computational question to be addressed is not whether the nearest-neighbor structure makes more sense when measured in the classifier space compared to the measurement space, but rather, to what extent the classifier-space distance structure of an arbitrary set of points reflects the corresponding structure in some low-dimensional distal parametrization. A preliminary empirical exploration indicates that classifier-space distances are indeed likely to behave in the desirable fashion [Edelman and Duvdevani-Bar, 1997a]. The mathematical reason behind this property of Chorus may be its relationship to a powerful method of dimensionality reduction [Bourgain, 1985; Linial et al., 1994], in which points belonging to a multidimensional space are embedded into a space of much lower dimensionality, while preserving to a large extent the original interpoint distances. In Bourgain's embedding of a finite set of points, the locations of the points in the new space are encoded by their distances from randomly chosen subsets of the original set, which serve as reference entities. Distances to reference points are measured in Chorus, too: the response of a classifier trained on a reference pattern constitutes such a measurement, with the added advantage of tuning out the irrelevant dimensions. Thus, the use of classifiers in Chorus makes Bourgain's principle of dimensionality reduction applicable in a situation where ``noise'' dimensions abound.
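The idea of coding points by their distances to reference subsets can be sketched as follows. This is a simplified illustration, not Bourgain's construction: the subset sizes, the number of coordinates, and the random data are arbitrary choices, and no distortion guarantee is claimed. The one property that does hold by the triangle inequality is that no single coordinate can exaggerate an original distance.

```python
import math
import random

def reference_subsets(points, n_coords, rng):
    """Draw random subsets of the point set to serve as reference entities."""
    subsets = []
    for _ in range(n_coords):
        size = rng.randint(1, len(points))
        subsets.append(rng.sample(points, size))
    return subsets

def embed(point, subsets):
    """One coordinate per subset: the distance from the point to the nearest member."""
    return tuple(min(math.dist(point, q) for q in subset) for subset in subsets)

rng = random.Random(0)
# Twenty hypothetical points in a 10-dimensional measurement space.
points = [tuple(rng.uniform(-1.0, 1.0) for _ in range(10)) for _ in range(20)]
subsets = reference_subsets(points, n_coords=4, rng=rng)
embedded = [embed(p, subsets) for p in points]

# Largest per-coordinate difference between two embedded points.
def max_coord_gap(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

print(len(embedded[0]), max_coord_gap(embedded[0], embedded[1]))
```

In Chorus the role of the distance measurement is played by the response of a trained classifier, which additionally discounts the irrelevant dimensions; the sketch above uses plain Euclidean distance and so omits that advantage.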
The major obstacle to be overcome at the basic level is the dependence of the appearance of the stimulus, X, on factors such as illumination and viewpoint, in addition to the category membership index j. If Cj is taken to correspond to the image of a member of j in some canonical orientation, the viewing conditions can be seen to span a view space Vj, which is transverse to the class space C, and pierces it at C=Cj (see Figure 3). A general-purpose function approximation module [Poggio and Edelman, 1990] trained to implement the ``view normalization'' mapping T(j) : Vj -> Cj can perform basic-level categorization because its response can be made largely independent of the viewing conditions.
The required insensitivity of shape-space localization to viewpoint transformations stems from two sources. First, experience shows that hyperacuity can be attained despite considerable random misalignment of the stimulus as a whole, relative to its ``home'' or training pose, probably due to the shallow and overlapping profiles of the individual receptive fields [Poggio et al., 1992]. Second, explicit training for invariance with respect to ``irrelevant'' transformations can complement the inherent tolerance of the receptive-field system. Importantly, once learned from examples, the normalizing transformation T(j) can work even for stimuli not previously encountered by the system, provided that they belong to the same class as the examples used for training. The simplest approach here is to apply to a novel stimulus a transformation that is the average of the normalizing transformations learned for the class to which the stimulus belongs [Lando and Edelman, 1995].
The second task is to characterize a superordinate-level category of the input image, and not merely decide whether it is likely to be the image of a familiar object. This can be done by determining the identities of the prototype modules that respond above some threshold. For example, if, say, the cat, the sheep and the cow modules are the only ones that respond, the stimulus is probably a four-legged animal.
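The thresholding step can be sketched in a few lines. The response values, the threshold of 0.5, and the lookup table that turns a set of active modules into a label are all hypothetical; the text does not specify how the set of responding modules is mapped to a category name.

```python
def active_modules(responses, threshold=0.5):
    """Names of the prototype modules whose response exceeds the threshold."""
    return sorted(name for name, r in responses.items() if r > threshold)

# Hypothetical mapping from a set of active prototypes to a superordinate label.
SUPERORDINATE = {
    ("cat", "cow", "sheep"): "four-legged animal",
}

# Hypothetical module responses to one stimulus.
responses = {"cat": 0.8, "sheep": 0.7, "cow": 0.6, "car": 0.1, "chair": 0.2}

active = active_modules(responses)
label = SUPERORDINATE.get(tuple(active), "unknown")
print(active, label)
```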
$$ S_a(A,B) \sim \sum_i P_i(A)\,P_i(B) = \langle P(A), P(B) \rangle \qquad (1) $$
This definition of similarity must, however, be further modified, for at least two reasons. First, Sa is independent of context, whereas perceived similarity depends on the ``contrast set'' against which it is to be judged. Second, Sa is symmetric, whereas human perception of similarity appears to be asymmetric in many cases [Tversky, 1977]. To make Sa depend on the context, one can introduce a vector of weights, one per prototype, such that Wi=Wi(A,B,C,...). Thus, comparing A and B in two contexts, {A, B | C, D, E} and {A, B | F, G, H}, may result in different values of similarity between A and B. To model the asymmetry which frequently arises when subjects are required to estimate the similarity of some stimulus A to another stimulus B, one may observe, following Mumford [1991a], that subjects in this case behave as if they take ``A is similar to B'' to mean ``B is some kind of prototype in a category which includes A. Thus, the stimulus input A being analyzed is treated differently from the memory benchmark B'' [Mumford, 1991a; Medin et al., 1993]. To give B the required distinction, each feature Pi(B) can be weighted in proportion to its long-term saliency sal(Pi, B) in distinguishing between B and the other stimuli.[Note 9] The resulting expression for similarity, which provides for the effects of context and for asymmetry, is
$$ S_a(A,B) \sim \sum_i W_i\,P_i(A)\,\frac{P_i(B)}{\mathrm{sal}(P_i,B)} \qquad (2) $$
Note that this definition has the same form as the additive clustering (ADCLUS) similarity measure of Shepard and Arabie (1979), which, in turn, instantiates Tversky's (1977) discrete contrast model of feature-based similarity. At the same time, it is built on top of a continuous metric representational substrate: the shape space spanned by proximities to prototypes. The degree of compromise between these two approaches to similarity may depend on the demands of the task at hand, via the parameters of equation 2. At one extreme, a Chorus-based system may behave as if it maps the stimuli pertaining to a task into a metric space, with the ensuing symmetric similarity and possible interaction among different dimensions; the other extreme may involve discrete all-or-none features, as in the examples surveyed by Tversky (1977).
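The two similarity expressions can be compared numerically in a short sketch. It follows equations 1 and 2 as printed, up to the unspecified constant of proportionality; the response vectors, weights, and saliency values are made-up numbers chosen only to show that the second expression is asymmetric while the first is not.

```python
def similarity_symmetric(p_a, p_b):
    """Equation 1: inner product of the two prototype-response vectors."""
    return sum(a * b for a, b in zip(p_a, p_b))

def similarity_contextual(p_a, p_b, weights, sal_b):
    """Equation 2 as printed: context weights W_i and saliencies sal(P_i, B).

    p_a is the stimulus being judged; p_b is the memory benchmark.
    """
    return sum(w * a * (b / s) for w, a, b, s in zip(weights, p_a, p_b, sal_b))

# Hypothetical responses of two prototype modules to stimuli A and B.
p_a = (0.9, 0.2)
p_b = (0.6, 0.5)
weights = (1.0, 1.0)   # hypothetical context weights
sal_a = (0.5, 1.0)     # hypothetical saliencies of the features for A
sal_b = (1.0, 0.5)     # hypothetical saliencies of the features for B

sym = similarity_symmetric(p_a, p_b)
a_to_b = similarity_contextual(p_a, p_b, weights, sal_b)  # "A is similar to B"
b_to_a = similarity_contextual(p_b, p_a, weights, sal_a)  # "B is similar to A"
print(sym, a_to_b, b_to_a)
```

Changing `weights` between calls would model a change of contrast set in the same way.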
In principle, even completely novel shapes can be given a structural description, because the extraction of primitives from images and the determination of spatial relationships is supposed to proceed in a purely bottom-up, or image-driven fashion. In practice, however, both these steps have so far proved impossible to automate, for reasons that may be nonaccidental [Edelman and Weinshall, 1997]. The few computer vision systems currently capable of unconstrained recognition from gray-scale images either ignore the challenge posed by the problems of categorization and of representation of novel objects [Murase and Nayar, 1995], or treat categorization as a byproduct of recognition [Mel, 1997].
In comparison to all these approaches, Chorus treats familiar and novel objects equivalently, as points in a shape space spanned by similarities to a handful of reference objects. The viability of this method is attested to by the pilot implementation of Edelman and Duvdevani-Bar (1997a), which achieved recognition performance on par with that of state-of-the-art computer vision systems despite relying only on shape cues where other systems use shape and color or texture or both [Murase and Nayar, 1995; Mel, 1997; Schiele and Crowley, 1996]. This performance was achieved with a low-dimensional representation (ten dimensions, compared to hundreds in other systems), whose extraction from raw images did not require the problematic computation of a structural description. The use of entire reference objects as high-level features suggests a link between Chorus and the studies of similarity and generalization in feature spaces, carried out by Shepard and others.
Shepard's law of generalization can be implemented in a straightforward manner in a connectionist framework, by constructing tuned units that exhibit radially symmetric exponential decay around the location of the preferred stimulus in a feature space [Hanson and Gluck, 1993; Shepard and Kannappan, 1993]. However, it is rather more interesting computationally to note what happens when the radial ``receptive field'' of an exponential-decay unit is turned into an ellipsoidal one by training the unit to ignore changes along some of the feature-space dimensions. In particular, if viewpoint-related changes in the appearance of a 3D shape to which the unit is tuned come to be ignored (e.g., through learning), the unit becomes a device capable of measuring the shape-space distance between the current stimulus and the optimal one. From here, as we saw in section 4, it is just one step to an implementation of the idea of representation by second-order isomorphism; all one need do is have a number of tuned units acting in parallel.
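The difference between a radial and an ellipsoidal exponential-decay unit can be sketched as follows. The two-dimensional feature space (one shape dimension, one view dimension) and the scale values are assumptions for illustration; in particular, the scales are fixed by hand here rather than learned.

```python
import math

def tuned_unit(stimulus, preferred, scales):
    """Exponential decay with distance from the preferred stimulus.

    `scales` stretches each feature dimension: a large scale makes the unit
    nearly insensitive to changes along that dimension.
    """
    d = math.sqrt(sum(((s - p) / c) ** 2
                      for s, p, c in zip(stimulus, preferred, scales)))
    return math.exp(-d)

preferred = (0.0, 0.0)       # (shape, view) of the preferred stimulus
radial = (1.0, 1.0)          # equal sensitivity in both dimensions
ellipsoidal = (1.0, 10.0)    # view dimension largely ignored

shape_change = (1.0, 0.0)    # unit step along the shape dimension
view_change = (0.0, 1.0)     # unit step along the view dimension

radial_shape = tuned_unit(shape_change, preferred, radial)
radial_view = tuned_unit(view_change, preferred, radial)
ellip_shape = tuned_unit(shape_change, preferred, ellipsoidal)
ellip_view = tuned_unit(view_change, preferred, ellipsoidal)
print(radial_shape, radial_view, ellip_shape, ellip_view)
```

The radial unit penalizes both changes equally, whereas the ellipsoidal unit stays close to its peak response under the view change, so its output mainly reflects shape-space distance.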
A computational mechanism that is particularly suitable to implement the tuned units is the regularization network [Poggio and Girosi, 1990]. The simplicity of learning from examples in such networks, and the relatively straightforward way they can be mapped onto the neurobiology of the brain prompted Poggio to revive the old notion of the function of the brain being largely that of a flexible memory, capable of learning from examples and of similarity-based classification (Poggio, 1990; cf. Hebb, 1949; Marr, 1970). It is important to realize, however, that by themselves neither these nor many other learning-based approaches in the literature can solve the problem of representation as posed in the introduction. The reason is that representation is not a problem of associating (whether by learning or otherwise) a proper output with a given input, simply because what counts as ``proper'' differs from task to task (unless the world is represented by its replica, a choice that merely postpones the hard decisions by one stage). Thus, while different views of the same object should clearly be associated with a constant response or mapped into a canonical view [Poggio and Edelman, 1990], there does not seem to be a useful universally valid specification of the proper response to a novel shape, e.g., one that is a parametric blend of two familiar shapes. Consequently, in a representational scheme learning must be augmented by generalization (a process whereby useful responses can be generated for novel stimuli). Thus, Chorus adopts the basic learning strategy by letting units become loosely tuned to certain familiar shape classes (invariantly over dimensions that are irrelevant to shape, such as viewpoint), and it makes the existing tuned units collectively represent novel shapes, in a manner which allows them to be localized in an underlying low-dimensional shape space.
The first problem with the Pandemonium is the choice of all-or-none primitive features, such as edges, corners, etc. This choice, which clearly violates Marr's (1976) principle of least commitment, is likely to lead to the loss of valuable information at an early processing stage; in the framework of section 2, it can be seen to render the distal-to-proximal mapping non-smooth, lessening the likelihood of veridical representation. This situation can be remedied if probabilistic features are used instead. According to the probabilistic approach, sensory coding is ``... the process of preparing a representation of the current sensory scene in a form that enables subsequent learning mechanisms to be versatile and reliable'' [Barlow, 1990; Barlow, 1994]. Specifically, a representation is useful for learning if it includes records of recurring and co-occurring events. In Barlow's Probabilistic Pandemonium, the response strength of a demon would be proportional to -log(P), where P is the probability of occurrence of the feature the demon detects (cf. Intrator and Cooper, 1992).
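The -log(P) coding rule can be made concrete with a short sketch. The feature names, the probabilities, and the `gain` parameter below are hypothetical values chosen for illustration; only the proportionality to -log(P) comes from the text.

```python
import math

def demon_response(p_feature, gain=1.0):
    """Graded response proportional to -log(P) of the detected feature."""
    if not 0.0 < p_feature <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -gain * math.log(p_feature)

# Hypothetical occurrence probabilities of three features.
feature_probs = {"vertical edge": 0.5, "T-junction": 0.05, "closed loop": 0.005}
responses = {name: demon_response(p) for name, p in feature_probs.items()}

for name, r in responses.items():
    print(f"{name}: {r:.3f}")

# For statistically independent features the joint probability is the product
# of the individual ones, so the coded responses simply add.
joint = demon_response(feature_probs["vertical edge"] * feature_probs["T-junction"])
print(joint, responses["vertical edge"] + responses["T-junction"])
```

Under this rule a rare feature evokes a stronger response than a common one, and the response is graded rather than all-or-none.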
The second problem with the Pandemonium lies at the level of decision-making (the ``master demon''), where the stimulus is essentially described by the identity of the strongest-responding ``cognitive demon.'' This winner-take-all decision (another violation of the principle of least commitment) does provide some information about the stimulus (namely, the identity of a reference stimulus to which the current one is the most similar), while discarding much more; the representation it provides only qualifies as nearest-neighbor-preserving, according to the terminology of section 3. Chorus improves on this by retaining the responses of a number of cognitive demons.
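The information lost to a winner-take-all decision can be seen in a small sketch. The reference-shape coordinates, the tuning width, and the helper names are hypothetical; the exponential tuning stands in for whatever graded response the cognitive demons produce.

```python
import math

# Hypothetical reference shapes placed in a 2D shape space.
REFERENCES = {"camel": (0.0, 0.0), "leopard": (4.0, 0.0), "tuna": (0.0, 4.0)}

def module_responses(stimulus, width=3.0):
    """Graded response of every reference module to the stimulus."""
    return {name: math.exp(-math.dist(stimulus, center) / width)
            for name, center in REFERENCES.items()}

def master_demon(responses):
    """Winner-take-all: report only the strongest-responding module."""
    return max(responses, key=responses.get)

stim_a = (1.0, 0.0)  # mostly camel, shifted toward the leopard
stim_b = (0.0, 1.0)  # mostly camel, shifted toward the tuna

resp_a = module_responses(stim_a)
resp_b = module_responses(stim_b)

print(master_demon(resp_a), master_demon(resp_b))  # the same label for both
print(resp_a)
print(resp_b)
```

The two stimuli receive the same winner-take-all label, yet their full response vectors differ, which is the information that a scheme retaining several responses keeps available.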
The notion of ``representation as explanation'' does not contradict the idea that similarities between stimuli are to be represented, although in certain cases, such as scene processing, these two approaches offer largely orthogonal views on the problem of representation. On a conceptual level, the representation of a scene may well be a part of a cognitive schema [Rumelhart, 1980] in which it is embodied, and may therefore be encoded in terms of similarities to related schemata. Perceptually, however, scenes that fit the same schema (e.g., city street) are too diverse for the similarities to be informative, unless the computation of similarity involves explicit alignment of corresponding components [Markman and Gentner, 1993], or ignores shape details altogether. In the latter case, only gross violations of the schema structure, such as the appearance of a sofa levitating above a sidewalk [Biederman et al., 1982], are registered.
With some ingenuity, the theory behind Chorus may actually be interpreted in terms of the idea of representation as explanation. Specifically, the activity of the reference-object modules may be taken to model the probability distribution associated with the structure of the visual stimulus. In the case of single objects, this interpretation does not seem to be too problematic: a stimulus that is attributed both to the camel and the leopard modes in the probability (or explanation) space is simply taken to be a giraffe. In comparison, in the case of scenes (or, more generally, of objects that share common parts, which, in turn, come to be represented independently), an explanation of the stimulus requires an account of the spatial arrangement of the components, and not only of their identities. A natural approach to this problem is suggested by Riesenhuber and Dayan (1997), who propose to combine global configural and local template-like representations in a scheme that is driven by a top-down interpretation process (see also section 9.2).
In addition to dealing with compound objects and scenes, a Chorus-like scheme may benefit from top-down flow of information in deciding which stimuli are to be retained as reference objects, in gathering the statistical salience data for each reference object (section 5), and in control-related chores such as the computation of the target for the next fixation (cf. Koch and Ullman, 1985). By and large, however, Chorus embodies an attempt to find out how far a mostly bottom-up approach to representation can be taken. Perceiving the hidden causes of things is a feat worthy of Sherlock Holmes, and the human visual system seems to be capable of it, given enough time and a challenging task such as separating figure from ground in an underexposed photograph (Mumford, 1994, p.133). In less unique situations, including a variety of controlled experimental conditions, the performance of a perceptual Dr. Watson (``merely'' making sense of the stimulus, as detailed in the next section, instead of accounting for each and every pixel, as expected from a Holmes) seems to be a goal both worthy of pursuit and more readily attainable.
In the domain of shape perception, MDS has been applied in the analysis of perceived similarities among relatively simple 2D figures (rectangles, random irregular polygons), but the most spectacular results have been achieved in two studies that involved more complex shapes. In the first of these studies, subjects were requested to judge (from memory) the pairwise shape similarity of 15 of the US states [Shepard and Chipman, 1970]. The 2D configurations obtained by MDS were surprisingly consistent across subjects, and also made sense geometrically (i.e., states of similar elongation and shape were grouped together). Shepard and Chipman point out that the findings of (1) very much the same configuration whether the states were pictorially displayed or only imagined, along with (2) the relationship, in both cases, between the recovered configuration and the actual cartographic shapes, support the idea of a second-order isomorphism between internal representations and their corresponding external objects.
In the second study, the stimuli (2D closed contours) were created parametrically in such a way that the set of shapes formed a toroidal configuration in the parameter space [Shepard and Cermak, 1973]. The perceived similarities paralleled closely the parameter-space distances among the stimuli. Shepard and Cermak also report some interesting patterns of clustering that subjects imposed on the stimuli when prompted to consider possible categorical labels (such as ``fish'' or ``jet plane'') that could be applied to the (originally unmarked) 2D contours; these findings support the assertion, made in section 2.1, that a metric-space representation of similarity does not contradict the possibility of category-related effects, and, in fact, can provide the requisite substrate for the emergence of those effects.
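One simple way to quantify how closely perceived similarities parallel parameter-space distances is a rank correlation between the two sets of pairwise values. The sketch below is not the analysis used in the cited studies (which relied on MDS); the stimulus coordinates are hypothetical, and the "perceived" dissimilarities are a stand-in generated by a monotone saturating function of distance.

```python
import math
from itertools import combinations

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical stimuli in a 2D parameter space.
params = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 1.0)]
distal = [math.dist(p, q) for p, q in combinations(params, 2)]

# Stand-in for perceived dissimilarity: monotone but nonlinear in distance.
perceived = [1.0 - math.exp(-d) for d in distal]

rho = spearman(distal, perceived)
print(rho)
```

Because only the rank order of the pairwise values enters the computation, any monotone distortion of the distances leaves the correlation at its maximum; this is the sense in which similarity relations, rather than the shapes themselves, are preserved.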
The veridicality of representation of parametrically defined 3D shapes in human subjects has been tested in two recent studies [Edelman, 1995a; Cutzu and Edelman, 1996]. In each of a series of experiments, which involved pairwise similarity judgment, delayed matching to sample, and long-term memory recall, subjects were confronted with several classes of computer-rendered 3D animal-like shapes, arranged in a complex pattern in a common parameter space. Response time and error rate data were combined into a measure of perceived pairwise shape similarities, and the object-to-object proximity matrix was submitted to nonmetric MDS. In the resulting solution, the relative geometrical arrangement of the points corresponding to the different objects invariably reflected the complex low-dimensional structure in parameter space that defined the relationships between the stimulus classes (see Figures 6a, 6b, and 6c).[Note 11]
The ability of the subjects to represent the low-dimensional pattern of similarities among stimuli did not extend to nonsense objects, as indicated by the results of control experiments involving ``scrambled'' shapes [Cutzu and Edelman, 1996]. The stimuli in these experiments were obtained by translating the parts of the animal-like shapes to a common center, resulting in star-like nonsense objects. For these objects, the similarity between true and MDS-recovered configurations was consistently lower than for animal-like shapes.
Computer simulations showed that the recovery of the low-dimensional structure from image-space distances between the stimuli was impossible, as expected. In comparison, the psychophysical results were fully replicated by a Chorus-like model, patterned after a higher stage of object processing, in which nearly viewpoint-invariant representations of familiar object classes (but, presumably, not of nonsense objects as in the control experiments; cf. Bülthoff and Edelman, 1992) are available; a rough analogy is to the inferotemporal visual area IT; e.g., see [Young and Yamane, 1992; Tanaka, 1993; Logothetis et al., 1995]. As pointed out in section 4, such a representation of a 3D object can easily be formed, if several views of the object are available, by training a mechanism such as a radial basis function network to interpolate a characteristic function for the object in the space of all views of all objects [Poggio and Edelman, 1990]. A number of reference objects (in Figure 6a, the corners of the parameter-space CROSS) were chosen, and a separate RBF network was trained to recognize each such object (i.e., to output a constant value for any of its views, encoded by the activities of the underlying receptive field layer; cf. Figure 4). At the RBF level, the similarity between two stimuli was defined as the cosine of the angle between the vectors of outputs they evoked in the RBF modules trained on the reference objects (equation 1). The MDS-derived configurations obtained with this model showed significant resemblance to the true parameter-space configurations (see Figure 6c).
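The cosine measure of similarity between module-output vectors can be sketched as follows. The three-element response vectors are hypothetical values, not outputs of the trained RBF modules used in the simulations.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of module outputs."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for an all-zero response vector")
    return dot / (norm_u * norm_v)

# Hypothetical outputs of three reference-object modules for three stimuli.
stim_a = (0.9, 0.4, 0.1)
stim_b = (0.8, 0.5, 0.2)   # activates the modules in nearly the same proportions
stim_c = (0.1, 0.3, 0.9)   # activates mostly the third module

sim_ab = cosine_similarity(stim_a, stim_b)
sim_ac = cosine_similarity(stim_a, stim_c)
print(sim_ab, sim_ac)
```

Since the cosine depends only on the direction of the response vector, scaling all module outputs by a common positive factor leaves the similarity unchanged.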
Although reports of cells in the monkey inferotemporal cortex that respond preferentially to faces by now span decades [Gross et al., 1972; Perrett et al., 1989], cells tuned to general objects have only been found recently. In particular, Tanaka and his group reported the desired selectivity for specific (mostly 2D) objects in recordings from the inferotemporal (IT) cortex of anesthetized monkeys [Tanaka et al., 1991; Fujita et al., 1992; Kobatake and Tanaka, 1994; Tanaka, 1992; Tanaka, 1993]. The interpretation of such findings has traditionally been hampered by the unknown nature of the optimal stimuli for the discovered cells: if a cell responds as vigorously to a brush as to a face, it cannot be properly considered a face detector. Rather than attempting the impossible (i.e., ruling out all the stimuli that the cell does not like), Tanaka developed an ingenious method for narrowing down the range of features that are both present in a given stimulus and effective in eliciting a response from the cell. This method has yielded the first evidence of the parallel between the functional organization of the IT cortex, where cells responding to similar shapes are arranged in columns running perpendicular to the cortical surface, and the primary visual cortex, where the columnar structure reflects orientation selectivity and ocular dominance.
Although the columnar organization of the IT cortex has been interpreted in terms of an alphabet of ``elementary'' features, it seems to be equally compatible with the notion that entire objects are represented, as called for by the Chorus model [Tanaka, 1993]. Under this interpretation, the several hundred columns that can be squeezed into the available cortical area correspond to so many classes of ``reference'' stimuli. If the tuning properties of the columns are such that any stimulus likely to be encountered activates a number (say, three or four) of columns, the entire system should have a considerable representational power. Moreover, this power would grow if the system were plastic enough to attune itself to novel object classes, as may indeed be the case [Rolls et al., 1989; Kobatake et al., 1992].
More recent data support this interpretation of Tanaka's findings: working with awake monkeys, Logothetis, Pauls and Poggio (1995) reported recordings from cells tuned to specific views of 3D objects (other than faces) on which the monkey had been trained. A small proportion of the object-tuned cells found by Logothetis et al. each responded to a limited subset of the objects, irrespective of view. Together with the previous reports of a hierarchical two-stage approach to (relative) invariance in the face cells [Perrett et al., 1989], these findings suggest that a cell that responds to a certain shape nearly independently of viewpoint (corresponding to a ``prototype'' cell in Chorus) may do so by integrating the responses of several cells each of which prefers another view of the same shape, as suggested in section 4 [Poggio and Edelman, 1990; Edelman and Weinshall, 1991].
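The two-stage route to view tolerance can be illustrated with a minimal sketch in which an object-level unit pools the outputs of several view-tuned units. The Gaussian tuning over a single viewing angle, the tuning width, the set of stored views, and the use of a plain sum for pooling are all assumptions made for illustration.

```python
import math

def view_tuned(view_angle, preferred_angle, width=30.0):
    """Gaussian tuning over viewing angle (degrees), wrapping around 360."""
    diff = abs(view_angle - preferred_angle) % 360.0
    diff = min(diff, 360.0 - diff)
    return math.exp(-((diff / width) ** 2))

def object_unit(view_angle, stored_views):
    """Pool (here: sum) the view-tuned units that prefer different views."""
    return sum(view_tuned(view_angle, v) for v in stored_views)

stored_views = [0.0, 45.0, 90.0, 135.0, 180.0, 225.0, 270.0, 315.0]
test_angles = [float(a) for a in range(0, 360, 5)]

single = [view_tuned(a, 0.0) for a in test_angles]
pooled = [object_unit(a, stored_views) for a in test_angles]

print(min(single) / max(single))  # close to 0: strongly view-dependent
print(min(pooled) / max(pooled))  # close to 1: nearly view-independent
```

Each view-tuned unit responds only near its preferred view, whereas the pooled response stays nearly constant across all tested views of the same shape.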
None of the above experiments involved parametric manipulation of the stimulus shape --- a crucial component in testing the predictions of the theory of representation proposed here. In another study, where such manipulation was attempted, the stimuli were complex parametrically defined periodic 2D patterns (Sakai, Naya and Miyashita, 1994). In that study, the cellular response was found to decrease monotonically with parameter-space distance between the test stimulus and the preferred pattern to which the cells were tuned. With parametrically controlled 3D stimuli, it should be possible to look for cells that behave similarly to the RBF module whose response is illustrated in Figure 4. The specific predictions are as follows:
This observation is both true and misleading. Stressing the influence of the viewing conditions on the appearance of objects tacitly assumes that it is the exact shape of the object that a representational system should attempt to recover. However, as students of categorization know well, an intelligent agent is much better off representing an object on a number of hierarchical levels of abstraction (with the option of attending to high-resolution details, if the object happens to be present in front of the observer, and if the task demands it), than storing a high-resolution replica of the object, and facing the problem of separating the chaff (pixel-level information) from the wheat (classification information) every time a new instance of that same object class is encountered.
When considered with the goal of proper representation of similarity in mind, the problem of variability of object appearance assumes a somewhat different aspect. At the computational level, instead of seeking absolute invariance with respect to the extraneous view-related parameters, a system can settle for mere tolerance, as determined by the interplay of within- and between-category similarities. At the implementational level, the availability of learning modules that can be trained to compensate for the variability in object appearance shifts the focus from the easier problems in vision (of which invariance seems to be an example) to the more challenging ones, such as making sense of objects not previously seen. The Chorus scheme, built around a theory of representation of similarity, and implemented by a bank of trainable modules tuned to reference objects, embodies both the computational and the implementational-level lessons stated above.
According to the reasoning of Dayan et al., complex underconstrained perceptual tasks require intimate cooperation between bottom-up, or data-driven processes, and top-down, or expectation-driven ones. Their arguments resemble those of other proponents of the Helmholtzian strategy, mentioned briefly in section 6.4, and are related to Grenander's notion of Pattern Theory (opposed to and complementing mere pattern recognition), as recently advocated by Mumford (1994). Returning to the example of Dali's painting, one can observe that people are aware not only of the clocks that appear in it, but also of their twisted and bent shapes. Indeed, making sense of this painting may require knowledge of the possibility of objects bending without losing their identity.[Note 12] The extension of the Chorus framework to deal with this and similar cases will have to await future work; one possible direction that such a development could take would be based on the ideas of class-based processing [Moses et al., 1996; Lando and Edelman, 1995], and of example-directed metamorphosis [Beymer and Poggio, 1996].
1. Objection. ``Knowledge placed in our ideas may be all unreal or chimerical.'' ... If our knowledge of our ideas terminate in them, and reach no further, where there is something further intended, our most serious thoughts will be of little more use than the reveries of a crazy brain ...
2. Answer: ``Not so, where ideas agree with things.''
[Locke, 1690, Book IV (Of Knowledge and Probability), Chapter IV.]
The principle on which Locke based his answer to the grounding problem is that of ``conformity,'' postulated to prevail between the representations and their objects. As is well known, Locke distinguished between simple and complex ideas, each kind with its own grounds for conformity. Consider first the former, somewhat less problematic, kind. The argument here was that ``the idea of whiteness, or bitterness, as it is in the mind, exactly answering that power which is in any body to produce it there, has all the real conformity it can or ought to have, with things without us. And this conformity between our simple ideas and the existence of things, is sufficient for real knowledge.'' (Locke 1690, Book IV, Chapter IV, 4). In terms of feature detectors, this is a statement of belief in the availability of reliable detectors for immediate perceptual qualities.
The finding of cells tuned to well-defined features such as patterns of motion [Movshon et al., 1985; Newsome and Pare, 1988], 2D shapes [Tanaka et al., 1991; Kobatake and Tanaka, 1994], or faces [Gross et al., 1972; Perrett et al., 1982] supports this part of Lockean doctrine, and, in fact, suggests that it may be extended from ``simple'' features to entire objects. The impact of this evidence seems to have been limited by a persistent concern that the feature detectors do not ``really'' detect the features they happen to be tuned to [Dretske, 1981; Fodor, 1987; Cummins, 1989].[Note 13]
Nevertheless, it has been suggested [Albright, 1991] that philosophical worries regarding the possibility of Lockean conformity in the functioning of feature detectors found in the brain should be quelled to some extent by the successful manipulation of the organism's perception of a feature through the injection of current in the vicinity of the appropriate detector pool in the cortex [Salzman et al., 1990].
More important, in the light of the possibility of veridical representation of distal changes by proximal ones, as in Shepard's (1968) theory of second-order isomorphism, the philosophical lure of settling the question regarding what this or that individual feature detector ``really'' detects is significantly reduced. Moreover, the problematic distinction between simple and complex ideas suggested by Locke can be given up: in Chorus, the ``feature detectors'' can be tuned to arbitrarily complex objects, yet serve as primitives just as learnable[Note 14] and as immediately perceivable as Locke's simple ideas. At the same time, if second-order isomorphism can be made to work, Locke's ``conformity'' acquires a new concrete meaning: the order and the connection of ideas is identical to the order and the connection of things.[Note 15]
On Fodor's (Rationalist) interpretation of Empiricism, a system equipped with, say, three object-specific modules, tuned to the shapes of a tuna, a cow, and a car, has only three (indivisible) visual concepts: tuna, cow, and car. In fact, however, such a system turns out to be capable of representing a variety of other shapes, some of which are quite unlike the shapes for which dedicated modules are available (cf. Figure 7). Here and elsewhere in cognitive modeling, the logicist approach insists on indivisible primitives and logical connectives, effectively forcing a violation of the principle of least commitment. As a result, logicists cannot but predict a representational capacity that falls far short of the empiricist predictions based on coarse coding, which, in this example, means falling short of the experimental observations. In contrast, if the stimulus is compared simultaneously to a number of graded prototypes, instead of being subjected to a Pandemonium-like all-or-none logical/syntactic analysis, the productivity problem vanishes, along with the premise for Fodor's argument.
Indeed, the idea of second-order isomorphism places the burden of representation where it belongs --- in the world. In Chorus, the ensemble of feature detectors responds (J. J. Gibson would say, resonates) to the environment (while extracting task-specific information), without reconstructing it internally. By merely mirroring proximally the similarity structure of a distal shape space, Chorus embodies the ideas of those philosophers who argued that ``meaning ain't in the head'' [Putnam, 1988] and that ``cognitive systems are largely in the world'' [Millikan, 1995], circumvents the severe difficulties encountered by the reconstructionist approaches in computer vision, and may explain the impressive performance of biological visual systems, which, in any case, appear to be too sloppy to do a good job of reconstructing the world geometrically [O'Regan, 1992]. Thus, in an important sense, Chorus lets the world be its own representation.
A partial solution to this problem is suggested by the realization that the apparent richness of the perceived world is, to a considerable extent, apparent [Dennett, 1991]. The source of this illusion may lie in the immediate availability of the information in the world, which acts as an ``external store'' [O'Regan, 1992].[Note 16] A growing number of psychophysical experiments supports this view [Pollatsek et al., 1984; O'Regan, 1992; Blackmore et al., 1995; Rensink et al., 1995; Grimes, 1995]. In these experiments, subjects are typically found to be unaware of moderate, or, at times, major changes in the visual stimulus during the ``blanking'' period associated with a saccade, or induced artificially, by presenting two stimulus frames in succession, with a short-duration gray-field mask interposed between them. For example, changes such as the disappearance (or the appearance) of pieces of furniture in a room scene, or the sudden growth (by a significant fraction) of the tallest building in a city skyline scene may go unnoticed. This suggests that under normal viewing conditions (i.e., without scrutiny) much less information than previously assumed is taken away from each scene.[Note 17]
While Dennett's insights do reduce the acuteness of the qualia problem to a degree, they do not appear to be able to do away with it. In particular, we are still left with the need to explain why and how a tomato looks round and red to the observer who represents directly only the differences between tomatoes and, say, pears and oranges (as opposed to the shape and the color of the tomato). An explanation here may however be less elusive than commonly thought: an accomplished account of qualia in psychophysiological terms has been formulated recently around the notion of a quality space (analogous to the shape spaces discussed earlier in this paper), reconstructed from an observer's responses using multidimensional scaling [Clark, 1993]. Adding to the thoughts of Carnap and Goodman a great deal of data from psychology and physiology, Clark shows that, in principle, it is not impossible to characterize a perceptual experience in objective terms, starting from relative similarity defined over tuples of objects --- the very notion that constitutes the foundation of the second-order isomorphism theory (see appendix D).
To conclude, let us return to the Riddle of Representation, as posed in the introduction: by virtue of what does the representational state of a human observer seeing a cat on a mat refer to that cat [Cummins, 1989]? A slightly different formulation of this riddle --- what is common to two humans, a robot, and a Martian, who all see a cat on a mat? --- may actually point towards a solution: it seems likely that the only thing that can be common to these four representational systems is the cat itself, sitting ``out there'' on the mat. One way to implement the idea of the world as its own representation is by constructing a system that has at its disposal tunable modules which can be trained to respond to cats or dogs or any other object. Such a system will be representing a cat when it sees one (by virtue of firing of the appropriate modules), and will also be able to dream of a cat or imagine one (if the modules are made to fire in the absence of an immediate sensory stimulation). Moreover, if a selection of modules (not more than a few hundred), each tuned to a different class of stimuli, is available, the system should also be able to represent (through the response of a small subset of the modules at a time) many more stimuli in addition to those actually stored in memory.
Aloimonos, J. Y. (1990). Purposive and qualitative vision. In Proc. AAAI-90 Workshop on Qualitative Vision, pages 1--5, San Mateo, CA. Morgan Kaufmann.
Anderson, C. H. and Van Essen, D. C. (1987). Shifter circuits: a computational strategy for dynamic aspects of visual processing. Proceedings of the National Academy of Sciences, 84:6297--6301.
Bajcsy, R. (1988). Active perception. Proc. IEEE, 76(8):996--1005. Special issue on Computer Vision.
Barlow, H. B. (1979). The past, present and future of feature detectors. In Albrecht, D., editor, Recognition of Pattern and Form, volume 44 of Lecture Notes in Biomathematics, pages 4--32. Springer, Berlin.
Barlow, H. B. (1990). Conditions for versatile learning, Helmholtz's unconscious inference, and the task of perception. Vision Research, 30:1561--1571.
Barlow, H. B. (1994). What is the computational goal of the neocortex? In Koch, C. and Davis, J. L., editors, Large-scale neuronal theories of the brain, chapter 1, pages 1--22. MIT Press, Cambridge, MA.
Bartlett, F. C. (1932). Remembering: An Experimental and Social Study. Cambridge University Press, Cambridge.
Baxter, J. (1995). The canonical metric for vector quantization. NeuroCOLT NC-TR-95-047, University of London.
Berkeley, G. (1710/1996). A treatise concerning the principles of human knowledge. Oxford University Press, Oxford.
Beymer, D. and Poggio, T. (1996). Image representations for visual learning. Science, 272:1905--1909.
Biederman, I. (1987). Recognition by components: a theory of human image understanding. Psychol. Review, 94:115--147.
Biederman, I., Mezzanotte, R. J., and Rabinowitz, J. C. (1982). Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 14:143--177.
Biederman, I., Rabinowitz, J. C., Glass, A. L., and Stacy, E. W. (1974). On the information extracted from a glance at a scene. Journal of Exp. Psychol, 103:597--600.
Bienenstock, E. and Geman, S. (1995). Compositionality in neural systems. In Arbib, M. A., editor, The handbook of brain theory and neural networks, pages 223--226. MIT Press.
Blackmore, S. J., Brelstaff, G., Nelson, K., and Troscianko, T. (1995). Is the richness of our visual world an illusion? Transsaccadic memory for complex scenes. Perception, 24:1075--1081.
Bookstein, F. L. (1991). Morphometric tools for landmark data: geometry and biology. Cambridge Univ. Press, New York.
Borg, I. and Lingoes, J. (1987). Multidimensional Similarity Structure Analysis. Springer, Berlin.
Bourgain, J. (1985). On Lipschitz embedding of finite metric spaces in Hilbert space. Israel J. Math., 52:46--52.
Brigham, J. C. (1986). The influence of race on face recognition. In Ellis, H. D., Jeeves, M. A., and Newcombe, F., editors, Aspects of face processing, pages 170--177. Martinus Nijhoff, Dordrecht.
Bülthoff, H. H. and Edelman, S. (1992). Psychophysical support for a 2-D view interpolation theory of object recognition. Proceedings of the National Academy of Sciences, 89:60--64.
Carne, T. K. (1990). The geometry of shape spaces. Proc. Lond. Math. Soc., 61:407--432.
Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds, J. H., and Rosen, D. B. (1992). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Trans. on Neural Networks, 3:698--713.
Carpenter, G. A., Grossberg, S., and Rosen, D. B. (1991). Fuzzy ART: An adaptive resonance algorithm for rapid stable classification of analog patterns. In Proc. Intl. Joint Conf. on Neural Networks, pages 411--416.
Cavanagh, P. (1995). Vision is getting easier every day. Perception, 24:1227--1232. guest editorial.
Clark, A. (1993). Sensory qualities. Clarendon Press, Oxford.
Cohn, H. (1967). Conformal mappings on Riemann surfaces. McGraw-Hill, New York.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20:273--297.
Cortese, J. M. and Dyre, B. P. (1996). Perceptual similarity of shapes generated from Fourier Descriptors. Journal of Experimental Psychology: Human Perception and Performance, 22:133--143.
Cummins, R. (1989). Meaning and mental representation. MIT Press, Cambridge, MA.
Cummins, R. (1996). Representations, Targets, and Attitudes. MIT Press, Cambridge, MA.
Cutzu, F. and Edelman, S. (1996). Faithful representation of similarities among three-dimensional shapes in human vision. Proceedings of the National Academy of Sciences, 93:12046--12050.
Cutzu, F. and Edelman, S. (1997). Representation of object similarity in human vision: psychophysics and a computational model. Vision Research. in press.
Dayan, P., Hinton, G. E., and Neal, R. M. (1995). The Helmholtz Machine. Neural Computation, 7:889--904.
Dennett, D. C. (1991). Consciousness explained. Little, Brown & Company, Boston, MA.
Dretske, F. (1981). Knowledge and the flow of information. MIT Press, Cambridge, MA.
Edelman, S. (1994). Biological constraints and the representation of structure in vision and language. Psycoloquy, 5(57).
Edelman, S. (1995a). Representation of similarity in 3D object discrimination. Neural Computation, 7:407--422.
Edelman, S. (1995b). Representation, Similarity, and the Chorus of Prototypes. Minds and Machines, 5:45--68.
Edelman, S. (1997). Vision reanimated. In Aloimonos, Y., Carlsson, S., and Eklundh, J.-O., editors, Proc. 7th Rosenön Workshop on Computer Vision. L. Erlbaum, Hillsdale, NJ. forthcoming.
Edelman, S., Bülthoff, H. H., and Bülthoff, I. (1996). Features of the representation space for 3D objects. MPIK-TR 40, Max Planck Institute for Biological Cybernetics.
Edelman, S. and Duvdevani-Bar, S. (1997a). A model of visual recognition and categorization. Phil. Trans. R. Soc. Lond. (B), 352:--. to appear.
Edelman, S. and Duvdevani-Bar, S. (1997b). Similarity, connectionism, and the problem of representation in vision. Neural Computation, 9:701--720.
Edelman, S. and Intrator, N. (1997). Learning as extraction of low-dimensional representations. In Medin, D., Goldstone, R., and Schyns, P., editors, Mechanisms of Perceptual Learning. Academic Press. in press.
Edelman, S. and Weinshall, D. (1991). A self-organizing multiple-view representation of 3D objects. Biological Cybernetics, 64:209--219.
Edelman, S. and Weinshall, D. (1997). Computational approaches to shape constancy. In Walsh, V. and Kulikowski, J., editors, Perceptual constancies: why things look as they do. Cambridge University Press, Cambridge, UK. in press.
Efron, B. and Tibshirani, R. (1993). An introduction to the bootstrap. Chapman and Hall, London.
Ekman, G. and Lindman, R. (1961). Multidimensional ratio scaling and multidimensional similarity. Reports from the Psychological Laboratories 103, University of Stockholm.
Fodor, J. A. (1981). RePresentations. MIT Press, Cambridge, MA.
Fodor, J. A. (1987). Psychosemantics. MIT Press, Cambridge, MA.
Fujita, I., Tanaka, K., Ito, M., and Cheng, K. (1992). Columns for visual features of objects in monkey inferotemporal cortex. Nature, 360:343--346.
Galin, E. and Akkouche, S. (1996). Métamorphose d'objets tridimensionnels: quelques méthodes d'accélération. Revue Techniques et Sciences Informatiques, 15:329--350.
Gallistel, C. R. (1990). The organization of learning. MIT Press, Cambridge, MA.
Garbin, C. P. (1990). Visual-touch perceptual equivalence for shape information in children and adults. Perception and Psychophysics, 48:271--279.
Gibson, J. J. (1966). The senses considered as perceptual systems. Houghton Mifflin, Boston, MA.
Goldstone, R. L. (1994). The role of similarity in categorization: providing a groundwork. Cognition, 52:125--157.
Goodman, N. (1977). The structure of appearance. Reidel, Dordrecht.
Gregory, R. L. (1978). Illusions and hallucinations. In Carterette, E. C. and Friedman, M. P., editors, Handbook of Perception, volume IX, pages 337--357. Academic Press, New York, NY.
Gregson, R. A. M. (1975). Psychometrics of similarity. Academic Press, New York.
Gregson, R. A. M. (1988). Nonlinear psychophysical dynamics. Erlbaum, Hillsdale, NJ.
Gregson, R. A. M. and Britton, L. A. (1990). The size-weight illusion in 2D nonlinear psychophysics. Perception and Psychophysics, 48:343--356.
Grimes, J. (1995). On the failure to detect changes in scenes across saccades. In Akins, K., editor, Perception, volume 5 of Vancouver Studies in Cognitive Science, chapter 4. Oxford University Press, New York.
Gross, C. G., Rocha-Miranda, C. E., and Bender, D. B. (1972). Visual properties of cells in inferotemporal cortex of the macaque. J. Neurophysiol., 35:96--111.
Hanson, S. J. and Gluck, M. A. (1993). Spherical units as dynamic consequential regions: implications for attention, competition and categorization. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 656--664. Morgan Kaufmann.
Harnad, S., editor (1987). Categorical Perception: The Groundwork of Cognition. Cambridge University Press, New York.
Harnad, S. (1990). The symbol grounding problem. Physica D, 42:335--346.
Hebb, D. O. (1949). The organization of behavior. Wiley.
Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268:1158--1161.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. (1986). Induction: processes of inference, learning, and discovery. MIT Press, Cambridge, MA.
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurons in the cat's striate cortex. J. Physiol., 148:574--591.
Intrator, N. (1993). Combining Exploratory Projection Pursuit and Projection Pursuit Regression. Neural Computation, 5:443--455.
Intrator, N. and Cooper, L. N. (1992). Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5:3--17.
Jolicoeur, P. and Humphrey, G. K. (1997). Perception of rotated two-dimensional and three-dimensional objects and visual shapes. In Walsh, V. and Kulikowski, J., editors, Perceptual constancies, chapter 10. Cambridge University Press, Cambridge, UK. in press.
Kendall, D. G. (1984). Shape manifolds, Procrustean metrics and complex projective spaces. Bull. Lond. Math. Soc., 16:81--121.
Kendall, D. G. (1989). A survey of the statistical theory of shape. Statistical Science, 4:87--120.
Kobatake, E. and Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. J. Neurophysiol., 71:2269--2280.
Kobatake, E., Tanaka, K., and Tamori, Y. (1992). Long-term learning changes the stimulus selectivity of cells in the inferotemporal cortex of adult monkeys. Neuroscience Research, S17:237.
Koch, C. and Ullman, S. (1985). Selecting one among the many: a simple network implementing shifts in selective visual attention. Human Neurobiology, 4:219--227.
Koenderink, J. J., van Doorn, A. J., and Kappers, A. M. L. (1996). Pictorial surface attitude and local depth comparisons. Perception and Psychophysics, 58:163--173.
Koriat, A. and Goldsmith, M. (1995). Memory metaphors and the laboratory/real-life controversy: correspondence versus storehouse views of memory. Behavioral and Brain Sciences. in press.
Krumhansl, C. L. (1978). Concerning the applicability of geometric models to similarity data: the interrelationship between similarity and spatial density. Psychological Review, 85:445--463.
Krushkal', S. L. (1979). Quasiconformal mappings and Riemann surfaces. Wiley, New York.
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1--27.
Kruskal, J. B. and Wish, M. (1978). Multidimensional Scaling. Sage Publications, Beverly Hills, CA.
Landau, B., Smith, L. B., and Jones, S. (1988). The importance of shape in early lexical learning. Cognitive Development, 3:299--321.
Lando, M. and Edelman, S. (1995). Receptive field spaces and class-based generalization from a single view in face recognition. Network, 6:551--576.
Le, H. and Kendall, D. G. (1993). The Riemannian structure of Euclidean shape spaces: a novel environment for statistics. The Annals of Statistics, 21:1221--1271.
Lettvin, J. Y., Maturana, H. R., McCulloch, W. S., and Pitts, W. H. (1959). What the frog's eye tells the frog's brain. Proc. IRE, 47:1940--1959.
Lindsay, P. H. and Norman, D. A. (1977). Human information processing: an introduction to psychology. Academic Press, New York.
Linial, N., London, E., and Rabinovich, Y. (1994). The geometry of graphs and some of its algorithmic applications. FOCS, 35:577--591.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285--318.
Locke, J. (1690/1994). An essay concerning human understanding. Modern Library, New York.
Logothetis, N. K., Pauls, J., and Poggio, T. (1995). Shape recognition in the inferior temporal cortex of monkeys. Current Biology, 5:552--563.
Maffei, L. (1978). Spatial frequency channels: neural mechanisms. In Held, R., Leibowitz, H. W., and Teuber, H.-L., editors, Handbook of sensory physiology: Perception, chapter 2, pages 39--68. Springer-Verlag, Berlin.
Markman, A. and Gentner, D. (1993). Structural alignment during similarity comparisons. Cognitive Psychology, 25:431--467.
Marr, D. (1970). A theory for cerebral neocortex. Proceedings of the Royal Society of London B, 176:161--234.
Marr, D. (1976). Early processing of visual information. Phil. Trans. R. Soc. Lond. B, 275:483--524.
Marr, D. (1982). Vision. W. H. Freeman, San Francisco, CA.
Marr, D. and Nishihara, H. K. (1978). Representation and recognition of the spatial organization of three dimensional structure. Proceedings of the Royal Society of London B, 200:269--294.
Medin, D. L., Goldstone, R. L., and Gentner, D. (1993). Respects for similarity. Psychological Review, 100:254--278.
Mel, B. (1997). SEEMORE: Combining color, shape, and texture histogramming in a neurally-inspired approach to visual object recognition. Neural Computation, 9:777--804.
Millikan, R. (1995). White Queen Psychology and other essays for Alice. MIT Press, Cambridge, MA.
Moses, Y., Ullman, S., and Edelman, S. (1996). Generalization to novel images in upright and inverted faces. Perception, 25:443--462.
Movshon, J. A., Adelson, E. H., Gizzi, M. S., and Newsome, W. T. (1985). The analysis of moving visual patterns. In Chagas, C., Gattas, R., and Gross, C. G., editors, Pattern Recognition Mechanisms. Vatican Press, Rome.
Mumford, D. (1991a). Mathematical theories of shape: do they model perception? In Geometric methods in computer vision, volume 1570, pages 2--10, Bellingham, WA. SPIE.
Mumford, D. (1991b). On the computational architecture of the neocortex. I. The role of the thalamo-cortical loop. Biological Cybernetics, 65:135--145.
Mumford, D. (1992). On the computational architecture of the neocortex. II. The role of the cortico-cortical loops. Biological Cybernetics, 66:241--251.
Mumford, D. (1994). Neuronal architectures for pattern-theoretic problems. In Koch, C. and Davis, J. L., editors, Large-scale neuronal theories of the brain, chapter 7, pages 125--152. MIT Press, Cambridge, MA.
Murase, H. and Nayar, S. (1995). Visual learning and recognition of 3D objects from appearance. International Journal of Computer Vision, 14:5--24.
Newsome, W. T. and Pare, E. B. (1988). A selective impairment of motion perception following lesions of the middle temporal visual area (MT). J. Neurosci., 8:2201--2211.
Nosofsky, R. M. (1988). Exemplar-based accounts of relations between classification, recognition, and typicality. Journal of Experimental Psychology: Learning, Memory and Cognition, 14:700--708.
Nosofsky, R. M. (1991). Stimulus bias, asymmetric similarity, and classification. Cognitive Psychology, 23:94--140.
Nosofsky, R. M. (1992). Similarity scaling and cognitive process models. Annual Review of Psychology, 43:25--53.
O'Regan, J. K. (1992). Solving the real mysteries of visual perception: The world as an outside memory. Canadian J. of Psychology, 46:461--488.
Palmer, S. E. (1978). Fundamental aspects of cognitive representation. In Rosch, E. and Lloyd, B. B., editors, Cognition and Categorization, pages 259--303. Erlbaum, Hillsdale, NJ.
Pentland, A. and Sclaroff, S. (1991). Closed--form solutions for physically based shape modeling and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:715--729.
Perrett, D. I., Mistlin, A. J., and Chitty, A. J. (1989). Visual neurones responsive to faces. Trends in Neurosciences, 10:358--364.
Perrett, D. I., Rolls, E. T., and Caan, W. (1982). Visual neurones responsive to faces in the monkey temporal cortex. Exp. Brain Res., 47:329--342.
Phillips, F. and Todd, J. T. (1996). Perception of local three-dimensional shape. J. Exp. Psychol.: HPP, 22:930--944.
Poggio, T. (1990). A theory of how the brain might work. Cold Spring Harbor Symposia on Quantitative Biology, LV:899--910.
Poggio, T. and Edelman, S. (1990). A network that learns to recognize three-dimensional objects. Nature, 343:263--266.
Poggio, T., Fahle, M., and Edelman, S. (1992). Fast perceptual learning in visual hyperacuity. Science, 256:1018--1021.
Poggio, T. and Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978--982.
Pollatsek, A., Rayner, K., and Collins, W. E. (1984). Integrating pictorial information across eye movements. J. Exp. Psychol.: General, 113:426--442.
Putnam, H. (1988). Representation and reality. MIT Press, Cambridge, MA.
Quine, W. V. O. (1969). Natural kinds. In Ontological relativity and other essays, pages 114--138. Columbia University Press, New York, NY.
Rensink, R., O'Regan, K., and Clark, J. J. (1995). Image flicker is as good as saccades in making large scene changes invisible. Perception, 24 (suppl.):26--27.
Reshetnyak, Y. G. (1989). Space mappings with bounded distortion, volume 73 of Translations of mathematical monographs. Amer. Math. Soc., Providence, RI.
Riesenhuber, M. and Dayan, P. (1997). Neural models for the part-whole hierarchies. In Jordan, M., editor, Advances in Neural Information Processing 9, pages --. MIT Press. in press.
Rolls, E. T., Baylis, G. C., Hasselmo, M. E., and Nalwa, V. (1989). The effect of learning on the face selective responses of neurons in the cortex in the superior temporal sulcus of the monkey. Exp. Brain Res., 76:153--164.
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., and Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8:382--439.
Rumelhart, D. E. (1980). Schemata: The building blocks of cognition. In Spiro, R. J., Bruce, B., and Brewer, W. F., editors, Theoretical Issues in Reading and Comprehension. Erlbaum, Hillsdale, NJ.
Sakai, K., Naya, Y., and Miyashita, Y. (1994). Neuronal tuning and associative mechanisms in form representation. Learning and Memory, 1:83--105.
Salzman, C. D., Britten, K. H., and Newsome, W. T. (1990). Cortical microstimulation influences perceptual judgements of motion direction. Nature, 346:174--177.
Schiele, B. and Crowley, J. L. (1996). Object recognition using multidimensional receptive field histograms. In Buxton, B. and Cipolla, R., editors, Proc. ECCV'96, volume 1 of Lecture Notes in Computer Science, pages 610--619, Berlin. Springer.
Schwartz, E. L. (1985). Local and global functional architecture in primate striate cortex: outline of a spatial mapping doctrine for perception. In Rose, D. and Dobson, V. G., editors, Models of the visual cortex, pages 146--157. Wiley, New York, NY.
Selfridge, O. G. (1959). Pandemonium: a paradigm for learning. In The mechanisation of thought processes. H.M.S.O., London.
Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with unknown distance function. Part I. Psychometrika, 27(2):125--140.
Shepard, R. N. (1968). Cognitive psychology: A review of the book by U. Neisser. Amer. J. Psychol., 81:285--289.
Shepard, R. N. (1980). Multidimensional scaling, tree-fitting, and clustering. Science, 210:390--397.
Shepard, R. N. (1984). Ecological constraints on internal representation: resonant kinematics of perceiving, imagining, thinking, and dreaming. Psychological Review, 91:417--447.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237:1317--1323.
Shepard, R. N. and Arabie, P. (1979). Additive clustering: representation of similarities as combinations of discrete overlapping properties. Psychological Review, 86:87--123.
Shepard, R. N. and Cermak, G. W. (1973). Perceptual-cognitive explorations of a toroidal set of free-form stimuli. Cognitive Psychology, 4:351--377.
Shepard, R. N. and Chipman, S. (1970). Second-order isomorphism of internal representations: Shapes of states. Cognitive Psychology, 1:1--17.
Shepard, R. N. and Kannappan, S. (1993). Connectionist implementation of a theory of generalization. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 665--672. Morgan Kaufmann.
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop -- a formalism for specifying selected invariances in an adaptive network. In Moody, J., Lippman, R., and Hanson, S. J., editors, Neural Information Processing Systems, volume 4, pages 895--903. Morgan Kaufmann, San Mateo, CA.
Snippe, H. P. and Koenderink, J. J. (1992). Discrimination thresholds for channel-coded systems. Biological Cybernetics, 66:543--551.
Spinoza, B. (1677/1981). The Ethics. J. Simon Publisher, Malibu, CA.
Sugihara, T., Edelman, S., and Tanaka, K. (1996). Representation of objective similarity in the monkey. Invest. Ophthalm. Vis. Sci. Suppl. (Proc. ARVO), 37. abstract.
Sundararaman, D. (1980). Moduli, deformations and classifications of compact complex manifolds. Pitman.
Suppes, P., Pavel, M., and Falmagne, J. (1994). Representations and models in psychology. Ann. Rev. Psychol., 45:517--544.
Tanaka, K. (1992). Inferotemporal cortex and higher visual functions. Current Opinion in Neurobiology, 2:502--505.
Tanaka, K. (1993). Neuronal mechanisms of object recognition. Science, 262:685--688.
Tanaka, K., Saito, H., Fukada, Y., and Moriya, M. (1991). Coding visual images of objects in the inferotemporal cortex of the macaque monkey. J. Neurophysiol., 66:170--189.
Tversky, A. (1977). Features of similarity. Psychological Review, 84:327--352.
Tversky, A. and Gati, I. (1978). Studies of similarity. In Rosch, E. and Lloyd, B., editors, Cognition and Categorization, pages 79--98. Erlbaum.
Ullman, S. (1980). Against direct perception. Behavioral and Brain Sciences, 3:373--416.
Ullman, S. (1989). Aligning pictorial descriptions: an approach to object recognition. Cognition, 32:193--254.
Ullman, S. (1995). Sequence-seeking and counter-streams: a model for information flow in the cortex. Cerebral Cortex, 5:1--11.
Vaisala, J. (1971). Lectures on n-dimensional quasiconformal mappings. Number 229 in Lecture Notes in Mathematics. Springer-Verlag, Berlin.
Vaisala, J. (1992). Domains and maps. In Vuorinen, M., editor, Quasiconformal space mappings, number 1508 in Lecture Notes in Mathematics, pages 119--131. Springer-Verlag, Berlin.
von Helmholtz, H. (1856/1964). Unconscious conclusions. In Dember, W. N., editor, Visual perception: the nineteenth century, pages 163--170. Wiley.
Westheimer, G. (1981). Visual hyperacuity. Prog. Sensory Physiol., 1:1--37.
Young, M. P. and Yamane, S. (1992). Sparse population coding of faces in the inferotemporal cortex. Science, 256:1327--1331.
Zorich, V. A. (1992). The global homeomorphism theorem for space quasiconformal mappings. In Vuorinen, M., editor, Quasiconformal space mappings, number 1508 in Lecture Notes in Mathematics, pages 132--148. Springer-Verlag, Berlin.
Perhaps the most straightforward approach to the construction of a low-dimensional shape space is based on the notion of "landmarks": fiducial points affixed to the object whose locations determine the object's shape [Bookstein, 1991]. An orderly study of the geometry of shape spaces defined by locations of points was initiated only recently, by Kendall (1984, 1989), who pointed out that the notion of a shape must include a specification of the transformations which, by definition, leave the shape invariant. In Kendall's shape spaces, where objects are rigid configurations of points, it is natural to define shape up to the action of the orthogonal group of transformations (that is, rigid motions plus reflection). From this it follows that dissimilarity between two sets of points is to be measured by the Procrustes distance, defined as the sum of squared residual distances between corresponding points that remain after applying the optimal orthogonal mapping that matches one set to the other [Borg and Lingoes, 1987].
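The Procrustes distance described above can be illustrated with a minimal sketch. It assumes NumPy, landmark sets stored as arrays with corresponding rows, and centering of both sets before matching (so that translation is factored out); the function name `procrustes_distance` is hypothetical and is not taken from the cited sources.

```python
import numpy as np

def procrustes_distance(a, b):
    """Sum of squared residuals after the best orthogonal map of b onto a.

    a, b: (n_points, dim) arrays whose rows are corresponding landmarks.
    Assumption: both sets are centered first, so translation does not count.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # Orthogonal Procrustes problem: the orthogonal q minimizing
    # ||a - b @ q||_F is q = u @ vt, where u, s, vt is the SVD of b.T @ a.
    u, _, vt = np.linalg.svd(b.T @ a)
    q = u @ vt  # may be a rotation or a reflection
    residual = a - b @ q
    return float(np.sum(residual ** 2))

# Usage: a unit square, a rotated, mirrored and shifted copy, and a stretched copy.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
moved = (square @ rot.T) * np.array([1.0, -1.0]) + np.array([3.0, -2.0])
stretched = square * np.array([2.0, 1.0])

d_same = procrustes_distance(square, moved)       # close to zero: same shape
d_diff = procrustes_distance(square, stretched)   # positive: different shape
```

Because the minimization runs over the full orthogonal group, the mirrored copy is treated as the same shape; restricting the match to proper rotations would require an additional sign correction on the determinant of `q`.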
An interesting consequence of allowing for a Procrustes transformation before computing shape-space distance is that it makes the topology of this space nontrivial. Consider the simple example of the space of all triangles in a plane, and a particular member of that space: the equilateral triangle. Start deforming this triangle by moving one of the vertices inwards, along the perpendicular to the opposite side; this deformation corresponds to a movement of the corresponding point in the shape space. At some stage, the chosen vertex will cross over the opposite side (at which point the triangle will degenerate into a line) and will continue moving outwards. Finally, an equilateral triangle will be re-formed; this triangle is a rotated version of the original one, and therefore equivalent to it under the Procrustes metric. Hence continuous movement along a straight line in the triangle-vertex space corresponds to a movement along a closed line in the shape space. It can be shown that this space is also not flat, and contains singularities (one of which is the triangle whose three vertices coincide); furthermore, the local Riemannian metric that takes these properties into account determines a global metric which is identical to the Procrustes distance [Carne, 1990; Le and Kendall, 1993].
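The closed path in the triangle example can be traced numerically. The following is a sketch under stated assumptions: NumPy is available, vertex correspondence is kept fixed along the path, and the helper names `procrustes_distance` and `triangle` are hypothetical. With labeled vertices the final triangle is a mirror image of the initial one, which counts as the same shape here because the orthogonal group includes reflections.

```python
import numpy as np

def procrustes_distance(a, b):
    """Sum of squared residuals after centering and optimal orthogonal matching."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    u, _, vt = np.linalg.svd(b.T @ a)
    return float(np.sum((a - b @ (u @ vt)) ** 2))

def triangle(apex_height):
    """Triangle with a fixed unit base on the x-axis and apex at (0, apex_height)."""
    return np.array([[-0.5, 0.0], [0.5, 0.0], [0.0, apex_height]])

h = np.sqrt(3.0) / 2.0             # apex height of the unit-side equilateral triangle
start = triangle(h)
heights = np.linspace(h, -h, 9)    # apex moves through the base and out the other side
distances = [procrustes_distance(start, triangle(t)) for t in heights]

# The distance starts at zero, grows as the triangle flattens
# (the middle sample is the degenerate, collinear case),
# and returns to zero once the equilateral triangle is re-formed.
for t, d in zip(heights, distances):
    print(f"apex height {t:+.3f}  distance to start {d:.4f}")
```

The apex travels along a straight segment in vertex coordinates, yet the first and last samples lie at the same point of the shape space, which is the sense in which the path there is closed.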
In some cases it may be desirable to define shape up to a group of transformations that is less restrictive than the orthogonal group, or, in other words, to allow deformation.[Note 18] In that case, a suitable framework for the definition of a shape space is provided by the theory of Riemann surfaces [Krushkal', 1979]. Specifically, any two surfaces (shapes) of a given genus related by a conformal mapping can be considered as equivalent (belonging to the same class), with a quasiconformal mapping (see appendix B) taking one shape class into another. The resulting shape space (known as the Teichmuller space) has a Riemannian metric, defined by the deviation of the quasiconformal mapping from conformality [Krushkal', 1979]. The Teichmuller space can be parameterized by a small set of real numbers that provide a possible coordinate system for the resulting shape space [Sundararaman, 1980].
A considerably broader class of mappings emerges if the requirement of conformality is replaced by that of quasiconformality. A regular topological mapping is quasiconformal if there exists a constant q, 1 <= q < infinity, such that almost any infinitesimally small sphere is transformed into an ellipsoid for which the ratio of the largest semiaxis to the smallest one does not exceed q [Reshetnyak, 1989]. Intuitively, a conformal mapping is locally an isometry up to uniform scaling (i.e., a rigid motion, possibly combined with a change of scale; see Figures 8a and 8b); a quasiconformal mapping is locally affine (i.e., a combination of motion with shearing deformation). Under such a mapping, the ranks of distances between points are preserved approximately, on a small scale [Vaisala, 1992, p. 124]. The relevance of quasiconformality to the representation of real-world shapes stems from the realization that distance ranks need not be preserved globally, across the entire shape space; they need only be preserved within shape classes (just as the common parametrization that is the basis for the definition of distal similarity is required to hold within, but not to extend across, the boundaries of natural kinds).
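The semiaxis-ratio criterion can be made concrete for a linear map, whose Jacobian is the same everywhere. The sketch below assumes NumPy; the names `dilatation` and `within_bound` are hypothetical. It relies on the fact that a linear map sends the unit sphere to an ellipsoid whose semiaxes are the singular values of the matrix.

```python
import numpy as np

def dilatation(jacobian):
    """Ratio of the largest to the smallest semiaxis of the image of a small sphere.

    Assumption: the Jacobian is nonsingular, so the smallest singular value is > 0.
    """
    s = np.linalg.svd(np.asarray(jacobian, dtype=float), compute_uv=False)
    return float(s[0] / s[-1])

def within_bound(jacobian, q):
    """True if the local distortion does not exceed the constant q (q >= 1)."""
    return dilatation(jacobian) <= q

theta = 0.4
# Rotation combined with uniform scaling: spheres stay spheres, ratio is 1.
scaled_rotation = 2.0 * np.array([[np.cos(theta), -np.sin(theta)],
                                  [np.sin(theta),  np.cos(theta)]])
# Shear: spheres become ellipsoids, ratio exceeds 1 but stays finite.
shear = np.array([[1.0, 0.5],
                  [0.0, 1.0]])

k_rot = dilatation(scaled_rotation)
k_shear = dilatation(shear)
```

For a nonlinear mapping the same ratio would be evaluated on the Jacobian at each point, and quasiconformality would require a single q that bounds it almost everywhere.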