# 6.4: Invariant Object Recognition in the "What" Pathway

Object recognition is the defining function of the ventral "what" pathway of visual processing: identifying what you are looking at. Neurons in the inferotemporal (IT) cortex can detect whole objects, such as faces and cars, over a large region of visual space. This spatial invariance (where the neural response remains the same, or invariant, over spatial locations) is critical for effective behavior in the world -- objects can show up in all different locations, and we need to recognize them regardless of where they appear. Achieving this outcome is a very challenging process, one that stumped artificial intelligence (AI) researchers for a long time. In the early days of AI, in the 1960s, it was optimistically thought that object recognition could be solved as a summer research project. Some 50 years later we are making a lot of progress, but the problem remains unsolved in the sense that people are still much better than our models. Because our brains do object recognition effortlessly all the time, we do not really appreciate how hard a problem it is.

Figure 6.10: Why object recognition is hard: things that should be categorized as the same (i.e., have the same output label) often have no overlap in their retinal input features when they show up in different locations, sizes, etc., but things that should be categorized as different often have high levels of overlap when they show up in the same location. Thus, the bottom-up similarity structure is directly opposed to the desired output similarity structure, making the problem very difficult.

The reason object recognition is so hard is that there can often be no overlap at all among visual inputs of the same object in different locations (or sizes, rotations, colors, etc.), while there can be high levels of overlap among different objects in the same location (Figure 6.10). Therefore, you cannot rely on the bottom-up visual similarity structure, because it often works directly against the desired output categorization of these stimuli. As we saw in the Learning Chapter, successful learning in this situation requires error-driven learning, because self-organizing learning tends to be strongly driven by the input similarity structure.
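The overlap argument can be made concrete with a toy example. The following is a minimal sketch, not part of the simulation used in this chapter: the 12x12 binary "retina", the F and E shapes, and the `draw` and `overlap` helpers are all hypothetical choices made for illustration. It counts shared active pixels between the same object at two locations, and between two different objects at the same location.

```python
import numpy as np

SIZE = 12  # hypothetical retina: SIZE x SIZE binary pixels


def draw(shape, row, col):
    """Draw a 3-wide, 5-tall 'F' or 'E' with its top-left corner at (row, col)."""
    img = np.zeros((SIZE, SIZE), dtype=int)
    img[row:row + 5, col] = 1          # vertical stem
    img[row, col:col + 3] = 1          # top bar
    img[row + 2, col:col + 3] = 1      # middle bar
    if shape == "E":
        img[row + 4, col:col + 3] = 1  # bottom bar (present only in the E)
    return img


def overlap(a, b):
    """Number of pixels that are active in both images."""
    return int(np.sum(a * b))


f_top_left = draw("F", 0, 0)
f_bottom_right = draw("F", 6, 6)
e_top_left = draw("E", 0, 0)

# Same object, different locations: no shared pixels at all.
print("F vs. F (different locations):", overlap(f_top_left, f_bottom_right))
# Different objects, same location: almost every pixel is shared.
print("F vs. E (same location):", overlap(f_top_left, e_top_left))
```

In this toy case the two F inputs share none of their 9 active pixels, while the F and the E share all 9 pixels of the F. Any learning rule that groups inputs by raw pixel similarity would therefore put the F and the E together and keep the two Fs apart, which is the opposite of the desired categorization.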

Figure 6.11: Schematic for how multiple levels of processing can result in invariant object recognition, where an object can be recognized at any location across the input. Each level of processing incrementally increases the featural complexity and spatial invariance of what it detects. Doing this incrementally allows the system to appropriately bind together features and their relationships, while also gradually building up overall spatial invariance.

Figure 6.12: Another way of representing the hierarchy of increasing featural complexity that arises over the areas of the ventral visual pathways. V1 has elementary feature detectors (oriented edges). Next, these are combined into junctions of lines in V2, followed by more complex visual features in V4. Individual faces are recognized at the next level in IT (even here, multiple face units are active in graded proportion to how similar people look). Finally, at the highest level are important functional "semantic" categories that serve as a good basis for actions that one might take; being able to develop such high-level categories is critical for intelligent behavior. This level corresponds to more anterior areas of IT.

The most successful approach to the object recognition problem, which was advocated initially in a model by Fukushima (1980), is to incrementally solve two problems over a hierarchically organized sequence of layers (Figure 6.11, Figure 6.12):

• The invariance problem, by having each layer integrate over a range of locations (and sizes, rotations, etc.) for the features in the previous layer, such that neurons become increasingly invariant as one moves up the hierarchy.
• The pattern discrimination problem (distinguishing an A from an F, for example), by having each layer build up more complex combinations of feature detectors, as a result of detecting combinations of the features present in the previous layer, such that neurons are better able to discriminate even similar input patterns as one moves up the hierarchy.

The critical insight from these models is that breaking these two problems down into incremental, hierarchical steps enables the system to solve both problems without one causing trouble for the other. For example, if you had a simple fully invariant vertical line detector that responded to a vertical line in any location, it would be impossible to know what spatial relationship this line has with other input features, and this relationship information is critical for distinguishing different objects (e.g., a T and L differ only in the relationship of the two line elements). So you cannot solve the invariance problem in one initial pass, and then try to solve the pattern discrimination problem on top of that. They must be interleaved, in an incremental fashion. Similarly, it would be completely impractical to attempt to recognize highly complex object patterns at each possible location in the visual input, and then just do spatial invariance integration over locations after that. There are way too many different objects to discriminate, and you'd have to learn about them anew in each different visual location. It is much more practical to incrementally build up a "part library" of visual features that are increasingly invariant, so that you can learn about complex objects only toward the top of the hierarchy, in a way that is already spatially invariant and thus only needs to be learned once.
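The T versus L argument can be checked directly in a few lines. The following is a minimal sketch under stated assumptions, not the model explored later in this section: the 8x8 retina, the 3-pixel line detectors, and the functions `draw`, `line_maps`, `one_pass_invariant`, and `incremental` are hypothetical names invented for this illustration. For brevity it collapses the hierarchy into two feature levels followed by a single pooling step, whereas the approach described above increases invariance gradually at every layer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

SIZE = 8  # hypothetical retina: SIZE x SIZE binary pixels


def draw(shape, row, col):
    """Draw a 3x3 'T' or 'L' with its top-left corner at (row, col)."""
    img = np.zeros((SIZE, SIZE), dtype=int)
    if shape == "T":
        img[row, col:col + 3] = 1      # bar on top
        img[row:row + 3, col + 1] = 1  # stem hangs from the bar's center
    elif shape == "L":
        img[row:row + 3, col] = 1      # stem on the left
        img[row + 2, col:col + 3] = 1  # bar at the bottom
    return img


def line_maps(img):
    """Layer 1: location-specific detectors for 3-pixel lines.

    h[r, c] is True if a horizontal line starts at (r, c);
    v[r, c] is True if a vertical line starts at (r, c).
    """
    h = np.zeros((SIZE, SIZE), dtype=bool)
    v = np.zeros((SIZE, SIZE), dtype=bool)
    h[:, :SIZE - 2] = sliding_window_view(img, (1, 3)).all(axis=(2, 3))
    v[:SIZE - 2, :] = sliding_window_view(img, (3, 1)).all(axis=(2, 3))
    return h, v


def one_pass_invariant(img):
    """Pool each line detector over all locations right away."""
    h, v = line_maps(img)
    return (bool(h.any()), bool(v.any()))


def incremental(img):
    """Combine lines into conjunctions first, then pool over locations."""
    h, v = line_maps(img)
    # Layer 2: conjunctions that keep the relative position of the two lines.
    t_junction = h[:, :-1] & v[:, 1:]  # bar at (r, c), stem at (r, c + 1)
    l_corner = v[:-2, :] & h[2:, :]    # stem at (r, c), bar at (r + 2, c)
    # Pooling: each conjunction detector responds anywhere on the retina.
    return (bool(t_junction.any()), bool(l_corner.any()))


for shape, row, col in [("T", 0, 0), ("T", 4, 5), ("L", 0, 0), ("L", 5, 2)]:
    img = draw(shape, row, col)
    print(shape, (row, col), one_pass_invariant(img), incremental(img))
```

With immediate pooling, the T and the L produce the same code, because each contains one horizontal and one vertical line and the information about how they are joined has been discarded. When the conjunction is computed before pooling, the T and the L produce different codes, and each code is the same at every location tested.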

Figure 6.13: Summary of neural response properties in V2, V4, and IT for the macaque monkey, according to both the extent to which the areas respond to complex vs. simple visual features (Smax / MAX column, showing how the response to simple visual inputs (Smax) compares to the maximum response to any visual input image tested (MAX)), and the overall size of the visual receptive field, over which the neurons exhibit relatively invariant responding to visual features. For V2, nearly all neurons responded maximally to simple stimuli, and the receptive field sizes were the smallest. For V4, only 50% of neurons had simple responses as their maximal response, and the receptive field sizes increase over V2. Posterior IT increases (slightly) on both dimensions, while anterior IT exhibits almost entirely complex featural responding and significantly larger receptive fields. These incremental increases in complexity and invariance (receptive field size) are exactly as predicted by the incremental computational solution to invariant object recognition as shown in the previous figure. Reproduced from Kobatake & Tanaka (1994).

In a satisfying convergence of top-down computational motivation and bottom-up neuroscience data, this incremental, hierarchical solution provides a nice fit to the known properties of the visual areas along the ventral "what" pathway (V1, V2, V4, IT). Figure 6.13 summarizes neural recordings from these areas in the macaque monkey. It shows that, as one proceeds up the hierarchy of areas, neurons increase both in the complexity of the stimuli that drive their responding and in the size of the receptive field over which they exhibit an invariant response to these stimuli. Figure 6.14 shows example complex stimuli that evoked maximal responding in each of these areas, to give a sense of what kind of complex feature conjunctions these neurons can detect.

Figure 6.14: Complex stimuli that evoked a maximal response from neurons in V2, V4, and IT, providing some suggestion for what kinds of complex features these neurons can detect. Most V2 neurons responded maximally to simple stimuli (oriented edges, not shown). Reproduced from Kobatake & Tanaka (1994).

See Ventral Path Data for a more detailed discussion of the data on neural responses to visual shape features in these ventral pathways, including several more data figures. There are some interesting subtleties and controversies in this literature, but the main conclusions presented here still hold.

## Exploration of Object Recognition

Figure 6.15: Set of 20 objects composed from horizontal and vertical line elements used for the object recognition simulation. By using a restricted set of visual feature elements, we can more easily understand how the model works, and also test for generalization to novel objects (objects 18 and 19 are not trained initially, and then subsequently trained in only a relatively small number of locations -- learning there generalizes well to other locations).

Go to Objrec for the computational model of object recognition, which demonstrates the incremental hierarchical solution to the object recognition problem. We use a simplified set of "objects" (Figure 6.15) composed from vertical and horizontal line elements. This simplified set of visual features allows us to better understand how the model works, and also enables testing generalization to novel objects composed from these same sets of features. You will see that the model learns simpler combinations of line elements in area V4, and more complex combinations of features in IT, which are also invariant over the full receptive field. These IT representations are not identical to entire objects -- instead they represent an invariant distributed code for objects in terms of their constituent features. The generalization test shows how this distributed code can support rapid learning of new objects, as long as they share this set of features. It seems that a similar vocabulary of visual shape features is learned in primate IT representations, although the features there are likely much more complex and less well defined.