Zero-shot visual reasoning through probabilistic analogical mapping

Human reasoning is grounded in an ability to identify highly abstract commonalities governing superficially dissimilar visual inputs. Recent efforts to develop algorithms with this capacity have largely focused on approaches that require extensive direct training on visual reasoning tasks, and yield limited generalization to problems with novel content. In contrast, a long tradition of research in cognitive science has focused on elucidating the computational principles underlying human analogical reasoning; however, this work has generally relied on manually constructed representations. Here we present visiPAM (visual Probabilistic Analogical Mapping), a model of visual reasoning that synthesizes these two approaches. VisiPAM employs learned representations derived directly from naturalistic visual inputs, coupled with a similarity-based mapping operation derived from cognitive theories of human reasoning. We show that without any direct training, visiPAM outperforms a state-of-the-art deep learning model on an analogical mapping task. In addition, visiPAM closely matches the pattern of human performance on a novel task involving mapping of 3D objects across disparate categories.


Introduction
If a tree had a knee, where would it be? Human reasoners (without any special training) can provide sensible answers to such questions as early as age four, revealing deep understanding of spatial relations across different object categories [1]. How such abstraction is possible has been the focus of decades of research in cognitive science, leading to several proposed theories of analogical reasoning [2][3][4][5][6][7]. The basic elements of these theories include structured representations involving bindings between entities and relations, together with some mechanism for mapping the elements in one situation (the source) onto the elements in another (the target), based on the similarity of those elements. However, while this work has succeeded in accounting for a range of empirical signatures of human analogy-making, a significant limitation is that the structured representations at the heart of these theories have typically been hand-designed by the modeler, without providing an account of how such representations might be derived from real-world perceptual inputs [8]. This concern is particularly notable in the case of visual analogies. Unlike linguistic analogies, in which relations are often given as part of the input (via verbs and relational phrases), visual analogies more obviously depend on a mechanism for the eduction of relations [9]: the extraction of relations from non-relational inputs, such as pixels in images.
More recently, a separate line of research inspired by the success of deep learning in computer vision and other areas has aimed to develop neural network algorithms with the ability to solve analogy problems. This work tackles the challenge of solving analogy problems directly from pixel-level inputs; however, it has typically done so by eschewing the structured representations and reasoning operations that characterize cognitive models, focusing instead on end-to-end training directly on visual analogy tasks. These approaches have generally relied on datasets with massive numbers (sometimes more than a million) of analogy problems, and typically yield algorithms that cannot readily generalize to problems with novel content [10][11][12]. More recent efforts have attempted to address this limitation, proposing algorithms that can learn from fewer examples or display some degree of out-of-distribution generalization [13][14][15][16]. But this work still maintains the basic paradigm of treating analogy problems as a task on which a reasoner will receive at least some direct training. This is in stark contrast to human reasoners, who can often solve analogy problems zero-shot, that is, with no direct training on those problem types. When visual analogy problem sets are administered to a person, 'training' is typically limited to general task instructions with at most one practice problem. This format is critical to the use of analogy problems to measure fluid intelligence [17,18], the ability to reason about novel problems without prior training or exposure.
Here we propose a synthesis that combines the strengths of these two approaches. Our proposed model, visiPAM (visual Probabilistic Analogical Mapping), combines learned representations derived directly from pixel-level or 3D point-cloud inputs, together with a reasoning mechanism inspired by cognitive models of analogy. VisiPAM addresses two key challenges that arise in developing such a synthesis. The first issue is how structured representations might be extracted from unstructured perceptual inputs. VisiPAM employs attributed graph representations that explicitly keep track of entities, relations, and their bindings. We propose two methods for deriving such representations, one for 2D images and one for point-cloud inputs representing 3D objects. Second, representations inferred from real perceptual inputs are necessarily noisy and high-dimensional, posing a challenge for any reasoning mechanism that must operate over them. To address this problem, visiPAM employs a recently proposed Probabilistic Analogical Mapping (PAM) [19] method that can efficiently identify patterns of similarity governing noisy, real-world inputs. We evaluated visiPAM on a part-matching task with naturalistic images, where it outperformed a state-of-the-art deep learning model by a large margin (30% relative reduction in error rate), despite the fact that, unlike the deep learning model, visiPAM received no direct training on the part-matching task. We also performed a human experiment involving visual analogies between 3D objects from disparate categories (e.g., an analogy between an animal and a man-made object), where we found that visiPAM closely matched the pattern of human variability across conditions, and approached the level of human performance. Together, these results provide a proof-of-principle for an approach that weds the representational power of deep learning with the similarity-based reasoning operations that characterize human cognition.

Computational framework
Figure 1 shows an overview of our proposed approach. VisiPAM consists of two core components: 1) a vision module that extracts structured visual representations from perceptual inputs corresponding to a source and a target, and 2) a reasoning module that performs probabilistic mapping over those representations. The resulting mappings can then be used to transfer knowledge from the source to the target, to infer the part labels in the target image based on the part labels in the source.
The inputs to the vision module consisted of either 2D images, or point-cloud representations of 3D objects. The vision module uses deep learning components to extract representations in the form of attributed graphs. These graph representations consist of a set of nodes and directed edges, each of which is associated with a set of attributes (i.e., a vector). In the present work, nodes corresponded to object parts, and edges corresponded to spatial relations, though other arrangements are possible (e.g., nodes could correspond to entire objects in a multi-object scene, and edges could encode other visual or semantic relations in addition to spatial relations). For 2D image inputs, we used iBOT [20] to extract node attributes that captured the visual appearance of each object part. iBOT is a state-of-the-art, self-supervised vision transformer that was pretrained on a masked image modeling task. Masked image modeling shows promise as a suitable pretraining objective for capturing general visual appearance, rather than only information relevant to a particular task, such as object classification. For 3D point-cloud inputs, we used a Dynamic Graph Convolutional Neural Network (DGCNN) [21] trained on a part segmentation task in which each point was labelled according to the object part to which it belonged. We hypothesized that this objective would encourage the intermediate layers of the DGCNN (which were used to generate node embeddings) to represent the local spatial structure of object parts. We evaluated visiPAM on analogies involving 3D objects from either the same superordinate category (e.g., both the source and target were man-made objects) or different superordinate categories (e.g., the source was a man-made object, and the target was an animal). Note that the DGCNN was only trained on part segmentation for man-made objects, not for animals. Edge attributes encoded either the 2D or 3D spatial relations between object parts (see Section 4.2.3 for more details). We denote the node attributes of each graph as o_i (i = 1..N), and the edge attributes as r_ij (one attribute vector for each of the N(N−1) ordered node pairs), where N is the number of nodes in the graph. We denote the source graph as G and the target graph as G′.
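To make the attributed-graph representation concrete, the following is a minimal sketch of such a data structure. The class and field names are our own illustrative choices, not part of visiPAM's implementation; the structure simply mirrors the description above (node vectors for object parts, relation vectors on directed edges).

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class AttributedGraph:
    """Illustrative container for a visiPAM-style attributed graph:
    nodes hold appearance embeddings, directed edges hold relation vectors."""
    node_attrs: np.ndarray                          # shape (N, d_node): one embedding per object part
    edge_attrs: dict = field(default_factory=dict)  # (i, j) -> relation vector for directed edge i -> j

    @property
    def num_nodes(self) -> int:
        return self.node_attrs.shape[0]

# toy example: 3 parts with 4-d appearance vectors, fully connected directed edges
rng = np.random.default_rng(0)
nodes = rng.random((3, 4))
edges = {(i, j): rng.random(2) for i in range(3) for j in range(3) if i != j}
g = AttributedGraph(nodes, edges)
assert g.num_nodes == 3 and len(g.edge_attrs) == 3 * 2  # N(N-1) directed edges
```

A source graph G and target graph G′ built this way are the inputs the reasoning module operates over.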
After these representations were extracted by the vision module, we used PAM to identify correspondences between analogous parts in the source and target, based on the pattern of similarity across both parts and their relations. Formally, this corresponds to a graph-matching problem, in which the objective is to identify the mapping M that maximizes the similarity of mapped nodes and edges. PAM adopts Bayesian inference to compute the optimal mapping:

M* = argmax_M p(M | G, G′) ∝ argmax_M p(G, G′ | M) p(M)    (1)

where M is an N × N matrix in which each entry M_ii′ corresponds to the strength of the mapping between node i in the source and node i′ in the target. The posterior in Equation 1 is based on the following likelihood:

p(G, G′ | M) ∝ exp( (1−α)/(N(N−1)) Σ_{ij,i′j′} sim(r_ij, r_i′j′) M_ii′ M_jj′ + (α/N) Σ_{i,i′} sim(o_i, o_i′) M_ii′ )    (2)

in which sim(r_ij, r_i′j′) is the cosine similarity between edge r_ij (the edge between nodes i and j in the source) and edge r_i′j′ (the edge between nodes i′ and j′ in the target), and sim(o_i, o_i′) is the cosine similarity between node o_i in the source and node o_i′ in the target. Intuitively, the optimal mapping is one that assigns high strength to similar nodes, and to nodes attached by similar edges. The relative influence of node vs. edge similarity is controlled by the parameter α. In addition, PAM is constrained by a prior p(M) that favors isomorphic (one-to-one) mappings. Given that an exhaustive search over possible mappings was not feasible (we address problems with up to 10 object parts, corresponding to 10! ≈ 3.6 million possible one-to-one mappings), we used a graduated assignment algorithm [22] to efficiently converge on a soft mapping assignment. More details about PAM can be found in Section 4.3.
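The mapping objective can be sketched as a scoring function over a soft mapping matrix M: an α-weighted node-similarity term plus a (1−α)-weighted edge-similarity term, each normalized by its count. This is an illustrative sketch under our assumptions about names and data layout, not visiPAM's actual code.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two attribute vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def mapping_score(M, nodes_s, nodes_t, edges_s, edges_t, alpha=0.9):
    """Score of a soft mapping M: alpha-weighted node term plus
    (1 - alpha)-weighted edge term, normalized to a common scale."""
    N = M.shape[0]
    node_term = sum(cos_sim(nodes_s[i], nodes_t[ip]) * M[i, ip]
                    for i in range(N) for ip in range(N)) / N
    edge_term = sum(cos_sim(edges_s[(i, j)], edges_t[(ip, jp)]) * M[i, ip] * M[j, jp]
                    for (i, j) in edges_s for (ip, jp) in edges_t) / (N * (N - 1))
    return alpha * node_term + (1 - alpha) * edge_term

# toy case: two identical 2-node graphs; the identity mapping should outscore the swap
nodes = np.array([[1.0, 0.0], [0.0, 1.0]])
edges = {(0, 1): np.array([1.0, 0.0]), (1, 0): np.array([-1.0, 0.0])}
identity, swap = np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])
assert mapping_score(identity, nodes, nodes, edges, edges) > \
       mapping_score(swap, nodes, nodes, edges, edges)
```

The reasoning module does not score mappings exhaustively; it searches this objective with graduated assignment (Section 4.3).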

Analogical mapping with 2D images
We evaluated visiPAM on a part-matching task developed by Choi et al. [23]. In this task, two images are presented, each together with a set of coordinates corresponding to the locations of various object parts (as in Figure 1). The task is to label the object parts in the target image, based on a comparison with the labelled object parts in the source image. Importantly, the test set for this dataset consists only of object categories not present in the training set, so that it is not possible to solve the task by learning to classify the object parts in the target image directly.

Table 1 (caption): The original task in [23] involved evaluation on within-category animal comparisons after training with 37,330 mapping problems. VisiPAM significantly outperformed SSMN (the Structured Set Matching Network), the previous state of the art, despite having no direct training on mapping. VisiPAM also performed well on new problems involving within-category vehicle comparisons and between-category animal comparisons (e.g., mapping from cat to horse). VisiPAM performed best when mapping was based on both node and edge similarity. 'Random' denotes chance performance (determined by the average number of part comparisons).
Table 1 shows the results of this evaluation. Accuracy was computed based on the proportion of parts that were mapped correctly across all problems (i.e., the model could receive partial credit for mapping some but not all of the parts correctly in a given image pair). As originally proposed, the part-matching task involved within-category comparisons of animals (comparing either two images of cats, or two images of horses). In that setting, visiPAM significantly outperformed SSMN, resulting in a 30% reduction relative to SSMN's error rate, despite the fact that SSMN, but not visiPAM, received direct training (37,330 problems) on this task. We also developed extensions of this task involving other object categories. These included within-category mapping of vehicles (involving either two images of planes, or two images of cars), and between-category mappings of animals (mapping from an image of a cat to an image of a horse). We found that visiPAM performed comparably well in these task settings.
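The partial-credit accuracy measure described above can be sketched as follows (an illustrative helper with assumed names, not the evaluation code used in the paper):

```python
def mapping_accuracy(problems):
    """Proportion of parts mapped correctly, pooled over all problems, so a
    problem with 7 of 10 parts correct contributes partial credit rather than 0."""
    correct = total = 0
    for predicted, true in problems:  # each entry: lists of target-part labels
        correct += sum(p == t for p, t in zip(predicted, true))
        total += len(true)
    return correct / total

# toy example: 2/3 parts correct in the first problem, 2/2 in the second
problems = [(["head", "torso", "leg"], ["head", "torso", "tail"]),
            (["head", "leg"], ["head", "leg"])]
print(mapping_accuracy(problems))  # 4 of 5 parts correct -> 0.8
```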
A key element of our proposed model is that it performs mapping based on the similarity of both object parts and their relations. To determine the importance of this design decision, we performed ablations on either the node or edge similarity components, by setting α to either 0 or 1. We found that both of these ablations significantly impaired the performance of visiPAM. This pattern aligns with findings from studies of human analogy-making, which show that human reasoners are typically sensitive to similarity of both entities and relations [24].
Figure 2 shows some examples of the mappings produced by visiPAM. These include some impressive successes, including mapping of a large number of object parts across dramatic differences in background, lighting, pose, and visual appearance (Figure 2a), as well as between objects of different categories (Figure 2b). They also give a sense of some of the limitations. Figure 2c shows an example of an error pattern that we commonly observed, in which visiPAM confused corresponding left and right parts (in this case the left and right feet in a comparison of two horses). We also found that visiPAM particularly struggled with mapping images of planes (achieving the model's lowest within-category mapping accuracy of 42.5%), as shown in Figure 2d. This is likely due to the limitation of the 2D spatial relations used in this version of the model, which are particularly ill-suited to objects such as planes that can appear in a wide variety of poses and viewpoints. Thus, while visiPAM shows an impressive ability to perform mappings between complex, real-world images using only these 2D spatial relations, we suspect that accurate 3D spatial knowledge is likely necessary to solve some challenging visual analogy problems at a human level. To address this concern, we next sought to evaluate the performance of visiPAM in the context of 3D visual inputs.

Analogical mapping between 3D objects
We evaluated visiPAM on analogies involving point-cloud representations of 3D objects, and compared visiPAM's performance with the results of a human experiment. In that experiment, we presented human participants with analogy problems consisting of images from either the same superordinate category (e.g., an analogy between a dog and a horse) or two different superordinate categories (e.g., an analogy between a chair and a horse). Human behavioral data was especially important as a performance benchmark in the latter case, since there is not necessarily a well-defined, correct mapping between the parts of the two objects.
Figure 3a shows the analogy task used in the human experiment, which we adapted from a classic task employed by Gentner [1]. On each trial, participants received a source image and a target image. The source image had two colored markers placed on the object, while the target image had two markers that appeared in the upper right corner. Participants were instructed to 'move the marker on the top right corner in the target image to the corresponding location that maps to the same-color marker in the source image.' Figure 3b shows two representative examples of human responses. Marker placements from different participants are shown as a heatmap on the target image. When presented with source and target images from the same superordinate category, there was generally strong agreement between participants. Responses were significantly more variable when participants were presented with source and target images from different superordinate categories (distance from mean marker placement of 32.08 pixels in the different-superordinate-category condition vs. 8.05 pixels in the same-superordinate-category condition; 2 (target category: animal vs. man-made object) × 2 (category consistency: same- vs. different-superordinate-category) repeated-measures ANOVA, main effect of category consistency, F(1,40) = 620.87, p < 0.0001). We also found a significant interaction, such that the effect of category consistency was larger for problems where the target category was an animal vs. a man-made object (F(1,40) = 17.93, p < 0.001). There was no significant main effect of target category. For problems involving mapping between different superordinate categories, responses in some trials were bimodal, with some participants preferring one mapping (e.g., from the back of the chair to the head of the horse) and other participants preferring another (e.g., from the back of the chair to the back of the horse), though the same qualitative effects were still present even after accounting for these separate clusters (main effect of category consistency, F(1,40) = 145.92, p < 0.0001; interaction effect, F(1,40) = 20.99, p < 0.0001; see Section 4.5.3 for more details on the clustering analysis).
Figure 4 shows some examples of mappings produced by visiPAM for point-cloud representations of the 3D objects used in the human experiment. To apply visiPAM to these problems, we obtained embeddings for each point using the DGCNN, then clustered these points, and assigned each cluster to a node, where the attributes of that node were defined based on the average embedding for all points in the cluster. Edge attributes were defined based on the 3D spatial relations between cluster centers. These results show that visiPAM mostly identified sensible mappings, with a few apparent mismappings. This performance is especially impressive in the case of mappings involving animals, given that no component of visiPAM (including the DGCNN) had previously received any exposure to animals.
To compare the model's performance with human responses, we applied the model to all image pairs used in the experiment, and measured the distance from the marker location predicted by visiPAM to the mean location of human placements for each pair. Figure 5 shows the results of this comparison. We found that visiPAM both reproduced the qualitative pattern displayed by human mappings across conditions, and also very closely approached the overall level of human performance in this task. As with the human participants, visiPAM's mappings showed greater deviation from the mean human placement when mapping between objects from different vs. same superordinate categories (distance from human mean marker placement of 38.06 pixels in the different-superordinate-category condition, and 18.71 pixels in the same-superordinate-category condition). The size of this difference was also larger when the target category was an animal vs. man-made object, capturing the significant interaction effect observed in the human data. Overall, marker locations of analogous parts predicted by visiPAM were an average of 29 pixels from mean locations of human placements, close to the average human distance to mean locations (20 pixels). Relative to the object sizes (average height of 213 pixels and width of 135 pixels), the model and human distances were very similar.
We also calculated the item-level correlation across all analogy problems between average human distances from mean placement locations and distances of the model predictions from the same mean locations. VisiPAM reliably predicted human responses at the item level (r = 0.64). In addition, as with the results for analogies between 2D images, we found that visiPAM performed best when mapping involved both node and edge similarity. The ability of visiPAM to predict human responses at the item level was impaired both when focusing exclusively on node similarity (r = 0.56 for α = 1), and when focusing exclusively on edge similarity (r = 0.15 for α = 0).
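The item-level correlation described above is a standard Pearson r between per-problem distances; a minimal sketch with toy numbers (the arrays below are illustrative, not the experimental data):

```python
import numpy as np

def item_level_r(model_dist, human_dist):
    """Pearson correlation between per-item model and human distances
    from the mean human marker placement."""
    return float(np.corrcoef(model_dist, human_dist)[0, 1])

# toy per-item distances (in pixels); strongly correlated by construction
human = np.array([5.0, 12.0, 30.0, 8.0, 25.0])
model = np.array([7.0, 15.0, 28.0, 6.0, 27.0])
r = item_level_r(model, human)
assert 0.9 < r <= 1.0
```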

Discussion
We have presented a model capable of identifying abstract commonalities governing rich, high-dimensional visual inputs. To accomplish this, visiPAM integrates representations derived using general-purpose algorithms for representation learning together with a similarity-based reasoning mechanism derived from theories of human cognition. Our experiments with analogical mapping of object parts in 2D images showed that visiPAM outperformed a state-of-the-art deep learning model, despite receiving no direct training on the analogy task. Our experiment with mapping of 3D object parts showed that, when armed with rich 3D visual representations, visiPAM both approached the overall level of human performance, and matched the pattern of human mappings across conditions.
These results build on other recent work that has employed similarity-based reasoning over representations derived using deep learning [19,25,26]. Here, we go beyond those previous models, by presenting (to our knowledge) the first model capable of performing analogical mapping with real-world visual inputs. VisiPAM has commonalities with recently proposed deep learning models that employ an explicit representation of similarity to achieve stronger out-of-distribution generalization in visual reasoning tasks [15,16]. These include the Structured Set Matching Network with which we compared visiPAM [23]. However, a key difference is that similarity in visiPAM is defined over representations learned through more general-purpose approaches (e.g., self-supervised learning), rather than representations acquired through end-to-end training in the service of a specific task. Similarly, a number of recent deep learning models have incorporated graph-based representations similar to the structured visual representations that we employ [27][28][29], but these models were also trained end-to-end based on a specific task.
A few recent studies have applied a more traditional cognitive model of analogical mapping, the Structure-Mapping Engine (SME), to visual analogy problems involving simple geometric forms [30,31]. This work represents an impressive effort, showing that SME is able to solve the majority of problems in the Raven's Standard Progressive Matrices [32], a commonly used measure of fluid intelligence. However, a significant limitation of this approach lies in its traditional symbolic representations, which are derived from image-based inputs using hand-designed algorithms. This characteristic arguably limits the ability of such an approach to address more complex, real-world inputs such as natural images. Other so-called 'neurosymbolic' methods have attempted, with some success, to extract symbolic representations directly from pixel-level inputs, which can then be passed to a task-specific symbolic algorithm [33]. However, these approaches may not be able to capture the richness of real-world perceptual inputs, or the wide variety of tasks that human reasoners can perform over those inputs. Here, by contrast, we have specifically sought to employ representations that preserve as much of this perceptual richness as possible (by representing elements as vectors), while also incorporating the structured nature of symbolic representations (by maintaining explicit bindings between entities and relations). This was made possible by using a probabilistic reasoning algorithm capable of accommodating such high-dimensional, uncertain inputs.
Human analogical reasoning is thought to be driven by similarity both at the relational level and at the basic object level [24]. This latter influence has often been framed as a deficiency in human reasoning: an inability to achieve the pure abstraction that would presumably be enabled by focusing exclusively on relations. An important finding from the present study is that visiPAM performed best when constrained both by edge (relational-level) and node (object-level) similarity. This result suggests that the influence of object-level similarity in human analogical reasoning may be driven in part by the fact that it is a useful constraint in the context of complex, real-world analogy-making [34].
One limitation of the present work is the lack of a method for deriving 3D representations (such as point-clouds) directly from the 2D inputs that human reasoners typically receive. Some recent advances may make it possible to take a step in that direction [35], and visiPAM will likely benefit from the rapid pace of innovation in representation learning algorithms. We are particularly excited about the continued development of general-purpose, self-supervised algorithms (such as the iBOT algorithm that we employed [20]), which we hope will provide increasingly rich representations over which reasoning algorithms such as PAM can operate. Finally, an important next step will be to expand the capabilities of visiPAM to incorporate other basic cognitive elements of analogy, such as the ability to induce more general schematic representations that capture abstract similarity across multiple examples [36]. There are many exciting prospects for the continued synergy between rich visual representations and human-like reasoning mechanisms.

Analogical mapping with 2D images
To evaluate visiPAM on analogical mapping with 2D images, we used the Pascal Part Matching (PPM) dataset proposed by Choi et al. [23]. This dataset was developed based on the Pascal Part dataset [37], which contains images of common object categories, together with segmentation masks for object parts. PPM consists of images from 6 categories (cow, dog, person, sheep, cat, and horse). Each problem consists of a pair of images from the same category, together with coordinates corresponding to object parts, as well as the labels associated with those parts. The task is to label the parts in the target image based on a comparison with the labeled parts in the source image. Choi et al. trained various models on 37,330 problems consisting of images from 4 object categories (cow, dog, person, and sheep), and tested those models on 9,060 problems consisting of images from 2 distinct object categories (cat and horse). In the present work, we did not train visiPAM on part matching at all, and therefore used only the problems in the test set. For these problems, the source and target image each contained 10 object parts. There were 5,860 problems involving cat images, and 3,200 problems involving horse images. Accuracy was computed based on the proportion of parts that were mapped correctly across all problems (i.e., the model could receive partial credit for mapping some but not all of the parts correctly in a given image pair). Despite the class imbalance between problems involving horses and cats, we computed performance on this dataset as an average over all problems, to be consistent with the method used by Choi et al., and therefore ensure a fair comparison with their SSMN method. The PPM dataset can be downloaded from: https://allenai.github.io/one-shot-part-labeling/ppm/

We also created two variants on the PPM dataset. First, we created problems involving between-category comparisons of animals. To do this, we selected the subset of object parts shared by both the cat and horse images (head, neck, torso, left and right ear), and created 3,200 problems in which the source was an image of a cat, and the target was an image of a horse. Second, we created new problems involving within-category comparisons of vehicles. To do so, we used images of cars and planes from the original Pascal Part dataset. We generated markers corresponding to the centroid of the masks for each object part. Then, we identified within-category image pairs for which there were at least two common object parts. Some parts had to be excluded due to a lack of systematic correspondences between images (for instance, wheels are sometimes labelled, but there does not appear to be a consistent numbering scheme that might allow identification of corresponding wheels in two images). This procedure identified 406 mapping problems involving images of cars, and 6,860 problems involving images of planes, with an average of 3.85 parts per problem. Due to this significant class imbalance, we first computed the average performance for cars and the average performance for planes, and then computed an average of these two values.
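The two averaging schemes described in this section (pooling over all problems for cat/horse, vs. averaging per-category means for the imbalanced car/plane split) can be sketched as follows. The helper names and toy counts are illustrative only:

```python
def micro_average(scores_by_class):
    """Average over all problems pooled together (the scheme used for the
    cat/horse PPM test set, matching Choi et al.)."""
    all_scores = [s for scores in scores_by_class.values() for s in scores]
    return sum(all_scores) / len(all_scores)

def macro_average(scores_by_class):
    """Average of per-class means (the scheme used for cars vs. planes,
    where the class counts are highly imbalanced)."""
    means = [sum(s) / len(s) for s in scores_by_class.values()]
    return sum(means) / len(means)

# toy imbalanced scores: 4 car problems vs. 16 plane problems
scores = {"car": [1.0] * 4, "plane": [0.5] * 16}
print(micro_average(scores))  # (4*1.0 + 16*0.5) / 20 = 0.6
print(macro_average(scores))  # (1.0 + 0.5) / 2 = 0.75
```

With imbalanced classes the two measures diverge, which is why the macro average was used for the vehicle problems.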

Analogical mapping with 3D objects
3D object stimuli were selected from two publicly available datasets used in computer vision: ShapeNetPart [38] and a 3D animal dataset ('Animal Pack Ultra 2') from the Unreal Engine Marketplace. Nine chairs were selected from the ShapeNetPart dataset (each chair with a different shape), and nine animals from the Animal Pack dataset: horse, buffalo, Cane Corso, sheep, domestic pig, Celtic wolfhound, African elephant, Hellenic hound, and camel. For simulations with visiPAM, the vision module (described in Section 4.2.2) received point-clouds sampled directly from these 3D object models, using the CloudCompare software. For the human experiment, we used the Blender software to render 2D images from the 3D models. The stimulus images were generated using a constant lighting condition, with a gray background. Multiple camera positions were sampled for each object, with 30° separation between camera angles for depth rotation. Two undergraduate research assistants manually annotated the keypoints (i.e., center locations) of predefined parts on the 3D objects, using the Unreal Engine software. Chair parts included the seat, back, and chair legs, while animal parts included the spine of the torso, head, and legs.
We generated 192 pairs of images. Each image pair included a source image (either a chair or an animal), which was annotated with two markers on two different parts of the object. To generate marker locations for source images, we first rendered the images using the corresponding 3D object models, and then calculated marker locations on the rendered 2D images using a perspective projection for the predefined camera position. In one condition, the source and target images were from the same superordinate object category (e.g., two images of animals). In the other condition, they were from different superordinate categories (e.g., a chair and an animal).

About 2,000 points were used to represent each 3D object. To reduce the computation cost in the reasoning module, a clustering algorithm (the KMeans++ algorithm [46]) was applied to the point embeddings to group the points into 8 clusters. These clusters tended to correspond to semantically meaningful parts of the object. Node attributes were defined based on the average embeddings for each cluster.
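The cluster-then-average step can be sketched as follows. For self-containment the sketch uses a plain k-means loop in place of the KMeans++ variant cited above [46], and the embedding array is random rather than DGCNN output; only the overall pipeline (cluster ~2,000 point embeddings into 8 groups, average each group into a node attribute) follows the paper.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means with random initialization (a stand-in for KMeans++)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels, centers

# assumed setup: ~2,000 points, each with a 16-d embedding (DGCNN output in the paper)
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(2000, 16))
labels, _ = kmeans(embeddings, k=8)

# node attributes = mean embedding of each cluster
node_attrs = np.stack([embeddings[labels == c].mean(axis=0) for c in range(8)])
assert node_attrs.shape == (8, 16)
```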

Edge embeddings
Edge embeddings were computed based on the spatial relations between object parts. For the analogy problems with 2D images, these spatial relations were computed using the 2D part coordinates. We defined two types of spatial relations, one based on angular distance:

r_θij = cos θ(c_i − c_0, c_j − c_0)    (3)

and one based on vector difference:

r_δij = (c_j − c_i) / (max(c_1..N) − min(c_1..N))    (4)

where c_i and c_j are the coordinates of nodes i and j, c_0 are the coordinates of the centroid of the entire object (defined in terms of the segmentation masks for the object parts), cos θ is cosine distance, [ , ] indicates concatenation, and max(c_1..N) − min(c_1..N) is the elementwise range of the coordinates for all nodes. For each pair of nodes i and j, edge attributes were formed by concatenating r_θij and r_δij: r_ij = [r_θij, r_δij].
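A minimal sketch of this edge-attribute computation, under two stated assumptions: the centroid here is approximated as the mean of the part coordinates (the paper derives it from segmentation masks), and the relation terms are a centroid-relative cosine term plus a range-normalized coordinate difference, concatenated per ordered node pair.

```python
import numpy as np

def edge_attributes(coords):
    """Compute 2D edge attributes for every ordered pair of part coordinates:
    a cosine term between centroid-relative vectors, concatenated with the
    range-normalized coordinate difference."""
    coords = np.asarray(coords, dtype=float)
    c0 = coords.mean(axis=0)                       # object centroid (mask centroid in the paper)
    rng = coords.max(axis=0) - coords.min(axis=0)  # elementwise coordinate range
    edges = {}
    for i, ci in enumerate(coords):
        for j, cj in enumerate(coords):
            if i == j:
                continue
            u, v = ci - c0, cj - c0
            cos_theta = u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
            delta = (cj - ci) / rng
            edges[(i, j)] = np.concatenate([[cos_theta], delta])
    return edges

# toy object with 3 part coordinates
parts = [(10.0, 20.0), (30.0, 5.0), (50.0, 40.0)]
e = edge_attributes(parts)
assert len(e) == 6                # N(N-1) directed edges
assert e[(0, 1)].shape == (3,)    # [cos term, delta_x, delta_y]
```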
For the analogy problems with 3D objects, the coordinates of each cluster were computed based on the average of the 3D coordinates of the points in that cluster, and the object centroid was computed based on the average of the 3D coordinates of all points in the object. Edge attributes were computed using the 3D equivalent of r_θij (r_δij was not used for these problems).

VisiPAM reasoning module: Probabilistic Analogical Mapping (PAM)
As described in Section 2.1, PAM [19] adopts Bayesian inference to compute the optimal mapping between two attributed graphs G and G′, as formulated in Equations 1 and 2. The likelihood in Equation 2 incorporates a parameter α that controls the relative influence of node vs. edge similarity: the mapping is driven entirely by node similarity when α = 1, and entirely by edge similarity when α = 0. Note that the edge similarity and node similarity terms are normalized by the number of edges (N(N − 1)) and the number of nodes (N), respectively, so that they are on the same scale before being weighted by α. We used a value of α = 0.9 when solving analogy problems with 2D images, and a value of α = 0.5 when solving the analogy problems with 3D point-clouds.
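The α-weighted combination of node and edge similarity can be sketched as follows. This is a schematic of the scoring described above, with assumed tensor layouts; it is not the authors' implementation.

```python
import numpy as np

def mapping_score(M, S_node, S_edge, alpha=0.9):
    """Weighted combination of node and edge similarity under a soft mapping M.

    M      : (N, N') soft assignment from source to target nodes.
    S_node : (N, N') pairwise node-attribute similarities.
    S_edge : (N, N, N', N') similarity between source edge (i, j) and target
             edge (i', j').
    Normalizations by N and N(N - 1) follow the description in the text.
    """
    N, Np = M.shape
    node_term = np.einsum('ia,ia->', M, S_node) / N
    edge_term = np.einsum('ia,jb,ijab->', M, M, S_edge) / (N * (N - 1))
    return alpha * node_term + (1 - alpha) * edge_term
```

With α = 0.9 (the 2D setting), node similarity dominates; with α = 0.5 (the 3D setting), node and edge similarity contribute equally.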
Importantly, PAM also employs a prior (Equation 5) that favors isomorphic (one-to-one) mappings, where β is a parameter that controls the strength of the preference for isomorphism (higher values of β correspond to a stronger preference).
To implement the inference in Equation 1, we used a graduated assignment algorithm [22] that minimizes the energy function in Equation 6 (equivalent to maximizing the likelihood in Equation 2, subject to the prior in Equation 5). The algorithm starts with a low value of β and gradually increases it, so as to gradually approximate the one-to-one constraint. We used a fixed number of iterations (500) for all simulations, with an initial β value of β0 = 0.1, and an initial mapping matrix M0 corresponding to a uniform mapping (such that the strength of each mapping was set to 1/N). Algorithm 1 provides pseudocode for the mapping algorithm. Note that, in the algorithm, the edge similarity term is normalized by the number of edges per node (2(N − 1)), rather than the total number of edges, since the algorithm computes the compatibility of each potential node-to-node mapping separately (rather than summing across all potential mappings as in Equation 6).
Algorithm 1: Graduated assignment algorithm used in PAM.
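Since the algorithm figure is not reproduced here, the following is a hedged sketch of a graduated-assignment loop consistent with the description above. The geometric β growth schedule and the use of alternating row/column (Sinkhorn) normalization to softly enforce the one-to-one constraint are assumptions; the text specifies only the fixed iteration count, β0 = 0.1, and a uniform initial mapping.

```python
import numpy as np

def graduated_assignment(S_node, S_edge, alpha=0.9, iters=500,
                         beta0=0.1, beta_growth=1.01, sinkhorn_iters=10):
    """Sketch of PAM's graduated assignment (Gold & Rangarajan style)."""
    N, Np = S_node.shape
    M = np.full((N, Np), 1.0 / N)  # uniform initial mapping
    beta = beta0
    for _ in range(iters):
        # Compatibility of each candidate correspondence (i, a): node similarity
        # plus edge support, the latter normalized by the 2(N - 1) edges per node.
        Q = alpha * S_node \
            + (1 - alpha) * np.einsum('jb,ijab->ia', M, S_edge) / (2 * (N - 1))
        M = np.exp(beta * Q)
        # Softly enforce the one-to-one prior by alternating row/column
        # normalization (Sinkhorn iterations).
        for _ in range(sinkhorn_iters):
            M = M / M.sum(axis=1, keepdims=True)
            M = M / M.sum(axis=0, keepdims=True)
        beta *= beta_growth
    return M
```

As β grows, exp(β·Q) sharpens around the highest-compatibility correspondences, so the soft mapping gradually approaches a one-to-one assignment.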
We used the final mappings produced by PAM in the following ways. For each target node i′, we identified the source node i with the strongest mapping strength. For analogy problems with 2D images, we used this mapping to transfer labels from the source nodes to their corresponding target nodes. For analogy problems with 3D objects, we used this mapping to generate marker locations for the target object. First, we assigned each source marker to one of the clusters (i.e., object parts) in the source object, based on the distance of the marker's 3D coordinates to each cluster center. Then, we computed the location of the corresponding target marker by mapping the vector from the center of the source cluster to the source marker coordinates onto the center of the corresponding target cluster (as identified by PAM). Finally, we used the camera angle for the images presented to human participants to project these 3D coordinates onto the 2D images for comparison with human marker placements.
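The marker-transfer procedure for 3D objects can be sketched as follows (all names are illustrative; the final perspective projection onto the 2D image is omitted):

```python
import numpy as np

def transfer_marker(marker_src, src_centers, tgt_centers, mapping):
    """Transfer one 3D marker from the source to the target object (sketch).

    marker_src  : (3,) marker coordinates on the source object.
    src_centers : (K, 3) source cluster (part) centers.
    tgt_centers : (K', 3) target cluster centers.
    mapping     : dict from source cluster index to its mapped target cluster
                  (the strongest PAM mapping).
    """
    # 1) Assign the marker to its nearest source cluster.
    k = int(np.linalg.norm(src_centers - marker_src, axis=1).argmin())
    # 2) Carry the cluster-center-to-marker vector over to the mapped target cluster.
    return tgt_centers[mapping[k]] + (marker_src - src_centers[k])
```

The returned 3D location would then be projected to 2D pixel coordinates using the same camera angle shown to participants.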

Structured Set Matching Network
We compared visiPAM's performance to the results reported by Choi et al. [23] for their proposed Structured Set Matching Network (SSMN). The SSMN shares some interesting commonalities with visiPAM, as well as some important differences. Briefly, the SSMN operates by assigning a score to a specified mapping between source and target parts. This score is based on a combination of: 1) the similarity of the learned embeddings for the mapped parts, 2) a score assigned (by a learned neural network) to spatial relation vectors for mapped parts, 3) a score assigned (again by a learned neural network) to appearance relations for mapped parts, and 4) a hard isomorphism constraint that guarantees only one-to-one mappings are considered. VisiPAM differs from the SSMN in the following ways. First, rather than learning representations end-to-end in the service of the part-mapping task, as is done in the SSMN, visiPAM employs representations learned in the context of more general-purpose objectives: either self-supervised learning in the case of our experiments with 2D images, or part segmentation in our experiment with 3D objects. Second, whereas the SSMN scores relation similarity between source and target using a learned neural network, visiPAM explicitly computes the similarity of mapped relations. Together, these two features allow visiPAM to perform mapping without any direct training, whereas the SSMN relies on learned components that have the opportunity to overfit to the specific examples observed during training. Finally, the SSMN is designed to assign a score to a prespecified, one-to-one mapping, necessitating a search over deterministic mappings at inference time. VisiPAM, by contrast, employs a continuous relaxation of this search problem that allows it to converge much more efficiently on a soft, but approximately isomorphic, mapping.

Five of the 59 participants were removed from analysis either because they indicated they were not serious, or because they moved fewer than 30% of the markers in the entire experiment. Thirteen additional participants were removed because they did not move any of the markers in at least one of the conditions. Thus, data from a total of 41 participants were included in the analyses.

Procedure
We collected behavioral data for the 3D object mapping task using an online experiment coded in JavaScript. Each participant performed mapping for all 192 image pairs that were used to evaluate visiPAM. The experiment used a 2 (target category: animal vs. man-made object) × 2 (category consistency: different- vs. same-superordinate-category) design. Each condition consisted of 48 trials. On each trial, participants were presented with one image pair, with two colored markers displayed on both the source and target image. For each of the two colored markers, they were asked to 'move the marker on the top right corner in the target image to the corresponding location that maps to the same-color marker in the source image.' If the participant did not think there was an analogy between the two images, they were allowed to move the markers back to the top right corner. No time constraint was imposed. The entire experiment was completed in about 41 minutes on average. On each trial, the exact location of each marker placement was recorded.

Analysis
For our primary analysis, we computed the mean marker placement (in 2D pixel space) for each problem, and then computed the average distance of individual placements from these mean locations across participants. However, visual inspection of the pattern of marker placements across participants (as in Figure 3b) indicated the potential presence of multiple clusters within the responses to a given problem. Therefore, we used the KMeans++ algorithm [46] to identify these clusters, and performed an additional analysis based on the average distance to the closest cluster mean.
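The two distance measures can be sketched as follows (a minimal illustration; variable names are placeholders):

```python
import numpy as np

def mean_distance(placements):
    """Average distance of marker placements from their mean location.

    placements: (P, 2) pixel coordinates, one row per participant.
    """
    mean_loc = placements.mean(axis=0)
    return np.linalg.norm(placements - mean_loc, axis=1).mean()

def closest_cluster_distance(placements, centers):
    """Average distance from each placement to the nearest cluster mean
    (centers: (K, 2), e.g. cluster means identified by the KMeans++ algorithm)."""
    d = np.linalg.norm(placements[:, None] - centers[None], axis=-1)
    return d.min(axis=1).mean()
```

When responses form several distinct clusters, the second measure avoids penalizing participants for choosing a different (but internally consistent) interpretation of the analogy.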

Figure 1: Overview of visiPAM. VisiPAM contained two core components: a vision module and a reasoning module. The vision module received visual inputs in the form of either 2D images or point-cloud representations of 3D objects, and used deep learning components to extract structured visual representations. These representations took the form of attributed graphs, in which both nodes o_1..N (corresponding to object parts) and edges r_1..N(N−1) (corresponding to spatial relations between parts) were assigned attributes. The reasoning module then used Probabilistic Analogical Mapping (PAM) to identify a mapping M from the nodes of the source graph G to the nodes of the target graph G′, based on the similarity of mapped nodes and edges. Mappings were probabilistic, but subject to a soft isomorphism constraint (a preference for one-to-one correspondences).

Figure 2: VisiPAM part mappings with 2D images. Examples of part mappings identified by visiPAM for 2D image pairs. (a) Successful within-category animal mapping involving 10 separate parts. Note the significant variation in visual appearance between source and target. (b) Successful between-category animal mapping. (c) Within-category animal mapping illustrating a common error in which the left and right feet are mismapped. (d) Within-category vehicle mapping. Mapping between images of planes was especially difficult, likely due to significant variation in 3D pose.

Figure 3: Experiment measuring human performance for mapping between 3D objects. (a) Sample stimuli used in the human experiment. Participants were instructed to move markers in the target image to locations corresponding to same-color markers in the source image. Left panel: example trials with source and target images from the same superordinate object category. Right panel: example trials with images from different superordinate categories. (b) Example heatmaps of human marker placements on target images for two comparisons. The intensity of the color indicates the proportion of participants who placed the marker in that location. Source images have been reduced in size for the purpose of illustration. Left panel: marker placements were highly consistent across subjects when source and target came from the same superordinate category. Right panel: marker placements showed more variation across participants when source and target came from different superordinate categories.

Figure 4: VisiPAM part mappings between 3D objects. Examples of part mappings for 3D objects represented as point-clouds. Colors indicate the mappings identified by visiPAM. (a) Successful mapping between a chair and a dog. VisiPAM identified sensible mappings (e.g., legs of chair to legs of dog) despite the fact that the objects come from different superordinate categories. (b) Successful mapping between two chairs. (c) Mapping between two animals (camel and horse) displaying some sensible part mappings (e.g., head of camel to head of horse), but also some apparent mismappings (e.g., hump of the camel to tail of the horse). (d) Mapping between two chairs similarly displaying some mismappings (e.g., seat of the left chair to stand of the right chair).

Figure 5: Comparison of visiPAM with human analogy performance. Violin plot comparing human placements and visiPAM predictions. Strip plots (small colored dots) indicate the average distance of marker locations from the overall human mean, one point for each participant. Thin gray lines show the within-participant differences for the same- vs. different-superordinate-category conditions. Violin plots exclude data points greater than 2.5 standard deviations away from the mean, though these individual data points are included in the strip plots. Black horizontal lines indicate mean human distances in each condition, and error bars indicate standard deviations. Large green dots indicate visiPAM predictions (i.e., the distance between the model-predicted location and the human mean location). Both human and visiPAM mappings were more variable when mapping objects from different superordinate categories.

4.5 Human experiment

4.5.1 Participants
Fifty-nine participants (mean age = 20.55 years; 51 female) were recruited from the Psychology Department subject pool at the University of California, Los Angeles. All participants were compensated with course credit.