Auditory Sensory Substitution is Intuitive and Automatic with Texture Stimuli

Millions of people are blind worldwide. Sensory substitution (SS) devices (e.g., vOICe) can assist the blind by encoding a video stream into a sound pattern, recruiting visual brain areas for auditory analysis via crossmodal interactions and plasticity. SS devices often require extensive training to attain limited functionality. In contrast to conventional attention-intensive SS training that starts with visual primitives (e.g., geometrical shapes), we argue that sensory substitution can be engaged efficiently by using stimuli (such as textures) associated with intrinsic crossmodal mappings. Crossmodal mappings link images with sounds and tactile patterns. We show that intuitive SS sounds can be matched to the correct images by naive sighted participants just as well as by intensively trained participants. This result indicates that existing crossmodal interactions and amodal sensory cortical processing may be as important in the interpretation of patterns by SS as crossmodal plasticity (e.g., the strengthening of existing connections or the formation of new ones), especially at the earlier stages of SS usage. An SS training procedure based on crossmodal mappings could both considerably improve participant performance and shorten training times, thereby enabling SS devices to significantly expand the capabilities of blind users.

Participants were trained for 10 days, 1 hour per day, on the vOICe device. Training was performed individually with the same trainer in each session. Training for both the blind and sighted participants covered basic object localization and recognition, as well as two constancy tasks (rotation and shape constancy). Training with the vOICe device was always performed at a black-felt-covered table. Each session included the following tasks (in this order): length constancy, orientation constancy, and localization. Data were recorded for each task. These initial training tasks were followed by additional training for the remaining time in the hour. The additional training started with simple object centering and shape identification in the first session, followed by extended length or orientation constancy training in subsequent sessions. Initial training included: centering a white circle on the black-felt-covered board with vOICe, recognition of simple objects (such as distinguishing a square, triangle, and circle), and distinguishing an "L" from a backward L, an upside-down L, and a backward and upside-down L (i.e., a 7). Length constancy training involved estimating the lengths of lines at just one orientation angle at a time (such as only 90-degree lines), and orientation constancy training involved estimating angles with the head at only one tilt.
vOICe Training: The vOICe device
Participants used the vOICe device to learn the constancy tasks. The vOICe device uses a camera embedded in a pair of sunglasses, or a webcam attached externally to glasses. Sighted participants were requested to close their eyes during training and evaluation, and wore opaque glasses and/or a mask to block direct visual input. The camera provided a live video feed of the environment, and a small portable computer encoded the video into sound in real time.
The vOICe software was obtained online at seeingwithsound.com and was used for the video-to-sound encoding [3]. The training sessions were video recorded for later data analysis. The participants were informed of the recording and consented to it.
vOICe Training: Orientation constancy task
To evaluate orientation constancy, participants were presented with a bar (3 × 30 cm) at 6 different angles (6AFC: 0, 90, 45, −45, 22, or −22 degrees relative to vertical; clockwise rotations correspond to positive angles) and three potential head positions (vertical, tilted left, or tilted right) while using the vOICe device, and were then asked to determine the orientation angle of the bar. The experimenter placed the bar on a black-felt-covered wall in front of the seated participant and visually estimated each angle position to be presented.
Participants were told to tilt their head left, tilt it right, or hold it vertical (no tilt), and were permitted to determine the head tilt angle that they were most comfortable using in each trial (provided that their head remained stationary). One head position was requested for each trial. The participant was seated about 81 cm from the bar to be evaluated. The bar angles and head tilt positions (left, right, or vertical) were randomized for each session, with 15 total trials per task performance.
Participants performed the task once per session. No visual or tactile controls were performed.
Feedback was given following each task trial by the experimenter indicating the correct angle of the bar.
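The randomized trial structure described above can be sketched in a few lines. The constants and function below are a hypothetical reconstruction for illustration only; the study's actual trial-generation materials were not published:

```python
import random

ANGLES = [0, 90, 45, -45, 22, -22]          # bar angles in degrees from vertical (6AFC)
HEAD_TILTS = ["vertical", "left", "right"]  # requested head positions
N_TRIALS = 15                               # trials per task performance

def make_session_trials(seed=None):
    """Draw a randomized (bar angle, head tilt) pair for each trial in a session."""
    rng = random.Random(seed)
    return [(rng.choice(ANGLES), rng.choice(HEAD_TILTS)) for _ in range(N_TRIALS)]
```

Independent draws per trial are one plausible reading of "randomized for each session"; a balanced shuffle of a fixed angle-by-tilt list would be an equally reasonable design.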

vOICe Training: Length constancy task
To evaluate length constancy, participants were presented with 5 lengths of bars (5AFC: 3 cm by either 9, 12, 15, 18, or 21 cm), with the bar placed in one of four orientations (0, 90, 45, or −45 degrees relative to vertical; clockwise rotations correspond to positive angles). Participants were asked to determine the length of the bar independent of the angle at which it was presented. The participant was seated about 81 cm from the bar to be evaluated. Participants first performed the task with the vOICe device (original task) and then with vision (touch for the blind; control task) in each session. The bar lengths and angles were presented in randomized order in each session, which included 20 trials for each task performance (original and then control).
Feedback was given following each task trial by the experimenter, indicating the correct angle and length of the bar.

Figure S2 displays results for the Laplacian of Gaussian edge filter. All correlation analyses calculated the p-value for Pearson's correlation using a Student's t distribution (MATLAB corr function, two-tailed test).

Expt. 1: Image Complexity Measures
To examine image complexity, we defined complexity by a set of MATLAB edge-filter-based metrics (Supplementary Table S1). Edge metrics were computed by filtering the images with an edge filter (edge function in MATLAB) or corner filter (cornermetric function in MATLAB), averaging all pixels in each filtered image, and then averaging across the set of images. To illustrate each edge filter, an example filtered image is shown in each row of Supplementary Table S1 (the unfiltered image is shown in the table's title row). Four edge filters were tested: Laplacian of Gaussian (filters images with a Laplacian of Gaussian filter, and then looks for zero crossings), Minimum Eigenvalue (the minimum eigenvalue method of Shi and Tomasi), Prewitt (indicates edges where the gradient of the image is at a maximum), and Canny (calculates the gradient of the image using the derivative of a Gaussian filter and then indicates the local maxima). Further filter details can be found in the MATLAB function documentation; all filtering used default settings.
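As a concrete stand-in for this pipeline, a Prewitt-style complexity metric can be sketched in NumPy. Note the differences from the actual MATLAB code: MATLAB's edge function thresholds the gradient into a binary edge map, whereas this sketch averages the raw gradient magnitude, so absolute values will differ (the function names here are ours, not the study's):

```python
import numpy as np

def prewitt_edge_density(img):
    """Mean Prewitt gradient magnitude of a 2-D grayscale image.

    A rough analogue of edge(img, 'prewitt') followed by averaging all
    pixels, except that no binarizing threshold is applied.
    """
    kx = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
    ky = kx.T
    pad = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    return float(np.mean(np.hypot(gx, gy)))

def complexity_score(images):
    """Average the per-image edge densities over a set of images."""
    return float(np.mean([prewitt_edge_density(im) for im in images]))
```

A uniform image scores zero (no edges), while any luminance step raises the score, matching the intent of the edge-based complexity metric.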
Of the four edge filters tested, the best (albeit weak) correlation between the filter output and naive participants' performance was observed for the Laplacian of Gaussian (LOG) edge detector (rho = −0.35, p < 0.09), and the best correlation for the trained participants' performance was observed for the Prewitt edge detector (rho = −0.49, p < 0.02) (Supplementary Fig. S1 shows the LOG results; Supplementary Table S1 shows all results and examples of each filter). Additional metrics, such as the number of brightness levels and spatial repetitiveness, were also tested for correlation with bimodal matching performance, but generated weaker results (Supplementary Table S1). The weak negative correlation between complexity and matching percent correct indicates, on one hand, that complexity may make images less intuitive to interpret. Perhaps "complexity" can partially mask the crossmodal correspondences or dilute the crossmodally relevant information with unimodal noise. On the other hand, the weakness of the correlation may indicate that something else, such as the strength of the intrinsic crossmodal mapping, is a stronger factor. More importantly, a linear fit to the data indicated performance above chance even at the largest complexity values we tested, for both naive and trained participants (LOG edge detector, Supplementary Fig. S1). Even "complex" stimuli such as natural textures elicited well-above-chance performance, likely due to the direct selection of strong crossmodal mappings (such as coarse to fine spatial frequencies; images in
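The correlation procedure (Pearson's r with a two-tailed p-value from the Student's t distribution, as in MATLAB's corr) can be reproduced outside MATLAB. The sketch below assumes the standard test statistic t = r·sqrt((n−2)/(1−r²)) with n−2 degrees of freedom, and approximates the t tail probability by numerical integration so that no statistics library is required:

```python
import math

def t_pdf(x, df):
    """Density of the Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_tail(t, df, steps=20000, upper=60.0):
    """One-sided tail P(T > t), by trapezoidal integration of the density."""
    h = (upper - t) / steps
    s = 0.5 * (t_pdf(t, df) + t_pdf(upper, df))
    for k in range(1, steps):
        s += t_pdf(t + k * h, df)
    return s * h

def pearson_r_and_p(x, y):
    """Pearson's r with a two-tailed p-value from the Student's t distribution."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    p = 2 * t_tail(abs(t), n - 2)
    return r, p
```

The truncated-integral approximation is adequate for the moderate p-values reported here; for exact values one would use the regularized incomplete beta function (e.g., scipy.stats.pearsonr).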

Expt. 4: Matching Remembered Labels to vOICe Sounds.
The bimodal matching experiments described in Expts. 1-3 demonstrate that participants have the ability to crossmodally match vOICe sounds and images. Nevertheless, it is as yet unclear whether this crossmodal matching ability affects more conventional, essentially unimodal (i.e., auditory only) training with the sensory substitution device. In vOICe device training, participants are presented with an object or stimulus, are allowed to explore or listen to it via the device (without vision), and are then told a label for the object, such as "pencil" or "square". The participant is then asked to identify the objects when presented in random order. Our memory task was designed to be the same as this memory-based label task in vOICe training, but with intuitive sensory substitution stimuli instead of real objects. Sighted participants were given a label (1 through 4) to remember for each vOICe sound, and were then asked to recall the label when one of the vOICe sounds was played (sounds were presented in random order). To examine the relationship between this memory task (modeled on vOICe training) and crossmodal matching ability, the memory task was performed with the same stimuli as the bimodal matching task (Expt. 1, detailed above) by encoding the images with vOICe, and the correlation between the two tasks was calculated. Participant performance on the vOICe memory task (chance: 25%) significantly correlated with performance on the crossmodal audiovisual matching task (chance: 33%), with rho = 0.68 (p = 0.002) (Supplementary Fig. S2).
It is both interesting and surprising that the vOICe sounds corresponding to the images that were crossmodally intuitive were also easier to remember in this memory task. This result indicates that both the memory task and the crossmodal matching task reflect the same intuitiveness of the intrinsic crossmodal mappings. Therefore, intrinsic crossmodal mappings provide a common basis for sensory substitution training, as well as for adaptive behavior and scene perception in the real world with the device.

Replacement Figure References
This section references the images that are shown in the figures in the paper (for links to the images used in the experiment, see the section "Original Figure References"). Images in Figures 2, 3, and 4 were either generated by N. Stiles or obtained online and modified. The images in Figure 3 and Figure 4 are the same as Figure 2a columns 1, 2, and 4, and Figure 2c column 6.
In Figure 2a, image columns 1, 2, 3, 5, 6, 7, and 8 were generated by N. Stiles. The images in column 4 were partially obtained online from the following sources and then modified by N. Stiles.
In Figure 2b, image columns 3 and 7 were generated by N. Stiles. The images in column 1 were obtained online from the following sources and then modified by N. Stiles (images in descending order).
In Figure 2c, image columns 1, 3, 5, and 6 were generated by N. Stiles. The images in column 2 were partially obtained online from the following sources and then modified by N. Stiles: leaf texture (generated by N. Stiles) and wood texture.

Original Figure References
This section references the images used in the experiments (several cannot be shown in the paper due to copyright restrictions, but can be viewed here: http://neuro.caltech.edu/page/research/texture-images/). Images in Figures 2, 3, and 4 were either generated by N. Stiles or obtained online and modified. The images in Figure 3 and Figure 4 are the same as Figure 2a columns 1, 2, and 4, and Figure 2c column 6.
In Figure 2a, image columns 1, 2, 3, 5, 6, 7, and 8 were generated by N. Stiles. The images in column 4 were obtained online from the following sources and then modified by N. Stiles.