Introduction

Motion illusion is among the most impressive visual illusions1. In motion illusions, motion is perceived even though the relative positions of the observer and the observed object are unchanged. The Fraser–Wilcox illusion (FWI; reported in 1979) is a representative example of motion illusion2 and comprises a spiral pattern of repeating luminance gradients. The direction of the illusory motion varies from person to person. Since its first identification, numerous FWI variations have been reported (e.g.,3,4,5). Like the original FWI design, they are composed of a basic structure of light and dark gradients, although individual differences are greater in the strength of the perceived motion than in its direction.

The mechanism by which illusory motion occurs has long been debated. One possibility is that contrast intensity3,4,6,7 affects the processing speed of neurons and is thereby converted into motion perception. The effects of eye movements8 have also been discussed, but no clear conclusion has been reached. Because illusory motion reportedly elicits activity in cerebral areas involved in visual perception6,9,10, cerebral involvement in the mechanism of illusory motion has been suggested. Moreover, behavioral experiments suggest that motion illusions are perceived by animals such as rhesus monkeys11, cats12, lions13, guppies and zebrafish14, and fruit flies15. Studying the motion-perception mechanisms common to animals with developed visual systems has therefore attracted increasing attention.

With the recent development of deep neural networks (DNNs), research using them as tools to study brain function has become more active16. Comparing findings from DNNs with the operating principles of the brain makes it possible to interpret psychophysical or physiological processes in terms of computational principles. Given that one of the original motivations for DNN research was to uncover the essence of the brain by artificially reproducing the functions of the nervous system, such analogies remain relevant. DNNs have recently been applied in visual-perception research17,18. In illusion research, illusion-like phenomena have been reported in DNNs for classical size illusions19, the scintillating grid20, geometric illusions21, color illusions22,23,24, the flash-lag effect22, and gestalt closure25. DNNs also allow changes to the network structure and connection weights, alterations that cannot be applied to a living brain.

Our research group has studied motion illusions by attempting to reproduce them using DNNs and comparing the results with human perception. We previously focused on the relationship between the occurrence of an illusion and the predictive function of the brain26,27. In that study, we constructed a DNN model incorporating predictive coding theory28, a theoretical model of the cerebrum29,30,31, and trained it on first-person-view videos27. The DNN model predicted motion in the rotating snakes illusion to a degree similar to that of human perception, suggesting that the DNN model could be used as a tool for studying the subjective perception of motion illusions.

In the previous paper, we analyzed only the rotating snakes illusion as a representative example of motion illusion. In the present study, we analyzed a variety of motion illusions using the DNN model and also generated predictive images from ordinary static image datasets that included photographs and paintings. Additionally, we conducted psychophysical tests on human subjects and compared the DNN model's predictions with the results of human perception.

Methods

Deep neural networks

The connection-weight model of the trained DNN (PredNet; written in Chainer) used in this study was identical to the 500K model described previously27. In brief, PredNet is a DNN that predicts future video frames from a past time series of frames. It outputs a predicted image from a convolutional LSTM (long short-term memory) network and is trained so that the error (mean squared error) between the future real image and the prediction is reduced. The unique feature of this network is that the prediction error, rather than the real image itself, is the input to the convolutional LSTM network. To train the DNN, we used video from the First-person Social Interactions Dataset (http://ai.stanford.edu/~alireza/Disney/), which contains footage of days in the lives of eight subjects at the Disney World Resort in Orlando, Florida, recorded by cameras attached to hats worn by the walking subjects. In other words, PredNet is expected to learn the spatiotemporal characteristics of the world from first-person information. The connection-weight model used here was obtained by training on 500,000 video frames.
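The core idea, that the prediction error rather than the frame itself drives the network, can be illustrated with a deliberately simplified sketch. The scalar prediction weight and linear state update below are illustrative stand-ins for the convolutional LSTM, not the actual PredNet architecture.

```python
# Toy sketch of the predictive-coding loop behind PredNet (not the actual
# implementation): the input at each step is the prediction *error*.
import numpy as np

rng = np.random.default_rng(0)
H, W = 120, 160                      # frame size used in the paper
frames = rng.random((22, H, W))      # stand-in for a 22-frame video clip

state = np.zeros((H, W))             # recurrent state (stands in for the conv-LSTM)
w_pred = 0.9                         # toy prediction weight
lr = 0.1

for t in range(1, len(frames)):
    prediction = w_pred * state      # predicted frame t
    error = frames[t] - prediction   # prediction error: the network's input
    state = state + lr * error       # state update driven by the error
    mse = np.mean(error ** 2)        # training objective: mean squared error
    print(f"frame {t:2d}: MSE = {mse:.4f}")
```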

Test images

To test the predictions of the DNN model, we prepared five groups of test image stimuli: motion illusions (n = 300), modern art paintings (n = 300), realistic art paintings (n = 300), photographs of movable objects (n = 300), and photographs of still objects (n = 300). The motion illusions were originally created by Drs. Akiyoshi Kitaoka32 (299 images) and Eiji Watanabe33 (1 image). The images of art paintings were randomly collected from WikiArt (https://www.wikiart.org) according to their classification. The photographs were collected at random by icrawler34, a web-crawler framework (license = “noncommercial, modify”; keywords = “car,” “building,” “cat,” etc.), and then manually classified as “movable objects” (animals, vehicles, etc.) or “still objects” (buildings, mountains, etc.). Images were trimmed and scaled down to a final size of 160 × 120 pixels (width × height) to match the training images. The five groups of test image stimuli (1500 images in total) were shared as a “Visual Illusions Dataset”35.
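A minimal sketch of this collection pipeline, assuming icrawler's built-in GoogleImageCrawler and OpenCV resizing; the keyword, image count, and file paths are illustrative rather than the authors' exact settings, and the trimming step is reduced here to a plain resize.

```python
# Sketch: crawl images under a reuse-with-modification license filter,
# then scale each one to the 160 x 120 model input size.
import glob
import cv2
from icrawler.builtin import GoogleImageCrawler

crawler = GoogleImageCrawler(storage={'root_dir': 'raw/cat'})
crawler.crawl(keyword='cat', max_num=50,
              filters={'license': 'noncommercial,modify'})

for i, path in enumerate(glob.glob('raw/cat/*.jpg')):
    img = cv2.imread(path)
    if img is None:                      # skip unreadable downloads
        continue
    img = cv2.resize(img, (160, 120))    # (width, height) in OpenCV
    cv2.imwrite(f'test/cat_{i:04d}.png', img)
```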

Prediction

The DNN model predicted the 22nd image (P1 image) from 21 consecutive input images, which were 21 copies of one test image. The network then predicted the 23rd image (P2 image) from 22 consecutive images, using the P1 image as the 22nd image. The optical flow vectors between the P1 and P2 images were then calculated by the Lucas–Kanade36 and Farneback37 methods using a customized Python program (window size 50 and quality level 0.3 for Lucas–Kanade; window size 10, stride 5, and min_vec 0.01 for Farneback). The details of the protocol are essentially the same as in the two papers cited above. In brief, feature points are extracted sparsely in the Lucas–Kanade method and densely in the Farneback method. The optical flow between the two images is then calculated using a least-squares criterion, starting from the pixels of the feature points. Both methods assume that the flow is essentially constant in a local neighborhood of each feature point. See Fig. 4 for examples of the two analysis methods.
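A sketch of this step with OpenCV's standard implementations of the two methods. The window size (50) and quality level (0.3) for Lucas–Kanade and the window size (10) for Farneback follow the paper; the stride and min_vec options belong to the authors' customized program, so the remaining arguments below are common defaults, not their exact values.

```python
# Optical flow between the two predicted frames (P1, P2).
import cv2
import numpy as np

p1 = cv2.imread('P1.png', cv2.IMREAD_GRAYSCALE)   # predicted frame 22
p2 = cv2.imread('P2.png', cv2.IMREAD_GRAYSCALE)   # predicted frame 23

# Sparse flow (Lucas-Kanade): corners as feature points, then tracking.
pts = cv2.goodFeaturesToTrack(p1, maxCorners=100, qualityLevel=0.3,
                              minDistance=7)
nxt, status, _ = cv2.calcOpticalFlowPyrLK(p1, p2, pts, None,
                                          winSize=(50, 50))
vectors = (nxt - pts)[status.flatten() == 1]      # flow at tracked points

# Dense flow (Farneback): one flow vector per pixel.
flow = cv2.calcOpticalFlowFarneback(p1, p2, None, pyr_scale=0.5, levels=3,
                                    winsize=10, iterations=3, poly_n=5,
                                    poly_sigma=1.2, flags=0)
magnitude = np.linalg.norm(flow, axis=2)          # per-pixel flow magnitude

print(len(vectors), magnitude.mean())
```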

Psychophysical experiment

The visual stimuli used in the psychophysical experiment were created by cropping regions from a photograph (image A) and a painting (image B), as shown in Fig. 4. Each cropped image was duplicated 20 times and combined horizontally, and the combined image was transformed into a ring using the warpPolar method, the polar-coordinate conversion function in OpenCV (v.4.2.0.32; https://opencv.org). A white rectangular image was concatenated under the cropped image to reduce the distortion caused by the deformation. The width w of the white image is the same as the width x of the cropped image, and its height h is obtained from the following equation for the circumference of a circle with radius r = h + y/2, where y is the height of the cropped image:

$$\begin{aligned} n_sx=2\pi r=2\pi \left( h+\frac{y}{2}\right) \end{aligned}$$
(1)

where \(n_s\) is the number of repetitions (20). The value of h is obtained from the following equation:

$$\begin{aligned} h=\frac{1}{2}\left( \frac{n_sx}{\pi } - y\right) \end{aligned}$$
(2)

The values were rounded down to integers. The sizes of cropped images A and B were 9 × 26 pixels and 8 × 18 pixels (width × height), giving h = 15 pixels and 16 pixels, respectively. The image size output from the polar conversion function was set to 1024 × 1024 pixels, and the output images were used for the psychophysical experiment. To test the predictions of the DNN model, these images were further resized to 120 × 120 pixels and placed at the center of a white image of 160 × 120 pixels.
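A hedged reconstruction of the ring construction for image A: tile the cropped patch 20 times, pad with the white band of height h from Eq. (2), and apply OpenCV's inverse polar transform. The row/column orientation fed to warpPolar and the placement of the white band on the inner-radius side are our assumptions about the exact procedure; the file names are hypothetical.

```python
import cv2
import numpy as np

n_s, x, y = 20, 9, 26                        # repetitions and patch size (image A)
h = int((n_s * x / np.pi - y) / 2)           # Eq. (2), rounded down -> 15 px

patch = cv2.imread('patch_A.png')            # hypothetical file: the 9 x 26 crop
band = np.rot90(patch)                       # angle along rows, radius along columns
tiled = np.vstack([band] * n_s)              # repeat the pattern 20 times around
white = np.full((tiled.shape[0], h, 3), 255, np.uint8)
polar = np.hstack([white, tiled])            # white band toward the ring center

size = 1024                                  # output size used in the paper
ring = cv2.warpPolar(polar, (size, size), (size // 2, size // 2),
                     maxRadius=size // 2,
                     flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
cv2.imwrite('ring_A.png', ring)
```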

The psychophysical experiment was designed based on the method of Hisakata et al.38 and conducted using a program written in Python with OpenGL (v.3.1.5; https://pypi.org/project/PyOpenGL/). The subjects were the authors T.K. and E.W. plus three naïve subjects (n = 5; all healthy subjects with normal vision). In a two-alternative forced-choice task, the subjects answered by keyboard input whether they saw the stimulus rotating clockwise (CW) or counterclockwise (CCW). Each subject's face was fixed at 50 cm from the screen, and only the right eye was used for viewing. A gazing point with a viewing angle of 1\(^{\circ }\) was established at the center of a white background, and the stimulus, with an outer diameter of 7\(^{\circ }\) and an inner diameter of 1\(^{\circ }\), was presented 12\(^{\circ }\) to the left of the center for 0.5 s. The subjects looked at the gazing point and viewed the stimulus with their peripheral vision. When they responded, the next stimulus was played, with a minimum of 1 s between the end of one stimulus presentation and the playback of the next.

To quantitatively examine the illusory motion of the stimuli, we intentionally rotated the stimuli and determined the conditions under which motion perception did not occur. We prepared two types of images, the original image and its left–right reversed version, to counteract any perceived rotation-velocity bias. The intentional rotational velocities ranged from −2.1 to +2.1 \(^{\circ }\)/s in intervals of 0.3 \(^{\circ }\)/s, giving a total of 15 velocities. To statistically analyze the responses, we presented each condition 30 times, with stimulus types and rotation velocities varied randomly. With 2 image types, 15 velocities, and 30 repetitions, each subject was presented with 900 stimuli in total. From the data obtained by this procedure, we calculated the rotational velocity at which the stimulus was equally likely to be judged as rotating CW or CCW.
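A minimal sketch of the resulting trial schedule; the condition labels are illustrative, but the counts follow the design above.

```python
# Build and shuffle the 2 x 15 x 30 = 900-trial schedule.
import random
import numpy as np

velocities = np.linspace(-2.1, 2.1, 15)      # 15 velocities in 0.3 deg/s steps
trials = [(stim, v)
          for stim in ('original', 'reversed')
          for v in velocities
          for _ in range(30)]                # 30 repetitions per condition
random.shuffle(trials)                       # random presentation order
assert len(trials) == 900                    # 900 stimuli per subject
```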

Figure 6 shows the raw data of the psychophysical experiment. The horizontal axis represents the velocity at which the stimulus was intentionally rotated (with CCW as the positive direction), and the vertical axis represents the probability of responding that the stimulus rotated CCW. Each psychometric curve was fitted with a cumulative Gaussian function, and the rotation-cancellation velocity was taken as the velocity at which the probability was 0.5. The rotation-cancellation velocity is the velocity required to cancel the rotation of the presented image, and the rotational velocity of the illusory motion is this velocity multiplied by −1. Using the original and reversed stimuli, we therefore calculated the rotational velocity of the stimulus as follows:

$$\begin{aligned} \frac{1}{2}\{\text {(the static rotational velocity of the reversed stimulus) - (the static rotational velocity of the original stimulus)}\} \end{aligned}$$
(3)
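A sketch of the fit and the Eq. (3) combination, assuming a least-squares cumulative Gaussian fit with scipy; the response probabilities below are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def cum_gauss(v, mu, sigma):
    # Cumulative Gaussian psychometric function; P = 0.5 exactly at v = mu.
    return norm.cdf(v, mu, sigma)

velocities = np.linspace(-2.1, 2.1, 15)
# Made-up fractions of CCW answers, for illustration only.
p_ccw = np.array([0.03, 0.03, 0.07, 0.10, 0.17, 0.27, 0.40, 0.57,
                  0.70, 0.80, 0.90, 0.93, 0.97, 0.97, 1.00])

(mu, sigma), _ = curve_fit(cum_gauss, velocities, p_ccw, p0=[0.0, 1.0])
cancel_velocity = mu                   # rotation-cancellation velocity
illusion_velocity = -mu                # illusory rotation = -1 x cancellation

# Eq. (3): with mu_orig and mu_rev from two fits like the one above,
# stimulus_velocity = 0.5 * (mu_rev - mu_orig)
```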

Ethics statement

The study protocol followed the Declaration of Helsinki and was approved by the Ethics Committee of the National Institute for Physiological Sciences (permit No. 20A063). The psychological experiments were performed with the informed consent of all subjects, which included permission to disclose the subjects' initials.

Open-source software

All program code (DNN, optical flow analysis, and psychophysical stimulus-presentation software), trained models, and stimulus images were released as open-source software at the following websites:

  1. DNN: https://doi.org/10.6084/m9.figshare.5483710

  2. Optical flow analysis: https://doi.org/10.6084/m9.figshare.5483716

  3. Psychophysical experiment: https://github.com/taikob/Motion_Illusion_test

  4. Trained model: https://doi.org/10.6084/m9.figshare.11931222

  5. Stimulus images: https://doi.org/10.6084/m9.figshare.9878663

Results

Figure 1 shows examples of the optical flow vectors detected in the images predicted by the model for the five stimulus groups using the Lucas–Kanade method. Whereas relatively large and/or well-aligned optical flow vectors were detected in the images predicted from motion illusions, only relatively small optical flows were detected in the images predicted from the other groups. The direction of the motion vectors detected from the motion illusions agreed with the direction of the illusory motion perceived by humans. Notably, directed optical flows were detected not only in illusions with many colors, shapes, and gradients but also in an illusion composed of simple white triangles (Fig. 1, upper left). As a fundamental property of the methodology, the Lucas–Kanade method extracts characteristically shaped objects from images as feature points and uses them as the starting points of the optical flow. Therefore, the selection of eyes and hands as starting points (e.g., in the Mona Lisa and the photograph of President Obama) simply reflects their extraction as feature points and carries no particular meaning.

For quantitative analysis, we evaluated the frequency rates of the absolute values of the optical flow vectors detected from each image group (top two graphs in Fig. 2) and calculated the averages of the absolute values for each group (Fig. 3). The top two graphs in Fig. 2 show a noticeable difference in the frequency distribution between the motion-illusion images and the other image groups. In particular, near the modes of the non-illusion groups, the frequency for motion illusions was much lower than for the other groups. The average values were as follows: motion illusions, 0.71 ± 0.18 (arbitrary units; Lucas–Kanade) and 0.64 ± 0.034 (Farneback); modern art paintings, 0.052 ± 0.0029 and 0.24 ± 0.021; realistic art paintings, 0.036 ± 0.00088 and 0.092 ± 0.0014; movable-object photographs, 0.035 ± 0.0013 and 0.11 ± 0.0020; and still-object photographs, 0.037 ± 0.0013 and 0.11 ± 0.0026. These results indicate that larger optical flow vectors were detected in the motion-illusion group than in the other groups, a tendency that held for both the Lucas–Kanade and Farneback analyses. These findings suggest that the DNN model reliably separated the group of illusory images from the others.
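A sketch of the frequency-rate computation behind Fig. 2, assuming the magnitudes of all flow vectors in a group are pooled, binned with a 0.01 bin width, and normalized so the frequencies sum to 1; the function name and pooling step are our illustrative framing.

```python
import numpy as np

def frequency_rate(magnitudes, bin_width=0.01):
    """Normalized histogram of flow-vector magnitudes (frequencies sum to 1)."""
    bins = np.arange(0.0, magnitudes.max() + bin_width, bin_width)
    counts, edges = np.histogram(magnitudes, bins=bins)
    return edges[:-1], counts / counts.sum()

# Usage: `all_magnitudes` would be the absolute flow values pooled over a
# group's 300 images, e.g.:
# centers, rate = frequency_rate(all_magnitudes)
```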

However, as shown in Fig. 2, relatively large optical flow vectors were also detected in a few images from the non-illusion groups. To investigate the cause of these exceptionally large vectors, we identified two images (one photograph and one painting) in which notably large optical flows were predicted and compared their P1 and P2 images in detail. The first row of Fig. 4 shows that two large optical flows were detected in the area of the building on the left side of the photograph using the Lucas–Kanade method, whereas the Farneback method detected dense optical flow in the same region. Similarly, for the painting, characteristic optical flows were detected on the columns on its left side (third row of Fig. 4). The bottom two graphs in Fig. 2 show the distributions of the absolute values of the optical flow vectors for images A and B. Most of the optical flows detected in images A and B were far from the frequency-rate peaks of the still-object photograph and realistic-art groups (top two graphs in Fig. 2). The maximum absolute values of the optical flow vectors were 0.68 (Lucas–Kanade) and 4.0 (Farneback) for image A and 0.32 and 1.5 for image B. Figure 5 plots the brightness values where an exceptionally large optical flow was detected. Comparing P1 (Fig. 5, green line) with P2 (Fig. 5, blue line) revealed a shift between the two brightness distributions. These results indicate that the DNN model incorrectly predicted motion in static images that a human would not recognize as moving.
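A minimal sketch of the luminance-profile comparison in Fig. 5, assuming the predicted frames are converted to grayscale with OpenCV and read out along a horizontal line; the file names and row index are illustrative.

```python
import cv2

# Load the two predicted frames and convert to grayscale, as in Fig. 5.
p1 = cv2.cvtColor(cv2.imread('P1.png'), cv2.COLOR_BGR2GRAY)
p2 = cv2.cvtColor(cv2.imread('P2.png'), cv2.COLOR_BGR2GRAY)

row = 60                         # hypothetical scanline through the flow region
profile_p1 = p1[row, :]          # brightness along the line in P1
profile_p2 = p2[row, :]          # a horizontal offset between these two
                                 # profiles indicates the predicted shift
```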

We then hypothesized that the patterns producing the exceptionally large optical flows detected in the photograph and the painting might exhibit characteristics of motion illusions. We focused on the fact that most motion illusions have a repeating structure. Therefore, the areas where the large optical flows were detected (Fig. 5, boxed regions) were excised and reassembled into circular repeating structures (Fig. 6), and psychophysical experiments were performed on five human subjects. Testing the effect of the reconstructed designs on human perception showed that these artificially created designs produced a type of motion illusion (Fig. 6). The strength of the detected rotational velocity and the relative relationship between images A and B differed among the five subjects (TK: A, −0.36 \(^{\circ }\)/s and B, 0.31 \(^{\circ }\)/s; TS: A, −0.48 \(^{\circ }\)/s and B, 0.96 \(^{\circ }\)/s; EW: A, −0.18 \(^{\circ }\)/s and B, 0.29 \(^{\circ }\)/s; AT: A, −0.21 \(^{\circ }\)/s and B, 0.41 \(^{\circ }\)/s; YK: A, −0.37 \(^{\circ }\)/s and B, 0.51 \(^{\circ }\)/s). However, for a given design, there was no individual difference in the direction of the detected rotational velocity. The direction of perceptual motion estimated from the optical flow analysis (Fig. 4) and the luminance analysis (Fig. 5) of the DNN-predicted images was counterclockwise for both illusion-like designs. This estimated direction coincided with the direction of perceptual rotation obtained in the psychophysical experiment for the illusion-like design derived from image A, but not for the design derived from image B; this pattern was the same for all subjects. Next, each illusion-like design was input into the DNN model, predictive images were generated, and optical flow and luminance-distribution analyses were performed (second and fourth rows in Fig. 4). Large optical flows and luminance shifts were observed for the illusion-like design derived from image A, but they did not indicate rotational motion in a single direction; for the design derived from image B, only small flows and luminance shifts were observed.

Figure 1

Example images from the dataset and detected optical flows calculated using the Lucas–Kanade method. Four images are shown for each image group: motion illusions, modern art (paintings), realistic art (paintings), movable objects (photographs), and still objects (photographs). Each image group comprises a dataset of 300 images. (a) Original images input to the DNN model [image size: 160 × 120 pixels (width × height)]. (b) Detected optical flow vectors in the predicted images. Yellow points represent the starting points of the detected optical flow vectors, and red lines represent their directions and sizes. The length of each red line is 50 times the absolute value of the motion vector calculated using the Lucas–Kanade method. (c) The vectors drawn over the P1 predicted images.

Figure 2

Absolute values of the optical flow vectors. The vectors were calculated using the Lucas–Kanade (left two graphs) and Farneback (right two graphs) methods. Frequency rates of the absolute values of the optical flow vectors for each image group are shown as five solid colored lines. The total number of vectors in each group was standardized to 1 for the frequency rate, and the sampling-window size for the frequency rate was 0.01. Note that the horizontal axis is logarithmic. Large vectors, which were rarely observed in the non-illusion groups, were detected in two images; the distributions of the absolute values of the optical flow vectors for these images (A and B, shown in Fig. 4) are indicated by blue dots (image A) and orange dots (image B). The values derived from images A and B include relatively large vectors.

Figure 3

Average values of the optical flow vectors for each image group. The vectors were calculated using the Lucas–Kanade (left) and Farneback (right) methods. The mean value of the vectors detected in each image was calculated, and the average values and standard errors of 300 images were calculated for each group.

Figure 4

Two examples of exceptional image stimuli from which the DNN model detected relatively large optical flow vectors. Image A was derived from the photograph images and image B from the painting images. Yellow points represent the starting points of the detected optical flow vectors, and red lines represent their mean directions and sizes. The lengths of the red lines are 50 and 4 times the absolute values of the motion vectors calculated using the Lucas–Kanade and Farneback methods, respectively. The original images input to the DNN model are shown on the far left. The optical flow analyses of the illusion-like designs created from portions of images A and B (used in the psychophysical experiment shown in Fig. 6) are shown in the rows below the analyses of the original images. The bottom row shows the results of the analysis of the rotating snakes illusion for reference27.

Figure 5

Brightness distributions of images A and B. The predicted images (P1 and P2) were calculated using the DNN model following input of the original images. These images were converted into grayscale using OpenCV, and brightness values were calculated. Brightness values along the yellow lines in images (a) A and (b) B, where large optical flows were detected (Fig. 4), are plotted in (c) and (d), respectively. The dotted line represents the original image, the green line the P1 image, and the blue line the P2 image. The boxed region in each image (the red area in (c) and (d)) was used to create the motion-illusion-like designs.

Figure 6

The psychophysical experiments. The boxed regions in images A and B (Fig. 5) were excised to create two motion-illusion-like designs (the ring-shaped designs inserted into the figure), followed by psychophysical experiments on five subjects. Each psychometric curve was fitted with a cumulative Gaussian function using the least-squares method. The probability of seeing counterclockwise (CCW) rotation is plotted against the rotational velocity. The red and blue charts for each subject correspond to the original image and its left–right reversed version, respectively. The dots and curves represent the raw probability data and the best-fit curves, respectively. For the design from image A, the subjects showed a high probability of answering that it rotated CCW, whereas for image B, they showed a high probability of answering that it rotated CW.

Discussion

In this study, we showed that the DNN model distinguished between the illusion group and the other groups of ordinary photographs and paintings, although it occasionally predicted motion in some parts of the ordinary images (Fig. 4). Interestingly, we were able to create new motion illusions from those parts of the images. It is possible that motion illusions contain small unit structures and that a background of normal scenery suppresses the occurrence of illusory motion when a unit structure appears in isolation within it. We previously suggested the existence of such unit structures in our recent study of the rotating snakes illusion and the Fraser–Wilcox illusion39.

The illusion-like designs shown in Fig. 6 are among the first illusions to be discovered with the aid of artificial intelligence. Other reported examples include illusion generators based on an evolutionary algorithm40, a generative adversarial network41, and a statistical model42. In any case, a module that artificially models human vision is essential for generating illusions, and the type of illusion created is presumably influenced by the type of brain function modeled in that module. In the evolutionary algorithm, motion illusions were generated using the same connection-weight model as in this paper; in the generative adversarial network, color and contrast illusions were generated by CNNs trained on static images; and in the statistical model, color illusions were generated by a patch-likelihood estimation model trained on static natural images. Such methodologies would not only synthesize new visual illusions useful for vision researchers but also provide a new way to study the similarities and differences between artificial models and human visual perception.

Many motion illusions present a “repetition” of unit structures. As noted above, we presume that even a single unit structure can potentially cause the perception of motion. However, humans do not perceive illusory motion from a single unit structure alone, which suggests that local information might lead to the perception of motion only when it is combined with global information. Supporting evidence indicates that a wide range of brain regions, from V1 to MT+, is involved in the perception of motion illusions9, with higher regions (e.g., MT+) thought to integrate information from a broader perspective than V1. The DNN model detected motion flow in unit structures embedded in the photographs and paintings that humans did not perceive as moving (first and third rows in Fig. 4). For the illusion-like designs that repeat unit structures extracted from the photograph or the painting, humans perceived motion, but neither the direction nor the relative magnitude of that motion fully matched the DNN model's predictions (Fig. 6 and the second and fourth rows in Fig. 4). This discordance could indicate that the DNN model is underdeveloped in its ability to integrate global information, an ability thought to be implemented in higher brain regions and other areas. Further studies are required before artificial perception can fully serve as a tool for basic research on human perception.