Background & Summary

Machine learning (ML) is increasingly being used to analyze videos of spermatozoa under a microscope for developing computer-aided sperm analysis (CASA) systems1,2. In the last few years, several studies have investigated the use of deep neural networks (DNNs) to automatically determine specific attributes of a semen sample, such as the proportion of progressive, non-progressive, and immotile spermatozoa3,4,5,6,7. However, a major challenge with using ML for semen analysis is the general lack of data for training and validation. Only a few open labeled datasets exist (Table 1), and most of them focus on still frames of fixed and stained spermatozoa or on very short video sequences intended for analyzing sperm morphology.

Table 1 Overview of existing sperm datasets.

In this paper, we present a multi-modal dataset containing videos of spermatozoa with the corresponding manually annotated bounding boxes (localization) and additional clinical information about the sperm providers from the original study8. This dataset is an extension of our previously published dataset VISEM8, which included videos of spermatozoa labeled with quality metrics following the World Health Organization (WHO) recommendations9.

Several datasets related to spermatozoa have been released. For example, Ghasemian et al.10 published an open sperm dataset called HSMA-DS (Human Sperm Morphology Analysis DataSet) containing normal and abnormal spermatozoa. Experts annotated abnormalities of four features, namely the vacuole, tail, midpiece, and head. The presence of an abnormality in each feature was marked with a binary label, where 1 denotes abnormal and 0 denotes normal. In total, there are 1,457 spermatozoa for morphology analysis. The images were captured at ×400 and ×600 magnification. The Modified Human Sperm Morphology Analysis Dataset (MHSMA)11 consists of 1,540 cropped images from the HSMA-DS dataset10. This dataset was collected for analyzing the different parts of the sperm (morphology). The maximum image size in the dataset is 128 × 128 pixels.

The HuSHEM12 and SCIAN-MorphoSpermGS13 datasets consist of images of sperm heads captured from fixed and stained semen smears. The main purpose of these datasets is sperm morphology classification into five categories, namely normal, tapered, pyriform, small, and amorphous. SMIDS14 is another dataset consisting of 3,000 images cropped from 200 stained ocular images from 17 subjects aged 19–39 years. Of the 3,000 images, 2,027 patches were manually annotated as normal or abnormal. The other 973 samples were classified as non-sperm using spatial-based automated features. McCallum et al.15 published a similar dataset of 1,064 cropped bright-field sperm images from six healthy participants. The main purpose of this dataset is to find correlations between sperm images obtained by bright-field microscopy and sperm DNA quality. However, none of these datasets provides motility or kinematic features of the spermatozoa.

Chen et al.16 introduced a sperm dataset called SVIA (Sperm Videos and Images Analysis dataset), which contains 101 short video clips of 1 to 3 seconds each and the corresponding manually annotated objects. The dataset is divided into three subsets, namely subset-A, subset-B, and subset-C. Subset-A contains 101 video clips (30 FPS) with 125,000 object locations and the corresponding categories. Subset-B contains 10 videos with 451 ground truth segmentation masks, and subset-C consists of cropped sperm images for classification into two categories (impurity images and sperm images). The provided video clips are very short compared to those in VISEM-Tracking. Our dataset17 contains 7× more annotated video frames. In addition, VISEM-Tracking contains 2.3× more annotated objects than SVIA.

VISEM-Tracking offers annotated bounding boxes and sperm tracking information, making it more valuable for training supervised ML models than the original VISEM dataset8, which lacks these annotations. This additional data enables a variety of research possibilities in both biology (e.g., comparing with CASA tracking) and computer science (e.g., object tracking, integrating clinical and tracking data). Unlike other datasets, VISEM-Tracking’s motility features facilitate sperm identification within video sequences, resulting in a richer and more detailed dataset that supports novel research directions. Potential applications include sperm tracking, classifying spermatozoa based on motility, and analyzing movement patterns. To the best of our knowledge, this is the first open dataset of its kind.


The videos for this dataset were originally obtained to study overweight and obesity in the context of male reproductive function18,19. In the study, male participants aged 18 years or older were recruited between 2008 and 2013. Further details on the recruitment can be found in the original VISEM publication8. The study was approved by the Regional Committee for Medical and Health Research Ethics, South East, Norway (REK number: 2008/3957). All participants provided written informed consent and agreed to the publication of the data. The original project was finished in December 2017, and all data were fully anonymized.

The samples to be recorded were placed on a heated microscope stage (37 °C) and examined at 400× magnification using an Olympus CX31 microscope. The videos were recorded with a microscope-mounted UEye UI-2210C camera made by IDS Imaging Development Systems in Germany. According to the WHO recommendations9, a light microscope equipped with phase-contrast optics is necessary for all examinations of unstained preparations of fresh semen. The videos are saved as AVI files. Motility assessment was performed on the videos following the WHO recommendations9.

The bounding box annotation was performed by data scientists in close collaboration with researchers in the field of male reproduction. The data scientists labeled each video using the tool LabelBox, and the annotations were then verified by three biologists to ensure that they were correct. In addition to the per-sperm tracking annotation, we provide a class label for each annotated object: ‘normal sperm’, ‘pinhead’, or ‘cluster’. The pinhead category consists of spermatozoa with abnormally small black heads within the view of the microscope. The cluster category consists of several spermatozoa grouped together. Sample annotations are presented in Fig. 1. The red boxes represent normal spermatozoa, which constitute the majority of this dataset and are also biologically the most relevant. The green boxes represent sperm clusters, in which a few spermatozoa are grouped so closely that it is hard to annotate them separately. The blue boxes represent small or pinhead spermatozoa, which are smaller than normal spermatozoa and have very small heads compared to a normal sperm head.

Fig. 1

Video frames of wet semen preparations with corresponding bounding boxes. Top: large images showing the different classes of bounding boxes (red: sperm, green: sperm cluster, blue: small or pinhead sperm). Bottom: frames showing different sperm concentration levels, from high to low (left to right).

Data Records

VISEM-Tracking is available at Zenodo, and the data are licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). The dataset contains 20 videos (collected from 20 different patients), each with a fixed duration of 30 seconds, together with the corresponding annotated bounding boxes. The 20 videos were chosen based on how much they differ from the other videos in the dataset, in order to obtain tracking samples that are as diverse as possible. Since each video in the original dataset lasts longer than 30 seconds, we also provide, in addition to the annotated clips, the remaining footage of the 20 annotated videos as 166 30-second clips, and 336 30-second clips from the videos of the VISEM dataset8 that were not used to provide tracking information. This makes the data easy to use in future studies that explore more advanced methods such as semi- or self-supervised learning20.

A length of 30 seconds was chosen to make it easier to annotate and process the video files. These videos can also be used for a possible extension of the tracking data in the future. The splitting process of the long videos is presented in Fig. 2. More details about the dataset itself are summarized in Table 2.

Fig. 2

Splitting videos into 30-second clips. Green represents the clip that was manually annotated with bounding boxes around the spermatozoa. Orange represents the remaining 30-second clips, which are included in the unlabeled dataset. Purple represents the final part of a video, which is shorter than 30 seconds. These endings are not included in our dataset so that all clips have a consistent length of 30 seconds.
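The splitting scheme in Fig. 2 can be illustrated with a minimal sketch. The function name, the frame-based indexing, and the example frame count are assumptions made for illustration and are not taken from the dataset tooling; only the 30-second clip length and the rule of dropping the short ending come from the description above.

```python
def clip_boundaries(total_frames: int, fps: float, clip_seconds: float = 30.0):
    """Return (start_frame, end_frame) pairs for full-length clips only.

    end_frame is exclusive. The trailing part of the video that is shorter
    than one clip is dropped, mirroring the purple section in Fig. 2.
    """
    frames_per_clip = int(round(fps * clip_seconds))
    if frames_per_clip <= 0:
        raise ValueError("fps and clip_seconds must be positive")
    n_full_clips = total_frames // frames_per_clip
    return [
        (i * frames_per_clip, (i + 1) * frames_per_clip)
        for i in range(n_full_clips)
    ]


# Hypothetical recording: 5,000 frames at 50 FPS gives three full clips,
# and the last 500 frames (10 seconds) are discarded.
print(clip_boundaries(5000, fps=50))
```

In this scheme the first pair would correspond to the annotated (green) clip and the remaining pairs to the unlabeled (orange) clips.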

Table 2 Summary of quantitative information about the VISEM-Tracking dataset.

The folder containing the annotated videos has 20 sub-folders, one per video. Each sub-folder contains a folder with the extracted frames of the video, a folder with the bounding box labels of each frame, and a folder with the bounding box labels and the corresponding tracking identifiers. In addition, the complete video file (.mp4) is provided in the same sub-folder. All bounding box coordinates are given in the YOLO21 format. The folder with tracking identifiers contains ‘.txt’ files with unique tracking IDs that identify individual spermatozoa throughout the video. It is worth noting that the area of the bounding box of the same sperm changes over time depending on its position and movement in the video, as depicted in Fig. 3. Moreover, the text files contain the class labels 0: normal sperm, 1: sperm cluster, and 2: small or pinhead sperm. Additionally, the file sperm_counts_per_frame.csv provides the per-frame counts of sperm, clusters, and small or pinhead sperm.
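As a rough illustration of how such labels might be read, the sketch below parses one line in the standard YOLO text format (class index followed by the normalized center coordinates, width, and height) and converts it to pixel coordinates. The names `Box` and `parse_yolo_line` and the 640 × 480 frame size in the example are assumptions for illustration. The files with tracking identifiers carry an additional ID field whose position should be checked against the files themselves, so they are not handled here.

```python
from typing import NamedTuple

# Class indices as described for the label files.
CLASS_NAMES = {0: "normal sperm", 1: "sperm cluster", 2: "small or pinhead sperm"}


class Box(NamedTuple):
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def area(self) -> float:
        """Bounding box area in square pixels."""
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def parse_yolo_line(line: str, img_w: int, img_h: int) -> Box:
    """Parse 'class x_center y_center width height' (values normalized to [0, 1])."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    class_id = int(fields[0])
    xc, yc, w, h = (float(v) for v in fields[1:])
    return Box(
        class_id,
        (xc - w / 2) * img_w,
        (yc - h / 2) * img_h,
        (xc + w / 2) * img_w,
        (yc + h / 2) * img_h,
    )


# Illustrative line and frame size (not taken from the dataset).
box = parse_yolo_line("0 0.5 0.5 0.25 0.5", img_w=640, img_h=480)
print(CLASS_NAMES[box.class_id], box.area)
```

Computing `area` for the same tracking ID across consecutive frames would give a curve of the kind shown in Fig. 3.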

Fig. 3

Illustration of how the bounding box area changes over time for a given sperm head.

In most of the labeled videos, every frame contains bounding box information (1,470 frames per video on average). The video titled video_23 has 174 frames without spermatozoa. Furthermore, some videos were recorded at different frame rates (video_35 and video_52 have 1,440 frames in total, and video_82 has 1,500 frames in total). The distributions of the bounding box center coordinates are shown in Fig. 4a–c, and the 2D histograms of the heights and widths of the bounding boxes are shown in Fig. 4d–f. Fig. 4a shows that the bounding boxes are spread across the whole video frame, with a somewhat higher concentration in the upper left. According to Fig. 4d, the variation in bounding box size is quite small.
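Because every annotated clip lasts 30 seconds, the frame totals above translate directly into frame rates. The short sketch below only performs this division; the helper name is an illustrative assumption.

```python
CLIP_SECONDS = 30  # fixed clip duration stated for the annotated videos


def frames_per_second(total_frames: int, clip_seconds: float = CLIP_SECONDS) -> float:
    """Frame rate implied by a frame total over a fixed clip duration."""
    return total_frames / clip_seconds


for name, frames in {"video_35": 1440, "video_52": 1440, "video_82": 1500}.items():
    print(name, frames_per_second(frames))
```

Under this reading, 1,440 frames correspond to 48 FPS and 1,500 frames to 50 FPS, while the average of 1,470 frames corresponds to 49 FPS.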

Fig. 4

Statistics about bounding box coordinates and sizes. (a) 2D histogram of the center coordinates of the bounding boxes of the sperm class. (b) 2D histogram of the center coordinates of the bounding boxes of the cluster class. (c) 2D histogram of the center coordinates of the bounding boxes of the small or pinhead class. (d) 2D histogram of the height and width (normalized values) of the bounding boxes of the sperm class. (e) 2D histogram of the height and width (normalized values) of the bounding boxes of the cluster class. (f) 2D histogram of the height and width (normalized values) of the bounding boxes of the small or pinhead class.

In addition to the bounding box details, several .csv files taken from the VISEM dataset8 are provided with additional information. These files include general information about the participants (participant_related_data_Train.csv), the standard semen analysis results (semen_analysis_data_Train.csv), serum levels of sex hormones (sex_hormones_Train.csv: measured from blood samples), serum levels of the fatty acids in the phospholipids (fatty_acids_serum_Train.csv: measured from blood samples), and fatty acid levels of the spermatozoa (fatty_acids_spermatoza_Train.csv). The content of these files is summarized in Table 3.

Table 3 Summary of content of CSV files included in the VISEM-Tracking dataset.

Technical Validation

We divided the 20 videos into a training dataset of 16 videos and a validation dataset of 4 videos (the video IDs of the validation dataset are provided in the GitHub repository). The training dataset was used to train baseline deep learning (DL) models, and the validation dataset was used to evaluate them. YOLOv521 was selected as the baseline DNN model for sperm detection. This version of YOLO comes in five different model sizes, namely YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (XLarge). All models were trained on the training dataset with the number of classes set to 3, covering the normal sperm, cluster, and small or pinhead categories.
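A minimal sketch of such a split is given below. The validation IDs are the ones named in the caption of Fig. 5; the folder naming pattern `video_<id>` is inferred from the video titles mentioned in the Data Records section, and the function name is an illustrative assumption.

```python
# Validation video IDs as listed in the Fig. 5 caption.
VALIDATION_IDS = {52, 54, 60, 82}


def split_videos(folder_names):
    """Split folder names of the form 'video_<id>' into (train, validation) lists."""
    train, validation = [], []
    for name in sorted(folder_names):
        video_id = int(name.rsplit("_", 1)[1])
        (validation if video_id in VALIDATION_IDS else train).append(name)
    return train, validation


train, validation = split_videos(["video_23", "video_35", "video_52", "video_82"])
print(train, validation)
```

Splitting by video rather than by frame keeps all frames of one patient in the same subset, which avoids near-duplicate frames leaking from training into validation.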

In the training process, we provided the extracted frames and the corresponding bounding box details to the YOLOv5 models. We set the image size parameter to 640, the batch size to 16, and the number of epochs to 300. All other hyperparameters, such as the learning rate and the optimizer, were kept at the YOLOv5 default values. Furthermore, all experiments were performed on two NVIDIA GeForce RTX 3080 graphics processing units with a total of 20 GB of memory (10 GB per GPU) and an AMD Ryzen 9 3950X 16-core processor. The best model was selected based on its performance on the validation dataset.
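For orientation, the settings above map onto the command-line options of the YOLOv5 training script roughly as sketched below. The sketch only assembles the argument list and does not start a training run. The dataset configuration file name `visem_tracking.yaml` is hypothetical, and starting from the released `.pt` weights is an assumption, since the text does not state whether pretrained weights were used.

```python
def build_train_command(model: str = "yolov5s",
                        data_yaml: str = "visem_tracking.yaml"):
    """Assemble a YOLOv5 train.py invocation with the reported settings.

    data_yaml is a hypothetical dataset config that would list the training
    and validation image folders and the three class names.
    """
    return [
        "python", "train.py",
        "--img", "640",          # image size parameter
        "--batch-size", "16",    # batch size
        "--epochs", "300",       # number of epochs
        "--data", data_yaml,
        "--weights", f"{model}.pt",  # assumption: initialize from released weights
    ]


for size in ["yolov5n", "yolov5s", "yolov5m", "yolov5l", "yolov5x"]:
    print(" ".join(build_train_command(size)))
```

All options not listed are left unset so that the script falls back to its defaults, in line with the description above.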

Precision, recall, mAP_0.5, mAP_0.5:0.95, and fitness value, as calculated by Jocher et al.21, were used to measure the performance of different YOLOv5 models. The results are listed in Table 4, showing that YOLOv5l performs best with a fitness value of 0.0920. The fitness value presented in the table is calculated using the following equation, which is used in the YOLOv5 implementation to compare model performance.

$$Fitness\_value=\left(0.1\times mAP\_0.5+0.9\times mAP\_0.5{:}0.95\right)$$
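The weighted sum can also be written as a small function. This is a sketch of the formula only; the function and argument names are illustrative, and the input values in the example are made up to show the arithmetic and are not results from Table 4. In the YOLOv5 implementation the second term is the mAP averaged over IoU thresholds from 0.5 to 0.95 (mAP_0.5:0.95), and precision and recall receive zero weight.

```python
def fitness(map_50: float, map_50_95: float) -> float:
    """Weighted combination of mAP_0.5 and mAP_0.5:0.95 (weights 0.1 and 0.9)."""
    return 0.1 * map_50 + 0.9 * map_50_95


# Illustrative values only: 0.1 * 0.5 + 0.9 * 0.25 = 0.275
print(fitness(0.5, 0.25))
```

Because 90% of the weight falls on mAP_0.5:0.95, the fitness value mostly rewards tightly localized boxes rather than loose detections.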
Table 4 Different evaluation metrics and corresponding values with the five different YOLOv5 models (mAP = mean average precision).

Samples for visual comparisons of predictions from the five models are shown in Fig. 5. These predictions are from the first frame of the selected four validation videos.

Fig. 5

Predicted bounding boxes from the different models created with YOLOv5 for the first frames of the validation data. Video IDs 82, 60, 54 and 52 were used as validation videos.

Usage Notes

To the best of our knowledge, this is the first dataset containing long video clips of human semen samples (30 seconds at 45–50 FPS) that are manually annotated with bounding boxes for each spermatozoon. The performance of our DL experiments for detecting spermatozoa indicates that the training data provided in this dataset is diverse and can be used to train advanced DL models.

The data enables different future research directions. For example, it can be used to prepare more labeled data using strategies like semi-supervised learning. Researchers can use the labeled data to train a DL model (such as YOLOv5) and predict bounding boxes for the unlabeled data. Then, those pseudo-labeled data can be passed to the experts in the domain to verify them. This method can make the data annotation process easier and produce accurate labeled datasets faster than manual annotations.
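The filtering step of such a pseudo-labeling loop might look like the sketch below. It assumes that a trained detector has already produced predictions in normalized YOLO coordinates with a confidence score; the `Prediction` type, the function name, and the threshold of 0.5 are illustrative assumptions, not part of the dataset or of YOLOv5.

```python
from typing import Iterable, List, NamedTuple


class Prediction(NamedTuple):
    class_id: int
    x_center: float   # normalized to [0, 1]
    y_center: float
    width: float
    height: float
    confidence: float


def to_pseudo_labels(predictions: Iterable[Prediction],
                     min_confidence: float = 0.5) -> List[str]:
    """Keep confident predictions and format them as YOLO label lines."""
    lines = []
    for p in predictions:
        if p.confidence < min_confidence:
            continue  # low-confidence boxes are left for manual annotation
        lines.append(
            f"{p.class_id} {p.x_center:.6f} {p.y_center:.6f} "
            f"{p.width:.6f} {p.height:.6f}"
        )
    return lines


preds = [
    Prediction(0, 0.5, 0.5, 0.1, 0.1, confidence=0.9),
    Prediction(2, 0.2, 0.3, 0.05, 0.05, confidence=0.3),
]
print(to_pseudo_labels(preds))
```

The resulting lines would still need to be reviewed by domain experts before being treated as ground truth, as described above.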

Sperm tracking is necessary to determine sperm dynamics and motility levels. We provide tracking IDs to identify the same spermatozoon throughout the video. Using this data, one can train sperm tracking algorithms, and the results of these algorithms can help to identify different biomedically relevant parameters such as velocity and kinematics. Additionally, it is difficult to determine which spermatozoa in a semen sample have the highest motility, which is of clinical importance. The dataset can be used to train algorithms for finding the spermatozoa with the highest motility.
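As an example of deriving such parameters, the sketch below computes three common CASA-style quantities from one track of bounding box centers: curvilinear velocity (VCL, path length over time), straight-line velocity (VSL, start-to-end distance over time), and linearity (LIN = VSL / VCL). The function name and the `microns_per_pixel` scale are assumptions; without a calibrated scale the velocities are in pixels per second. Centers are expected in pixel coordinates, so normalized YOLO values would first need to be multiplied by the frame width and height.

```python
import math


def track_kinematics(centers, fps: float, microns_per_pixel: float = 1.0):
    """Compute VCL, VSL, and LIN for one track.

    centers: list of (x, y) positions in pixels, one per consecutive frame.
    microns_per_pixel: hypothetical calibration factor (1.0 keeps pixel units).
    """
    if len(centers) < 2:
        raise ValueError("a track needs at least two positions")
    path_length = sum(math.dist(a, b) for a, b in zip(centers, centers[1:]))
    straight_length = math.dist(centers[0], centers[-1])
    duration = (len(centers) - 1) / fps  # seconds covered by the track
    vcl = path_length * microns_per_pixel / duration
    vsl = straight_length * microns_per_pixel / duration
    lin = vsl / vcl if vcl > 0 else 0.0
    return {"VCL": vcl, "VSL": vsl, "LIN": lin}


# Illustrative zig-zag track over three frames at 50 FPS.
print(track_kinematics([(0, 0), (3, 4), (6, 0)], fps=50))
```

Ranking tracks by VCL or VSL would be one simple way to shortlist the most motile spermatozoa in a clip.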

In addition to the sperm tracking annotations, we also provide additional metadata for the sperm samples. Using this data, researchers can train models that combine the metadata with the tracking information to obtain more accurate predictions of, for example, motility levels.

There is also a growing interest in exploring synthetic data to address data deficiencies and the time-consuming and costly process of data annotation in the medical domain22. Researchers can use the dataset to train deep generative models23,24 to generate synthetic data, which can then be used to train other ML models and achieve better generalization. Furthermore, one can use our dataset to train conditional deep generative models25,26 that generate synthetic sperm data together with the corresponding ground truth (bounding boxes), which would reduce the cost of obtaining annotated data.

Another hot topic in AI and medicine is simulating biological organs or creating digital twins. The dataset can, for example, be used to extract features of sperm motility to simulate spermatozoa and their behaviors. Simulations of spermatozoa can potentially lead to more accurate models than current solutions in the field.