VISEM-Tracking, a human spermatozoa tracking dataset

Manual assessment of sperm motility requires microscopy observation, which is challenging due to the fast-moving spermatozoa in the field of view. To obtain correct results, manual evaluation requires extensive training. Therefore, computer-aided sperm analysis (CASA) is increasingly used in clinics. Despite this, more data are needed to train supervised machine learning approaches in order to improve the accuracy and reliability of sperm motility and kinematics assessment. In this regard, we provide a dataset called VISEM-Tracking with 20 video recordings of 30 seconds each (comprising 29,196 frames) of wet semen preparations, with manually annotated bounding-box coordinates and a set of sperm characteristics analyzed by experts in the domain. In addition to the annotated data, we provide unlabeled video clips for easy access and analysis of the data via methods such as self- or unsupervised learning. As part of this paper, we present baseline sperm detection performance using the YOLOv5 deep learning (DL) model trained on the VISEM-Tracking dataset. The results show that the dataset can be used to train complex DL models to analyze spermatozoa.


Background & Summary
Machine learning (ML) is increasingly being used to analyze videos of spermatozoa under a microscope for developing computer-assisted sperm analysis (CASA) systems 1,2 . In the last few years, several studies have investigated the use of deep neural networks (DNNs) to automatically determine specific attributes of a semen sample, like predicting the proportion of progressive, non-progressive, and immotile spermatozoa [3][4][5][6][7] . However, a major challenge with using ML for semen analysis is the general lack of data for training and validation. Only a few open labeled datasets exist (Table 1), most of which focus on still frames of fixed and stained spermatozoa or on very short sequences intended for analyzing sperm morphology.
In this paper, we present a multi-modal dataset containing videos of spermatozoa with the corresponding manually annotated bounding boxes (localization) and additional clinical information about the sperm providers from the original study 15 . This dataset is an extension of our previously published dataset VISEM 15 , which included videos of spermatozoa labeled with quality metrics following the World Health Organization (WHO) recommendations 17 .
Several datasets related to spermatozoa already exist. For example, Ghasemian et al. 8 have published an open sperm dataset called HSMA-DS (Human Sperm Morphology Analysis DataSet) with normal and abnormal sperm cells. Experts annotated different features, namely vacuole, tail, midpiece, and head abnormality. Each feature is marked with a binary label: 1 for abnormal and 0 for normal. In total, there are 1,457 sperm cells for morphology analysis. The sperm cell images were captured at ×400 and ×600 magnification. The Modified Human Sperm Morphology Analysis Dataset (MHSMA) 9 consists of 1,540 images cropped from the HSMA-DS dataset 8 . This dataset was collected for analyzing different parts of sperm cells (morphology). The maximum image size in the dataset is 128 × 128 pixels.
The HuSHEM 10 and SCIAN-MorphoSpermGS 11 datasets consist of images of sperm heads captured from fixed and stained semen smears. The main purpose of these datasets is sperm morphology classification into five categories, namely normal, tapered, pyriform, small, and amorphous. SMIDS 12 is another dataset consisting of 3,000 images cropped from 200 stained ocular images from 17 subjects aged 19 to 39 years. Of the 3,000 images, 2,027 patches were manually annotated as normal or abnormal, and another 973 samples were classified as non-sperm using spatial-based automated features. McCallum et al. 13 have published a similar dataset with bright-field sperm cells of six healthy participants within 1,064 cropped images. The main purpose of that dataset is to find correlations between sperm cells and DNA quality. However, none of these datasets provide spermatozoa motility and kinetics features. Our dataset contains 656,334 annotated objects with tracking details. More details about our dataset are given below.
Chen et al. 14 introduced a sperm dataset called SVIA (Sperm Videos and Images Analysis dataset), which contains 101 short video clips (1 to 3 seconds) and corresponding manually annotated objects. The dataset is divided into three subsets, namely subset-A, B, and C. Subset-A contains 101 video clips (30 FPS) with 125,000 object locations and corresponding categories. Subset-B contains 10 videos with 451 ground-truth segmentation masks, and subset-C consists of cropped sperm images for classification into two categories (impurity images and sperm images). The provided video clips are very short compared to VISEM-Tracking. Our dataset 16 contains 7× more annotated video frames. In addition, VISEM-Tracking contains 2.3× more annotated objects compared to SVIA.
VISEM-Tracking offers annotated bounding boxes and sperm tracking information, making it more valuable for training supervised ML models than the original VISEM dataset 15 , which lacks these annotations. This additional data enables a variety of research possibilities in both biology (e.g., comparing with CASA tracking) and computer science (e.g., object tracking, integrating clinical and tracking data). Unlike other datasets, VISEM-Tracking's motility features facilitate sperm identification within video sequences, resulting in a richer and more detailed dataset that supports novel research directions. Potential applications include sperm tracking, classifying spermatozoa based on motility, and analyzing movement patterns. To the best of our knowledge, this is the first open dataset of its kind.

Methods
The videos for this dataset were originally obtained to study overweight and obesity in the context of male reproductive function 18,19 . In the study, male participants aged 18 years or older were recruited between 2008 and 2013 from the normal population. Further details on the recruitment can be found in 15 . The study was approved by the Regional Committee for Medical and Health Research Ethics, South East, Norway (REK number: 2008/3957). All participants provided written informed consent and agreed to the publication of the data. The original project was finished in December 2017, and all data was fully anonymized.
The samples to be recorded were placed on a heated microscope stage (37 °C) and examined under 400× magnification using an Olympus CX31 microscope. The videos were recorded with a microscope-mounted UEye UI-2210C camera made by IDS Imaging Development Systems, Germany. According to the WHO recommendations 17 , a light microscope equipped with phase-contrast optics is necessary for all examinations of unstained preparations of fresh semen. The videos are saved as AVI files. Motility assessment was performed based on the videos following the WHO recommendations 17 .
The bounding box annotation was performed by data scientists in close collaboration with researchers in the field of male reproduction. The data scientists labeled each video using the tool LabelBox (https://labelbox.com), and the annotations were then verified by three biologists to ensure that they were correct. Moreover, in addition to the per-sperm tracking annotation, we also provide additional labels per spermatozoon: 'normal sperm', 'pinhead', and 'cluster'. The pinhead category consists of spermatozoa with abnormally small black heads within the view of the microscope. The cluster category consists of several spermatozoa grouped together. Sample annotations are presented in Figure 1. The red boxes represent normal spermatozoa, which constitute the majority of this dataset and are also biologically most relevant. The green boxes represent sperm clusters, where a few spermatozoa are grouped so closely that it is hard to annotate them separately. The blue boxes represent small or pinhead spermatozoa, which are smaller than normal spermatozoa and have very small heads compared to a normal sperm head.

Data Records
VISEM-Tracking is available at Zenodo (https://zenodo.org/record/7293726) 16 , and the license for the data is Creative Commons Attribution 4.0 International (CC BY 4.0). The dataset mainly contains 20 videos (collected from 20 different patients), each with a fixed duration of 30 seconds, with the corresponding annotated bounding boxes. The 20 videos were chosen to be as different as possible from the other videos in the dataset in order to obtain as many diverse tracking samples as possible. Since each video from the original dataset lasts for more than 30 seconds, we also provide, in addition to the annotated video clips, the remaining footage as 166 30-second video clips for the 20 annotated videos and 336 30-second video clips for all unlabeled videos of the VISEM dataset 15 that were not used to provide tracking information. This was done to make the dataset easy to use for future studies that aim to explore more advanced methods such as semi- or self-supervised learning 20 .
A length of 30 seconds was chosen to make it easier to annotate and process the video files. These videos can also be used for a possible extension of the tracking data in the future. The splitting process of the long videos is presented in Figure 2. More details about the dataset itself are summarized in Table 2.
The folder containing annotated videos has 20 sub-folders, one per video. Each video folder contains a folder with the extracted frames of the video, a folder with the bounding box labels of each frame, and a folder with the bounding box labels and the corresponding tracking identifiers. In addition, a complete video file (.mp4) is provided in the same folder. All bounding box coordinates are given in the YOLO 21 format. The folder containing bounding box details with tracking identifiers has '.txt' files with unique tracking ids to identify individual spermatozoa throughout the video. It is worth noting that the area of the bounding boxes of the same sperm changes over time depending on its position and movement in the videos, as depicted in Figure 3. Moreover, the text files contain class labels, 0: normal sperm, 1: sperm clusters, and 2: small or pinhead sperm. Additionally, the sperm_counts_per_frame.csv file provides the per-frame sperm count, cluster count, and small-or-pinhead count.
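To illustrate how such per-frame label files can be consumed, the sketch below parses one line of a tracking label file and converts the normalized YOLO coordinates to pixel coordinates. The column order used here (class, x_center, y_center, width, height, track id) is an assumption based on the YOLO format described above and should be verified against the actual files.

```python
from dataclasses import dataclass

# Class ids as described in the dataset: 0 = normal sperm,
# 1 = sperm cluster, 2 = small or pinhead sperm.
CLASS_NAMES = {0: "sperm", 1: "cluster", 2: "small_or_pinhead"}


@dataclass
class TrackedBox:
    class_id: int
    x_center: float  # normalized [0, 1], YOLO convention
    y_center: float
    width: float
    height: float
    track_id: int


def parse_tracking_line(line: str) -> TrackedBox:
    """Parse one line of a per-frame tracking label file.

    Assumed column order: class x_center y_center width height track_id.
    """
    cls, x, y, w, h, tid = line.split()
    return TrackedBox(int(cls), float(x), float(y),
                      float(w), float(h), int(tid))


def to_pixels(box: TrackedBox, img_w: int, img_h: int) -> tuple:
    """Convert normalized YOLO center coordinates to pixel corner coordinates."""
    x1 = (box.x_center - box.width / 2) * img_w
    y1 = (box.y_center - box.height / 2) * img_h
    x2 = (box.x_center + box.width / 2) * img_w
    y2 = (box.y_center + box.height / 2) * img_h
    return x1, y1, x2, y2
```

Because the coordinates are normalized, the same label file can be applied to frames extracted at any resolution, which is the usual motivation for the YOLO format.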
In most of the labeled videos, each frame contains bounding box information (1,470 frames on average per video). The video titled video_23 has 174 frames without spermatozoa. Furthermore, some videos were recorded at different frame rates (videos video_35 and video_52 have 1,440 total frames, and video video_82 has 1,500 total frames). The distribution of the bounding boxes is shown in Figure 4a, and the 2D histogram of the height and width of the bounding boxes is shown in Figure 4b. Figure 4a shows that the bounding boxes tend to be evenly distributed across the video frames, with a higher concentration in the upper left of the frames. According to Figure 4b, the variation in bounding box size is quite small. In addition to the bounding box details, several .csv files taken from the VISEM dataset 15 are provided with additional information. These files include general information about participants (participant_related_data_Train.csv), the standard semen analysis results (semen_analysis_data_Train.csv), serum levels of sex hormones (sex_hormones_Train.csv: measured from blood samples), serum levels of the fatty acids in the phospholipids (fatty_acids_serum_Train.csv: measured from blood samples), and fatty acid levels of spermatozoa (fatty_acids_spermatoza_Train.csv). The content of these files is summarized in Table 3.
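These CSV files can be read with standard tooling. As a minimal sketch, the snippet below computes a per-frame average from an in-memory excerpt mimicking sperm_counts_per_frame.csv; the column names used here are hypothetical and should be checked against the actual file header.

```python
import csv
import io

# Hypothetical excerpt mimicking sperm_counts_per_frame.csv; the exact
# column names are assumptions, not taken from the file itself.
sample = """frame,sperm_count,cluster_count,small_or_pinhead_count
0,31,2,1
1,30,2,1
2,29,3,0
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Average number of normal sperm per frame in this excerpt.
avg_sperm = sum(int(r["sperm_count"]) for r in rows) / len(rows)
```

The same pattern applies to the participant-level files, which can then be joined with the per-video tracking data via the video/participant identifier.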

Technical Validation
We divided the 20 videos into a training dataset of 16 videos and a validation dataset of 4 videos (the video IDs of the validation dataset are provided in the GitHub repository). The training set was used to train baseline deep learning (DL) models, and the validation dataset was used to evaluate them. YOLOv5 21 was selected as the baseline sperm detection DNN model. This version of YOLO comes in five different model sizes, namely YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (XLarge). All models were trained on the training dataset with the number of classes set to 3, covering the normal sperm, cluster, and small or pinhead categories.
In the training process, we provided the extracted frames and the corresponding bounding box details to the YOLOv5 models. We set the image size parameter to 640, the batch size to 16, and the number of epochs to 300. All other hyperparameters, such as the learning rate and optimizer, were kept at the default values of YOLOv5 (https://github.com/ultralytics/yolov5). Furthermore, all experiments were performed on two NVIDIA GeForce RTX 3080 graphics processing units with a total of 20GB memory (10GB per GPU) and an AMD Ryzen 9 3950X 16-Core Processor. Precision, recall, mAP_0.5, mAP_0.5:0.95, and the fitness value, as calculated by Jocher et al. 21 , were used to measure the performance of the different YOLOv5 models. The results are listed in Table 4, showing that YOLOv5l performs best with a fitness value of 0.0920. The fitness value presented in the table is calculated using the following equation, which is used in the YOLOv5 implementation to compare model performance.
Fitness_value = 0.1 × mAP_0.5 + 0.9 × mAP_0.5:0.95
Samples for visual comparisons of predictions from the five models are shown in Figure 5. These predictions are from the first frame of each of the four validation videos.
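This weighted combination can be sketched directly in code; the function name is ours, but the weighting follows the equation above.

```python
def yolov5_fitness(map_50: float, map_50_95: float) -> float:
    """Weighted fitness used to rank models:
    0.1 * mAP@0.5 + 0.9 * mAP@0.5:0.95.

    The heavy weight on mAP@0.5:0.95 rewards models whose boxes are
    accurate across strict IoU thresholds, not just loosely localized.
    """
    return 0.1 * map_50 + 0.9 * map_50_95
```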

Usage Notes
To the best of our knowledge, this is the first dataset containing long human spermatozoa video clips (30 seconds at 45-50 FPS) that are manually annotated with bounding boxes for each spermatozoon. The performance of our DL experiments for detecting spermatozoa shows that the training data provided in this dataset is diverse and can be used to train advanced DL models.
The data enables different future research directions. For example, it can be used to prepare more labeled data using strategies like semi-supervised learning. Researchers can use the labeled data to train a DL model (such as YOLOv5) and predict bounding boxes for the unlabeled data. The pseudo-labeled data can then be passed to domain experts for verification. This approach can make the data annotation process easier and produce accurately labeled datasets faster than manual annotation.
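The selection step in such a pseudo-labeling pipeline can be sketched as follows, assuming detections are given as (class_id, confidence, box) tuples; the representation and the threshold value are illustrative, not part of the dataset or any specific library.

```python
from typing import List, Tuple

# A detection: (class_id, confidence, (x_center, y_center, width, height)).
# This tuple layout is an assumption chosen for the sketch.
Detection = Tuple[int, float, Tuple[float, float, float, float]]


def select_pseudo_labels(detections: List[Detection],
                         conf_threshold: float = 0.8) -> List[Detection]:
    """Keep only high-confidence predicted boxes as pseudo-labels,
    to be forwarded to domain experts for verification."""
    return [d for d in detections if d[1] >= conf_threshold]
```

Raising the threshold trades label coverage for label reliability, which in turn reduces the verification burden on the experts.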
There is also a growing interest in exploring synthetic data to address data deficiencies and the time-consuming and costly data annotation problems in the medical domain 22 . Researchers can use the dataset to train deep generative models 23,24 to generate synthetic data, which can then be used to train other ML models and achieve better generalization performance. Furthermore, one can train conditional deep generative models 25,26 on our dataset to generate synthetic sperm data with the corresponding ground truth (bounding boxes), thereby overcoming the cost of obtaining annotated data.
Another hot topic in AI and medicine is simulating biological organs or creating digital twins.The dataset can, for example, be used to extract features of sperm motility to simulate spermatozoa and their behaviors.Simulations of spermatozoa can potentially lead to more accurate models than current solutions in the field.

Figure 1 .
Figure 1. Video frames of wet semen preparations with corresponding bounding boxes. Top: large images showing the different bounding box classes, red - sperm, green - sperm cluster, and blue - small or pinhead sperm. Bottom: different sperm concentration levels from high to low (left to right).

Figure 2 .
Figure 2. Splitting videos into 30-second clips. Green represents the split used to manually annotate sperm with bounding boxes. Orange represents the remaining 30-second splits included in the unlabeled dataset. Purple represents the final part of a video that is shorter than 30 seconds; we exclude these endings from the dataset to maintain the consistency of 30-second clips.

Figure 3 .
Figure 3. Changing bounding box area over time for the same sperm head.

Figure 4.
Figure 4. Statistics about bounding box coordinates and area. (a) 2D histogram of the center coordinates of the bounding boxes of the sperm class. (b) 2D histogram of the center coordinates of the bounding boxes of the cluster class. (c) 2D histogram of the center coordinates of the bounding boxes of the small or pinhead class. (d) 2D histogram of the height and width (normalized values) of the bounding boxes of the sperm class. (e) 2D histogram of the height and width (normalized values) of the bounding boxes of the cluster class. (f) 2D histogram of the height and width (normalized values) of the bounding boxes of the small or pinhead class.

Figure 5.
Figure 5. Predicted bounding boxes from the different YOLOv5 models for the first frames of the validation data. Video IDs 82, 60, 54, and 52 were used as validation videos.

Table 1 .
Overview of existing sperm datasets.

Table 2 .
Summary of quantitative information about the VISEM-Tracking dataset.

Table 3 .
Summary of content of CSV files included in the VISEM-Tracking dataset.

Table 4 .
Different evaluation metrics and corresponding values with the five different YOLOv5 models.