A dataset for medical instructional video classification and question answering

This paper introduces a new challenge and datasets to foster research toward designing systems that can understand medical videos and provide visual answers to natural language questions. We believe medical videos may provide the best possible answers to many first aid, medical emergency, and medical education questions. Toward this, we created the MedVidCL and MedVidQA datasets and introduce the tasks of Medical Video Classification (MVC) and Medical Visual Answer Localization (MVAL), two tasks that focus on cross-modal (medical language and medical video) understanding. The proposed tasks and datasets have the potential to support the development of sophisticated downstream applications that can benefit the public and medical practitioners. Our datasets consist of 6,117 fine-grained annotated videos for the MVC task and 3,010 questions and answer timestamps from 899 videos for the MVAL task. These datasets have been verified and corrected by medical informatics experts. We have also benchmarked each task with the created MedVidCL and MedVidQA datasets and propose multimodal learning methods that set competitive baselines for future research.


Introduction
One of the key goals of artificial intelligence (AI) is developing a multimodal system that facilitates communication with the visual world (i.e., images, videos) using a natural language query. In recent years, significant progress has been achieved due to the introduction of large-scale language-vision datasets and the development of efficient deep neural techniques that bridge the gap between language and visual understanding. Improvements have been made in numerous vision-and-language tasks, such as visual captioning [1,2], visual question answering [3], and natural language video localization [4,5,6]. In recent years there has been an increasing interest in video question-answering [7,8] tasks, where given a video, the systems are expected to retrieve the answer to a natural language question about the content in the video. We argue that only predicting the natural language answer does not seem to reflect the real world, where people interact through natural language questions and expect to localize the moments from the videos to answer their questions. The majority of the existing work on video question answering (VQA) focuses on (a) open-domain applications by building the VQA datasets [9,8,10] consisting of movies, TV shows, and games, and (b) retrieval [7,8,9] of the natural language answers. With increasing interest in AI to support clinical decision-making and improve patient engagement [11], there is a need to explore such challenges and develop efficient algorithms for medical language-video understanding.
The recent surge in the availability of online videos has changed the way people acquire information and knowledge. Many people prefer instructional videos to teach or learn how to accomplish a particular task with a series of step-by-step procedures. Medical instructional videos are well suited to delivering key information effectively and efficiently through simultaneous visual and verbal communication. Consider the following medical question: "How to place a tourniquet in case of fingertip avulsions?" The textual answer to this question will be hard to understand and act upon without visual aid. To provide visual aid, we first need to identify the relevant video that is medical and instructional in nature. Once we find a relevant video, it is often the case that the entire video cannot be considered as the answer to the given question. Instead, we want to refer to a particular temporal segment, or a sequence of moments, from the video, where the answer is being shown or the explanation is being illustrated. Straightforward moment retrieval via an action, object, or attribute keyword may not uniquely identify the relevant temporal segment, which contains the visual answer to the question. A more natural way to refer to the appropriate temporal segment (cf. Fig. 1) is via a natural language question and video segment description, which requires a fine-grained semantic understanding of the video segment, segment description, and question.
Figure 1: A sample example of a health-related question and its visual answer (temporal segment) from the video.
We introduce the Medical Video CLassification (MedVidCL) and Medical Video Question Answering (MedVidQA) datasets for medical instructional video classification and question answering. The MedVidCL dataset contains a collection of 6,117 videos annotated into 'medical instructional', 'medical non-instructional' and 'non-medical' classes. We adopted a two-step approach to construct the MedVidCL dataset. In the first step, we utilize the videos annotated by health informatics experts to train a machine learning model that assigns a given video to one of the three aforementioned classes. In the second step, we only use high-confidence videos and manually assess the model's predicted video category, updating the category wherever needed. The MedVidQA dataset contains a collection of 3,010 manually created health-related questions and timestamps as visual answers to those questions from trusted video sources, such as accredited medical schools with an established reputation, health institutes, health education, and medical practitioners. We have provided a schematic overview of building the MedVidQA and MedVidCL datasets in Fig. 2 and Fig. 3, respectively. We benchmarked the datasets by experimenting with multiple algorithms for video classification and video localization.

MedVidQA Data Creation
To create the MedVidQA dataset, we follow a systematic approach that involves the contributions of medical informatics experts at multiple stages. The detailed steps to build the dataset are as follows:
1. Extraction of Medical and Health-related Tasks from WikiHow: With an aim to obtain medical instructional videos that describe how to perform certain health-related activities, we first compile an extensive list of health-related activities using WikiHow, an online resource that contains 235,877 articles on how to do a certain task for a variety of domains ranging from computer and electronics to philosophy and religion, structured in a hierarchy. We start by extracting the medical and health-related tasks from WikiHow. We collected a total of 6,761 how-to tasks from the WikiHow 'Health' category.
2. Identification of Relevant Health-related Tasks: In the second step of the dataset creation, we filter the compiled collection extracted from WikiHow. A medical informatics expert reviews each WikiHow task and marks it as relevant or non-relevant from a medical instructional perspective. We keep only those tasks for which the textual answer will be hard to understand and act upon without visual aid, such as "how to relieve multiple myeloma pain" or "how to apply cold therapy". This procedure yields 1,677 medical and health-related instructional tasks.
3. Searching YouTube for the Relevant Videos: To acquire the relevant videos, we use the task name as a query to search YouTube via its Data API. In order to collect only the most relevant videos, we keep only the top-4 videos returned by YouTube. We deduplicate videos based on YouTube IDs because some videos may appear in multiple health-related instructional tasks. However, if a video was uploaded multiple times or edited and re-uploaded, the dataset may still contain duplicates.
4. [...] a doctor or a medical professional) demonstrates a procedure related to the medical query, or be a how-to video about the medical query. Medical instructional videos may target different levels of expertise, ranging from good Samaritans providing first aid to medical students learning a procedure or experienced clinicians interested in continuing medical education. For this study, we focus on the instructional medical videos that do not require medical education, i.e., the instructions should be at a level that is understandable and can be performed by a layperson. For example, if a nurse shows how to bandage a wound in an emergency, the video is instructional for a layperson. Conversely, if a doctor explains how to perform a specific surgical procedure, the video is instructional for professionals but not for the general public.
5. Formulating Instructional Question and Visual Answer from Videos: With the aim of formulating medical and health-related questions and localizing their visual answer in the videos, we start with the medical instructional videos annotated in the previous step of the dataset creation. A question is called instructional if the answer requires a step-by-step demonstration and description of the actions to be taken to achieve the goals. For many medical questions, the answer to the question is better shown than described in words, and the answer will be hard to understand and act upon without visual aid, e.g., "how to perform a physical examination for breast abnormalities?" Three medical informatics experts were asked to formulate the medical and health-related instructional questions by watching the given video and localizing the visual answer to those instructional questions by providing their timestamps in the video. We asked the annotators to create questions for which (1) answers are shown or the explanation is illustrated in the video, (2) the given video is necessary to answer the question, and (3) the answer cannot be given as text or spoken information without visual aid.
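The top-4 selection and ID-based deduplication in step 3 can be illustrated with a minimal Python sketch. It assumes the search results are already available as a mapping from task name to a ranked list of YouTube IDs; the function name, the example task results, and the placeholder IDs are hypothetical and not part of the original pipeline.

```python
from typing import Dict, List

TOP_K = 4  # the text keeps only the top-4 videos returned per task


def collect_unique_videos(search_results: Dict[str, List[str]],
                          top_k: int = TOP_K) -> List[str]:
    """Keep the top_k IDs of every task and drop IDs already seen for another task."""
    seen = set()
    unique_ids = []
    for ranked_ids in search_results.values():
        for video_id in ranked_ids[:top_k]:
            if video_id not in seen:
                seen.add(video_id)
                unique_ids.append(video_id)
    return unique_ids


# Hypothetical search results: task name -> ranked YouTube IDs (placeholders).
example_results = {
    "how to apply cold therapy": ["vidA", "vidB", "vidC", "vidD", "vidE"],
    "how to relieve multiple myeloma pain": ["vidB", "vidF"],
}
unique_video_ids = collect_unique_videos(example_results)
```

As noted in step 3, matching on IDs cannot detect the same footage uploaded under a different ID, so this kind of filter removes only exact repeats.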

MedVidCL Data Creation
A video question answering system that can provide the visual answers to medical or health-related instructional questions must have the capability to distinguish between medical instructional and non-instructional videos related to the user's questions. Towards building systems to perform this task efficiently and effectively, we created the MedVidCL dataset, which can be used to train a system that can distinguish amongst the medical instructional, medical non-instructional, and non-medical videos. The details of the approach to build the MedVidCL dataset are described as follows:
(A) Collecting Medical and Non-medical Videos: With an aim to reduce the annotation efforts, we follow a two-step process to build the MedVidCL dataset. In the first step, we seek a high-confidence opinion on the video category from a pre-trained video classifier. In the second step, medical informatics experts validate the video category predicted by the video classifier. In order to train the video classifier, we begin with collecting medical and non-medical videos. We utilized the human-annotated 1,016 medical instructional and 2,685 medical non-instructional videos from the MedVidQA dataset. To collect non-medical videos, we sampled 1,157 videos of non-medical categories (Food and Entertaining, Cars & Other Vehicles, Hobbies and Crafts, Home and Garden Tools, etc.) from the HowTo100M [12] dataset, which is a large-scale YouTube video dataset with an emphasis on instructional videos. We performed a stratified split on this collection and used 80% of the videos for training, 10% for validation, and 10% for testing.
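A stratified 80/10/10 split of this kind can be sketched in a few lines of Python. This is a minimal illustration that assumes the split is done independently within each class; the function name, the rounding rule, and the random seed are assumptions, since the text does not say which tool was used.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.8, val_frac=0.1, seed=0):
    """Split sample indices into train/val/test while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = int(round(train_frac * len(indices)))
        n_val = int(round(val_frac * len(indices)))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to the test set
    return train, val, test


# Class sizes taken from the text above.
labels = (["medical instructional"] * 1016
          + ["medical non-instructional"] * 2685
          + ["non-medical"] * 1157)
train_idx, val_idx, test_idx = stratified_split(labels)
```

Each class contributes roughly 80% of its videos to training, so the skewed class distribution is the same in all three subsets.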
(B) Building Video Classifier: We focus on only coarse-grained (medical instructional, medical non-instructional and non-medical) categorization of the videos as opposed to the fine-grained (walking, running, playing, standing, etc.) video classification [13] where the micro-level human activity recognition is the key to correctly categorizing the video. Therefore, we hypothesized that it is possible to predict the coarse-grained category from the natural language subtitles of the video. Towards this, we propose an ensemble classifier that aggregates the predictions of deep learning and statistical classifiers. We used the support vector machine (SVM) [14] as the statistical classifier in our ensemble learning setup, and we chose the pretrained BigBird [15] model as our deep learning classifier because BigBird is capable of accommodating the longer sequences that are ubiquitous in video subtitles. We utilized Hugging Face's implementation of the BigBird model. After extracting the English video subtitles using the Pytube library, we fine-tuned four different pretrained BigBird models on our training dataset, each with a maximum token length of 1024. We also used early stopping to prevent overfitting the model. Since our training dataset has a skewed distribution of the classes, we penalize the model in the training phase for misclassifying the minority class by setting a higher class weight, while at the same time reducing the weight for the majority class. For the class $c \in C$, we set the weight $w_c = \frac{N}{|C| \times N_c}$, where $C$ is the set of all classes in the dataset, $N$ is the total number of samples in the dataset, and $N_c$ is the number of samples associated with class $c$ in the dataset. We follow the Population Based Training (PBT) [16] mechanism to jointly optimize a population of models and their hyperparameters to maximize performance. PBT inherits the idea of exploitation and exploration from genetic algorithms. In PBT, each member of the population exploits (taking into account the performance of the whole population, a member can decide whether it should abandon the current solution and focus on a more promising one) and explores (considering the current solution and hyperparameters, it proposes new ones to better explore the solution space). Following this, we fine-tune BigBird with the PBT strategy (population size=5) and consider the top-2 performing members of the population as final models. We used two different PBT strategies to train the BigBird model. In one strategy, we penalize the model and call the top-2 members of the population BigBird 1 w/ weight and BigBird 2 w/ weight. In the other strategy, we train the BigBird models without penalizing them and call them BigBird 1 w/o weight and BigBird 2 w/o weight. We adopted the Linear SVC implementation with the default hyper-parameter settings to train the SVM classifier on our training dataset. We used majority voting over the predictions of all five (4 BigBird + SVM) models in our ensemble learning setting. We break ties with the prediction from the best-performing classifier. The detailed video classification results are shown in Table 1.
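The class-weight formula and the voting rule can be made concrete with a short Python sketch. It is illustrative only: the class counts are those of the full collection described in (A) rather than the training split, the model names are hypothetical, and the tie-break simply returns the label of whichever model is passed as `best_model`, which is one reading of the rule stated above.

```python
from collections import Counter


def class_weights(class_counts):
    """w_c = N / (|C| * N_c): rarer classes receive larger weights."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {c: total / (num_classes * n_c) for c, n_c in class_counts.items()}


def majority_vote(predictions, best_model):
    """Return the most frequent label; on a tie, use the best model's label."""
    counts = Counter(predictions.values())
    top_count = max(counts.values())
    winners = [label for label, count in counts.items() if count == top_count]
    if len(winners) == 1:
        return winners[0]
    return predictions[best_model]


counts = {
    "medical instructional": 1016,
    "medical non-instructional": 2685,
    "non-medical": 1157,
}
weights = class_weights(counts)

# Hypothetical predictions of the five ensemble members for one video.
votes = {
    "bigbird_1_weighted": "medical instructional",
    "bigbird_2_weighted": "medical instructional",
    "bigbird_1_unweighted": "medical non-instructional",
    "bigbird_2_unweighted": "medical non-instructional",
    "svm": "non-medical",
}
label = majority_vote(votes, best_model="bigbird_1_weighted")
```

With these counts the majority class (medical non-instructional) receives a weight below 1 and the two smaller classes receive weights above 1, which is the intended penalization.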
(C) Identification of Relevant Videos: We sampled a subset of videos from the large-scale HowTo100M and YouTube8M [17] datasets, and we only chose medical and health-related videos from a set of predefined categories marked as appropriate by medical informatics experts. This process yields a collection of 66,312 videos from the HowTo100M and YouTube8M datasets.
(D) Predicting Relevant Video Categories from the Video Classifier: We utilized the ensemble classifier to predict the category of the relevant videos. The ensemble classifier predicted 13,659 medical instructional videos, 5,611 medical non-instructional videos, and 47,042 non-medical videos from the collection of 66,312 relevant videos.
(E) Sampling High-Quality Videos and their Manual Assessment: In order to create a high-quality dataset, we only chose the videos for which the classifier confidence was high for a specific video category and filtered out the videos for which the ensemble classifier confidence was low. In the first step, we filtered out all the videos from the predicted medical instructional category for which the classifier confidence was below 0.8. A similar strategy was used for the medical non-instructional (confidence score below 0.9) and non-medical (confidence score below 0.99) categories. The second and final step involved the manual assessment of the classifier-predicted videos.
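The per-class confidence filter in step (E) amounts to a threshold lookup, sketched below in Python. The thresholds come from the text; the tuple layout, the function name, and the example predictions are assumptions, and the text does not specify how the ensemble confidence score itself is computed.

```python
# Per-class thresholds stated in step (E).
THRESHOLDS = {
    "medical instructional": 0.8,
    "medical non-instructional": 0.9,
    "non-medical": 0.99,
}


def keep_high_confidence(predictions, thresholds=THRESHOLDS):
    """Keep (video_id, label, confidence) triples that are not below the class threshold."""
    return [
        (video_id, label, confidence)
        for video_id, label, confidence in predictions
        if confidence >= thresholds[label]
    ]


# Hypothetical classifier outputs.
example_predictions = [
    ("v1", "medical instructional", 0.85),
    ("v2", "medical instructional", 0.79),
    ("v3", "non-medical", 0.95),
    ("v4", "medical non-instructional", 0.90),
]
kept = keep_high_confidence(example_predictions)
kept_ids = [video_id for video_id, _, _ in kept]
```

The stricter threshold for the non-medical class reflects the thresholds reported above; the retained videos still go through the manual assessment described as the final step.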

MedVidQA Analysis and Validation
In the first step of the MedVidQA dataset creation, we aim to identify and use only trustworthy videos and YouTube channels. A video is categorized as reliable if it belongs to a YouTube channel from any of the following sources: (a) accredited medical schools with established reputations, (b) health institutes, (c) health education, (d) hospitals, (e) medical professionals or experts discussing a particular health-related topic, or (f) medical professional appearances and discussions on news channels. We have annotated a total of 6,052 YouTube videos and categorized 1,016 as medical instructional, 2,685 as medical non-instructional, 2,076 as videos from an unreliable video/channel, 140 as non-medical videos, and 132 as videos that cannot be included in the dataset for other reasons. Four medical informatics experts annotated these videos. To measure the agreement, we sampled 100 videos from the whole collection and asked all the annotators to categorize them into either medical instructional or medical non-instructional categories. We computed the pair-wise inter-annotator agreement (Fig. 4) amongst them using Cohen's kappa coefficient [18], and we found strong agreement (average pair-wise kappa coefficient of 83.75%) amongst them.
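For reference, the pair-wise agreement statistic can be computed as in the following Python sketch of Cohen's kappa, averaged over all annotator pairs. The annotator names and the toy labels (MI for medical instructional, MNI for medical non-instructional) are hypothetical; only the formula is standard.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement from the two annotators' marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


def average_pairwise_kappa(annotations):
    """Mean kappa over all unordered pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(annotations[a], annotations[b]) for a, b in pairs) / len(pairs)


# Toy labels for four videos from three hypothetical annotators.
toy = {
    "annotator_1": ["MI", "MI", "MNI", "MNI"],
    "annotator_2": ["MI", "MNI", "MNI", "MNI"],
    "annotator_3": ["MI", "MI", "MNI", "MNI"],
}
kappa_12 = cohen_kappa(toy["annotator_1"], toy["annotator_2"])
mean_kappa = average_pairwise_kappa(toy)
```

Kappa is defined on a scale where 1 means perfect agreement and 0 means chance-level agreement, so a reported value such as 83.75 corresponds to 0.8375 on this scale.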
In the second step of MedVidQA creation, we focus on creating medical or health-related instructional questions. A total of three medical informatics experts formulated these questions and visual answers. This process yielded a total of 3,010 pairs of medical questions and their visual answers from 899 medical instructional videos totaling 95.71 hours. We split the videos into training (800), validation (49), and testing (50) sets. We have provided the detailed statistics in Table 3, Fig. 6, and Fig. 7. To validate the dataset, we sampled 50 videos and their question-answer pairs annotated by one annotator and asked another annotator to formulate question-answer pairs from the same videos. We first manually assessed whether both annotators had formulated semantically similar questions from the given video, and we then computed the absolute differences between answer timestamps for semantically similar questions formulated by both annotators. With the first assessment, we measured the number of instances where both annotators agreed in formulating semantically similar questions from the videos. The second assessment validates their agreement on providing precise and valid answer timestamps from the videos. We found that the two annotators formulated 93 and 96 questions, respectively, and 84 of them were semantically similar. We computed the average absolute difference (AAD) of the start and end timestamps of the visual answers. The AAD values for start and end timestamps are 2.53 and 3.37 seconds, respectively. Lower AAD values signify that both annotators consider almost the same answer timestamps whenever they create a semantically similar question. These assessments validate the quality of the MedVidQA dataset.
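The AAD computation is simple enough to state as code. The sketch below assumes timestamps are given as "MM:SS" strings and that the two annotators' answers have already been paired by semantically similar question; the example timestamps are hypothetical.

```python
def to_seconds(timestamp):
    """Convert an 'MM:SS' string to seconds."""
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)


def average_absolute_difference(times_a, times_b):
    """Mean absolute difference between paired timestamps, in seconds."""
    if len(times_a) != len(times_b) or not times_a:
        raise ValueError("timestamp lists must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(times_a, times_b)) / len(times_a)


# Hypothetical (start, end) answers from two annotators for the same two questions.
annotator_1 = [("01:05", "02:15"), ("01:01", "01:31")]
annotator_2 = [("01:07", "02:12"), ("01:00", "01:35")]

start_aad = average_absolute_difference(
    [to_seconds(s) for s, _ in annotator_1],
    [to_seconds(s) for s, _ in annotator_2],
)
end_aad = average_absolute_difference(
    [to_seconds(e) for _, e in annotator_1],
    [to_seconds(e) for _, e in annotator_2],
)
```

In this toy case the start and end AAD values are 1.5 and 3.5 seconds, the same order of magnitude as the values reported for the dataset.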

MedVidCL Analysis and Validation
To build the MedVidCL dataset, we chose human-annotated ('Medical Instructional', 'Medical Non-instructional' and 'Non-medical') videos from the MedVidQA dataset. We considered this set as the training set for the MedVidCL dataset.
To build the validation and test sets, we sampled high-confidence videos predicted by a video classifier. To further validate the dataset, we asked a medical informatics expert to review the video category predicted by the model. The expert was asked to correct the video category if the video classifier mislabeled it and to mark a video as non-relevant if there was no conclusive evidence in the video to assign it to any of the video classification categories. The final MedVidCL dataset contains 6,117 videos, amongst which 1,489 are medical instructional, 2,994 are medical non-instructional, and 1,634 are non-medical. We further removed those videos which have a duration longer than 20 minutes. We have provided the dataset's detailed statistics in Table 2.

MedVidQA Benchmarking
We benchmarked the MedVidQA dataset by performing a series of experiments using state-of-the-art natural language video localization approaches. We adopt a previously proposed architecture [19], which treats localization of the frames in a video as a span prediction task similar to answer span prediction [20,21] in text-based question answering. For a given input question and untrimmed video, we first extracted frames (16 frames per second) and obtained the RGB visual features $V = \{v_1, v_2, \ldots, v_n\} \in \mathbb{R}^{n \times d_v}$ using the 3D ConvNet which was pre-trained on the Kinetics dataset [22]. We also extracted the word representation of the question $\{w_1, w_2, \ldots, w_m\} \in \mathbb{R}^{m \times d_w}$ using GloVe embeddings [23]. As done before [19], we utilized the character embedding $\{c_1, c_2, \ldots, c_m\} \in \mathbb{R}^{m \times d_c}$ obtained from a convolutional neural network [24] to enrich the word representation and obtained the final question representation as $\{w_1 \oplus c_1, w_2 \oplus c_2, \ldots, w_m \oplus c_m\} \in \mathbb{R}^{m \times (d_w + d_c)}$, where $\oplus$ denotes concatenation. We encoded the question and video features using the feature encoder, which consists of four convolution layers, followed by a multi-head attention layer [25]. We call the video and question representations obtained from the feature encoder $V \in \mathbb{R}^{n \times d}$ and $Q \in \mathbb{R}^{m \times d}$, respectively. We use the attention flow mechanism [26] to capture the cross-modal interactions between video and question features.
The attention flow mechanism provides the question-aware video feature representation $\tilde{V} \in \mathbb{R}^{n \times d}$. The answers are located using the span predictor as discussed before [19]. In particular, it uses two unidirectional LSTMs: the first to predict the start timestamp and the other to predict the end timestamp of the answer.
[...] are the weight matrices and biases, respectively. Finally, the probability distributions of the start and end positions are computed by $P_s = \mathrm{softmax}(SC_s) \in \mathbb{R}^n$ and $P_e = \mathrm{softmax}(SC_e) \in \mathbb{R}^n$.
The network is trained by minimizing the sum of the negative log probabilities that the predicted distributions $P_s$ and $P_e$ assign to the true start and end answer positions, averaged over all samples in the batch. The network trained using the span prediction loss is called video span localization (VSL-BASE). We also experiment with the Query-Guided Highlighting (QGH) technique introduced previously [19] and call this new network VSL-QGH.
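The span prediction loss described above can be written out in plain Python as a minimal sketch. It takes per-frame start and end scores as given and does not model the LSTMs that produce them; the function names and the batch layout are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-frame scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_loss(batch):
    """Batch mean of -log P_s[true start] - log P_e[true end].

    Each batch item is (start_scores, end_scores, true_start, true_end).
    """
    losses = []
    for start_scores, end_scores, start, end in batch:
        p_start = softmax(start_scores)
        p_end = softmax(end_scores)
        losses.append(-math.log(p_start[start]) - math.log(p_end[end]))
    return sum(losses) / len(losses)


# With uniform scores over four frames, each term equals log(4).
uniform_loss = span_loss([([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], 1, 2)])
```

The loss decreases as the score mass concentrates on the annotated start and end frames, which is what drives the model toward the correct span.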
With the QGH technique, the target answer span is considered as the foreground and the rest of the video as the background. It extends the span of the foreground to cover its preceding and following video frames. The extension is controlled by the extension ratio α, a hyperparameter. An extended answer span aims to cover additional context and help the network focus on subtle differences between video frames. We use 300-dimensional GloVe embeddings and 50-dimensional character embeddings to obtain the word representation in both the VSL-BASE and VSL-QGH networks. We also use 1024-dimensional video features throughout the experiments and hidden state dimensions of 128 in both the LSTM and Transformer-based encoder. The VSL-BASE and VSL-QGH networks are trained using the AdamW optimizer [27] for 30 epochs with an initial learning rate of 0.0001. The best-performing models are chosen based on the performance (IoU=0.7) on the validation dataset.
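How the extension ratio α might widen the foreground can be sketched as follows. The rule used here, extending each side by α times the span length and clipping to the video boundaries, is an assumption for illustration; the text only states that α controls the extension, and the function names are hypothetical.

```python
def extend_span(start, end, num_frames, alpha):
    """Extend the foreground [start, end] by alpha * span length on each side.

    The exact extension rule is an assumption; indices are clipped to the video.
    """
    length = end - start + 1
    extension = int(round(alpha * length))
    new_start = max(0, start - extension)
    new_end = min(num_frames - 1, end + extension)
    return new_start, new_end


def highlight_labels(start, end, num_frames, alpha):
    """Binary per-frame labels: 1 for (extended) foreground, 0 for background."""
    new_start, new_end = extend_span(start, end, num_frames, alpha)
    return [1 if new_start <= i <= new_end else 0 for i in range(num_frames)]


extended = extend_span(start=40, end=59, num_frames=100, alpha=0.25)
labels_qgh = highlight_labels(start=40, end=59, num_frames=100, alpha=0.25)
```

Under this assumed rule, a 20-frame answer with α = 0.25 gains five frames of context on each side, and the resulting binary labels would serve as targets for the foreground/background prediction.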
Benchmarking Metrics: We have evaluated the results using (a) Intersection over Union (IoU), which measures the proportion of overlap between the system-predicted answer span and the ground truth answer span, and (b) mIoU, which is the average IoU over all testing samples. Following a prior protocol [28], we have used "R@n, IoU = µ", which denotes the percentage of questions for which, out of the top-n retrieved temporal segments, at least one predicted temporal segment has an IoU with the ground truth temporal segment larger than µ. In our experiment, we only retrieved one temporal segment; therefore, we have n = 1. Following previous studies [28,19], we have reported µ ∈ {0.3, 0.5, 0.7} to evaluate the performance of the VSL-BASE and VSL-QGH models.
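These metrics can be illustrated with a small Python sketch for the n = 1 setting. Segments are (start, end) pairs in seconds; the function names and example spans are hypothetical, and counting a prediction as a hit when IoU is greater than or equal to µ (rather than strictly greater) is an assumption about boundary handling.

```python
def temporal_iou(pred, gold):
    """IoU of two (start, end) segments on the time axis."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    # For overlapping segments the hull equals the union; for disjoint
    # segments the intersection is 0, so the IoU is 0 either way.
    hull = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return intersection / hull if hull > 0 else 0.0


def evaluate(predictions, references, thresholds=(0.3, 0.5, 0.7)):
    """Return 'R@1, IoU=mu' for each threshold and mIoU, all as percentages."""
    ious = [temporal_iou(p, g) for p, g in zip(predictions, references)]
    results = {
        f"R@1, IoU={mu}": 100.0 * sum(iou >= mu for iou in ious) / len(ious)
        for mu in thresholds
    }
    results["mIoU"] = 100.0 * sum(ious) / len(ious)
    return results


# Hypothetical predicted and ground truth answer spans (seconds).
predicted = [(60.0, 130.0), (0.0, 10.0)]
ground_truth = [(65.0, 135.0), (61.0, 91.0)]
scores = evaluate(predicted, ground_truth)
```

In this toy example the first prediction overlaps its reference with an IoU of about 0.87 and the second misses entirely, so every R@1 value is 50.0.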
Benchmarking Results and Discussion: We have performed extensive experiments (cf. Table 4) with VSL-BASE and VSL-QGH models to evaluate the MedVidQA dataset. We start with the Random Mode approach, which randomly predicts the answer span based on the mode value of the visual answer lengths observed in the validation dataset. We also guess the answer span randomly and call the approach Random Guess. We have reported the results of random prediction on the MedVidQA test dataset in Table 4.
With the VSL-BASE model, we ran multiple experiments by varying the frame position length (FPL) from 400 to 1400 to assess its effect on the evaluation metrics. We observe that the VSL-BASE model performs better (except for IoU=0.3) with an FPL of 800. For IoU=0.3, an FPL value of 1400 outperforms the other variants of the VSL-BASE model. With the optimal FPL of 800, we perform our next set of experiments with the VSL-QGH model.
The VSL-QGH model depends on the extension ratio α, and the network is trained with joint span prediction and visual region (foreground or background) prediction losses. We experiment with the VSL-QGH model by varying α from 0.05 to 0.3 and report the results in Fig. 8. Fig. 8 shows that the model performs best with α = 0.25. We report the result for the VSL-QGH model with its optimal value of α in Table 4. The VSL-QGH model obtained 25.81 (IoU=0.3), 14.20 (IoU=0.5), 6.45 (IoU=0.7), and 20.12 mIoU. The performance of the VSL-QGH model in terms of mIoU (20.12) is slightly lower (↓ 0.03) than that of VSL-BASE with an FPL value of 800. The results show that visual answer localization is a challenging task, where the model should have the capability of inter-modal communication to locate the relevant frames in the videos. With multiple useful applications of medical visual answer localization in healthcare and consumer health education, we believe that the MedVidQA dataset and benchmarked setup can play a valuable role for further research in this area.

MedVidCL Benchmarking
We benchmarked our MedVidCL dataset with multiple monomodal and multimodal approaches. For monomodal approaches, we built several models by considering the language (video subtitle) and vision (video frames) separately for different models. To develop language-based monomodal approaches, we extracted the English subtitles from the videos using the Pytube library. We then trained statistical classifiers such as Linear SVC [30] and SVM [14] to predict the video category by considering the language features. We have also experimented with pre-trained language models such as BERT-Base-Uncased [31], RoBERTa-Base [32] and BigBird-Base [15]. For vision-based monomodal approaches, we extracted 20 frames from each video at a uniform time interval and used these frames as the sequence information to predict the video category. To process these frames, we utilized a 3D ConvNet (I3D), which was pre-trained on the Kinetics dataset [22], and the resulting features were passed as input to both LSTM and Transformer networks to predict the video category. We have also experimented with the pre-trained ViT [33] model, obtaining the frame representations from the ViT feature extractor. Similar to the I3D, we passed the resulting features to LSTM and Transformer networks to predict the video category. We extended our experiments from monomodal to multimodal settings by considering both the language and vision features. Using the language input (video subtitle) and vision input (video frames), we obtained their representations either from LSTM or Transformer networks. We then concatenated the language and vision features and passed the concatenated features to a feed-forward layer to predict the video category. Similar to the monomodal (vision) experiments, we use both I3D and ViT features to perform the multimodal experiments.
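Sampling 20 frames at a uniform interval can be sketched as an index computation in Python. The choice of taking the midpoint of each of the 20 equal-length chunks is an assumption, since the text does not state where within each interval the frame is taken; the function name is hypothetical.

```python
def uniform_frame_indices(total_frames, num_samples=20):
    """Return num_samples frame indices spread uniformly over the video.

    The video is divided into num_samples equal chunks and the midpoint of
    each chunk is selected (an assumed convention).
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    step = total_frames / num_samples
    return [min(total_frames - 1, int(step * i + step / 2)) for i in range(num_samples)]


# A hypothetical video with 2,000 decoded frames.
frame_indices = uniform_frame_indices(2000)
```

The resulting index list would then be used to pick the frames whose features are fed, in temporal order, to the LSTM or Transformer network.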
We chose the hyper-parameter values based on the classifier performance (average macro F1-score) on the validation dataset. For the Linear SVC and SVM classifiers, the optimal regularization values C were 1.5 and 1, respectively. The SVM model with the sigmoid kernel outperformed the other kernels on the validation dataset. We utilized the pre-trained Transformer models from Hugging Face [34] to perform the monomodal (language) experiments. Each pre-trained Transformer model was trained with the AdamW optimizer with a learning rate of 5e-5 for 10 epochs and a batch size of 8 (except for BigBird, where the batch size was 4). The LSTM and Transformer hidden states were set to 128 and 1024, respectively, for all the monomodal (vision) and multimodal experiments. Each monomodal (vision) and multimodal network was trained with a learning rate of 5e-5 for 30 epochs with a batch size of 16. We set the maximum text sequence length for multimodal experiments to 512.
Benchmarking Metrics: The evaluation metrics used to assess the performance of the systems are (a) the F1-score on the Medical Instructional class, and (b) the macro average F1-score across all the classes.
Benchmarking Results and Discussion: We have provided the detailed results of multiple monomodal and multimodal approaches in Table 5. Among language-based monomodal approaches, BigBird-Base outperforms the other methods and achieves a 95.68% overall macro average F1-score and a 94.28% F1-score for the medical instructional class. Pre-trained Transformer-based models perform better than their counterpart SVM variants. Since BigBird can accommodate and handle longer sequences effectively, which are likely in video subtitles, it shows better performance than the other pre-trained language models (BERT and RoBERTa). Among vision-based monomodal approaches, the feature representation learned using ViT (81.26 overall F1-score) is more effective than I3D (74.43 overall F1-score) with Transformer-based frame sequence processing. With multimodal approaches, we observed improvements over the respective monomodal (vision) approaches. We observed the maximum improvement of 1.12% overall F1-score with the multimodal approach (L + V (ViT) + Transformer) compared to the monomodal (ViT + Transformer) approach. Similar trends are also observed for Medical Instructional video classification, where we reported an increment of 1.12% F1-score with the multimodal approach (L + V (ViT) + Transformer) compared to the monomodal (ViT + Transformer) approach.
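Both metrics follow from per-class F1 scores, as the following Python sketch shows. The label abbreviations (MI, MNI, NM) and the toy predictions are hypothetical; the per-class F1 and its unweighted mean are the standard definitions.

```python
def f1_per_class(gold, predicted):
    """Per-class F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for c in sorted(set(gold) | set(predicted)):
        tp = sum(g == c and p == c for g, p in zip(gold, predicted))
        fp = sum(g != c and p == c for g, p in zip(gold, predicted))
        fn = sum(g == c and p != c for g, p in zip(gold, predicted))
        denominator = 2 * tp + fp + fn
        scores[c] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(gold, predicted)
    return sum(scores.values()) / len(scores)


# Toy labels: MI = medical instructional, MNI = medical non-instructional, NM = non-medical.
gold_labels = ["MI", "MI", "MNI", "NM"]
predicted_labels = ["MI", "MNI", "MNI", "NM"]
per_class = f1_per_class(gold_labels, predicted_labels)
overall = macro_f1(gold_labels, predicted_labels)
```

Because the macro average weights every class equally, it is not dominated by the larger medical non-instructional class, while `per_class["MI"]` isolates the class that matters most for answer localization.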
For the visual answer localization to the health-related questions, it is essential to predict the relevant medical instructional videos correctly; therefore, we prioritize the system's performance on medical instructional classes compared to the overall video classes. In this case, the F1-score on medical instructional videos is more important than the overall F1-score. We believe that a sophisticated language-vision fusion mechanism will further improve the performance (overall F1-score and Medical Instructional F1-score) of the multimodal approaches.

Figure 2: The schematic workflow of the MedVidQA dataset creation. Each step N is discussed in MedVidQA Data Creation.

Figure 3: The schematic workflow of the MedVidCL dataset creation. It starts with the collection of medical instructional and non-instructional videos from step 4 of the MedVidQA dataset creation. Each step X is discussed in MedVidCL Data Creation.

Sample MedVidQA entries (YouTube video ID, question, answer start, answer end):
DHDFmiQVcnY | How to check if the victim has sustained a spinal injury? | 01:05 | 02:15
S_TInkGhV4s | How to apply pressure on the wound? | 01:01 | 01:31

Table 1: Performance comparison (on the test dataset) of the different video classifiers used in creating the MedVidCL dataset. All reported results are macro averages.
Figure 5: Distribution of the instructional tasks category selected from the WikiHow.

Table 2: Detailed class-wise statistics of the MedVidCL dataset.

Table 3: Detailed MedVidQA dataset statistics for questions, videos, and visual answers. Question length denotes the number of tokens in the questions after performing tokenization with the NLTK [29] tokenizer.

Table 4: Performance comparison of the variants of VSL models on the MedVidQA dataset. Here, FPL refers to the frame position length considered during training of the respective models.

Table 5: Performance of the monomodal and multimodal approaches on the MedVidCL test dataset. Here, L and V denote Language and Vision, respectively. Precision, Recall, and F1-score denote macro averages over all the classes. The best results amongst monomodal (language) approaches are highlighted with green shade. Similarly, the best results from monomodal (vision) and multimodal approaches are shown in yellow and purple shades, respectively.
Figure 8: Effect of extension ratio (α) on the performance of the VSL-QGH model on the MedVidQA test dataset.