A comparative investigation of machine learning algorithms for predicting safety signs comprehension based on socio-demographic factors and cognitive sign features

This study examines whether socio-demographic factors and cognitive sign features can be used to predict the comprehensibility of safety signs using machine learning (ML) techniques. It also examines the role of different ML components, such as feature selection and classification, in identifying suitable factors for the comprehensibility of construction safety signs. A total of 2310 participants were asked to guess the meaning of 20 construction safety signs (four items for each of the mandatory, prohibition, emergency, warning, and firefighting signs) using the open-ended method. The participants were also asked to rate the cognitive design features of each sign in terms of familiarity, concreteness, simplicity, meaningfulness, and semantic closeness on a 0–100 rating scale. Subsequently, all eight features (age, experience, education level, familiarity, concreteness, meaningfulness, semantic closeness, and simplicity) were used for classification. The 14 most popular supervised classifiers were implemented and evaluated for safety sign comprehensibility prediction using these eight features, and filter and wrapper methods were used as feature selection techniques. The feature selection results indicate that, among the eight features considered in this study, familiarity, simplicity, and meaningfulness are the most relevant and effective components in predicting the comprehensibility of the selected safety signs. When these three features are used for classification, the K-NN classifier achieves the highest classification accuracy of 94.369%, followed by medium Gaussian SVM, which achieves a classification accuracy of 76.075% under the hold-out data division protocol. ML was adopted as a promising approach to addressing the issue of comprehensibility, especially in terms of determining the factors affecting safety sign comprehension.
The cognitive sign features of familiarity, simplicity, and meaningfulness can provide useful information in terms of designing user-friendly safety signs.

Subjects and sampling. The study population comprised 2310 male construction workers between the ages of 18 and 63 years from different districts of the metropolitan city of Tehran, Iran. A three-stage sampling method was used. First, a stratified sampling method was used to identify five clusters based on population distribution in Tehran. In the second stage, after a list of all the construction projects located in the selected clusters had been compiled, a systematic random sampling method was applied to choose five construction projects per cluster. The required minimum sample size of 400 subjects in each cluster (80 for each construction project) was determined using a sample size formula in which Z1−α/2 = 1.96 (the standard normal deviate at the 0.05 significance level), Z1−β = 0.85 (the standard normal deviate at a study power of 0.8), d = 2.4 (the expected absolute allowable error in the mean), and s = 17.1 (the expected standard deviation according to the study conducted by Chan et al. 29 ). Considering the design effect for the clustered sampling method (Deff = 2.2) 30 and allowing for a non-response rate of about 10%, the desired sample size was 2310 subjects.
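The sample size formula itself is not reproduced in the text above. As an illustrative sketch only, assuming the standard expression for estimating a mean, n = (Z1−α/2 + Z1−β)² s² / d², the stated inputs give a per-cluster minimum of roughly 400, which agrees with the reported figure. The function name is hypothetical, and the formula is an assumption rather than a quotation from the study.

```python
def min_sample_size(z_alpha: float, z_beta: float, s: float, d: float) -> float:
    """Assumed formula (not quoted from the study):
    n = (z_alpha + z_beta)^2 * s^2 / d^2
    """
    return (z_alpha + z_beta) ** 2 * s ** 2 / d ** 2


# Inputs as reported in the text: Z = 1.96 and 0.85, s = 17.1, d = 2.4
n_cluster = min_sample_size(z_alpha=1.96, z_beta=0.85, s=17.1, d=2.4)
print(f"Minimum subjects per cluster: about {n_cluster:.0f}")
```

The sketch covers only the per-cluster minimum; it does not attempt to reproduce the adjustment that leads to the final total of 2310.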
All participants were Persian-speaking, with self-declared normal or corrected-to-normal vision and good mental and physical health status at the time of the survey. Those who declined to participate, or who had blurred or poor vision or diabetes, were not enrolled in the study. Participants were given information on what the study was about. Informed consent was obtained from all subjects and/or their legal guardian(s). Participants were assured of the complete confidentiality of the study, and data and results were kept secure in accordance with the University's local instructions under the Data Protection Act. The study protocol was approved by the Research and Ethics Committee of the Iran University of Medical Sciences (Reg. IR.IUMS.REC.1397.177). All methods were performed following relevant guidelines and regulations.
Safety sign selection. Safety signs for a varied range of hazard types were included to foster greater generalizability of the test results. Fifteen health and safety experts participated in the safety sign selection process. The experts were identified and selected using the snowball technique (also known as chain-referral sampling), a non-probability (non-random) sampling method used when the characteristics to be possessed by samples are rare and difficult to find 31 . To select the safety signs, all 220 safety signs of the ISO 3864-2:2016 standard 32 (42 mandatory signs, 42 prohibition signs, 50 emergency signs, 55 warning signs, and 31 firefighting signs) were printed in color in squares of 2 × 2 cm on separate white sheets of paper. These signs were then sent to the safety experts, who were asked to select signs that are infrequently used and have a distinct type and purpose in each of the five categories: mandatory, prohibition, emergency, warning, and firefighting. Finally, 20 safety signs were selected: 4 mandatory signs (codes M1-M4), 4 prohibition signs (codes P1-P4), 4 emergency signs (codes E1-E4), 4 warning signs (codes W1-W4), and 4 firefighting signs (codes F1-F4). Figure 1 shows the final set of safety signs with their codes and their respective intended meanings.
Experimental design and procedure. The data were collected using a questionnaire with three sections in the native language of the participants (Persian). In the process of designing the questionnaire, contributions of industrial and organizational psychologists, health and safety specialists, civil engineers, and enforcement agencies resulted in a construction characteristics portion and a construction safety signs evaluation portion.
Socio-demographic characteristics. The first part comprised questions on age, education level, years of experience, occupational status, and previous sign-related knowledge. Since prior sign-related knowledge could affect the results of the study, subjects with such knowledge were excluded.
Safety signs comprehensibility. For the construction safety sign evaluation portion (second part), the 20 signs were printed as color photographs (approximately 7 × 7 cm in size), each on a separate sheet of white A4 paper (correct meanings were not included). The sheets were evenly assigned to 10 test booklets, each containing the 20 non-duplicated safety signs. Each participant responded to only one test booklet, randomly assigned to him. The basic method of assessment was open-comprehension testing as described in ANSI (American National Standards Institute) Z535.3 (2007b) 33 and ISO (International Organization for Standardization) 9186 (2001) 34 . The examiner verbally asked the participant the following questions: (1) Have you ever seen this sign? (2) What is the meaning of this safety sign? (3) What should be done when this safety sign is seen? In addition to the verbal questioning, the questions were also printed on sheets that each participant could read at the same time. This procedure was suggested by ISO 9186 (2001) and was thus used to determine the comprehension correctness level in the present study 34 . Participants were tested individually and gave oral answers throughout the experimental procedure.
Comprehension data were obtained separately for the pictorial signs and for the signs' background color and shape code. The authors and two other graphic/communication design experts, acting as judges, individually scored all participant responses. While scoring, the judges had each symbol's intended meaning and the participant's written responses. Their task was to decide, independently, whether the participants' interpretations matched the intended meanings of the signs by assigning a score of "1" to correct responses and a score of "0" to incorrect ones. If the three judges were unable to agree on the judgment for a response, a consensus-based decision was made. To ensure the reliability of this process, inter-rater reliability was calculated by averaging the [...]
1. A correct comprehension of the sign meaning is certain (estimated probability of correct understanding over 80%).
2. A correct comprehension of the sign meaning is very probable (estimated probability of correct understanding between 66 and 80%).
3. A correct comprehension of the sign meaning is probable (estimated probability of correct understanding between 50 and 65%).
4. The meaning, which is understood, is opposite to that intended.
5. Any other response.
6. The response given is "don't know".
7. No response is given.
Then, the percentage of participants' responses obtained in each of the first three categories was multiplied by a correction factor described in ISO 9186 (2001). The sum of these three values was labeled as the "Score". The percentage of responses classified as the opposite (category 4) was subtracted from the "Score", resulting in the "Overall Score". Negative scores arise when a high percentage of responses gave the opposite meaning (i.e., critical confusion).
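The scoring rule just described can be sketched in a few lines. The correction factors are not given in the text above, so the weights of 1.0, 0.75, and 0.5 used here are an assumption based on values commonly cited for ISO 9186 (2001); the function name and the example percentages are hypothetical.

```python
# ASSUMPTION: correction factors for response categories 1-3.
# These values are not stated in the text and should be checked against ISO 9186 (2001).
ASSUMED_WEIGHTS = {1: 1.0, 2: 0.75, 3: 0.5}


def overall_score(category_pct: dict[int, float]) -> float:
    """Map {response category (1-7): % of responses} to the "Overall Score".

    Score = weighted sum of categories 1-3; Overall Score = Score - % in category 4.
    """
    score = sum(w * category_pct.get(cat, 0.0) for cat, w in ASSUMED_WEIGHTS.items())
    return score - category_pct.get(4, 0.0)


# Hypothetical sign with many opposite-meaning answers (category 4)
example = {1: 10.0, 2: 8.0, 3: 4.0, 4: 20.0, 5: 40.0, 6: 10.0, 7: 8.0}
print(overall_score(example))
```

With these hypothetical percentages the weighted score is smaller than the share of opposite answers, so the overall score is negative, which illustrates how negative values can arise under critical confusion.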
A criterion used for sign comprehension testing was adapted to fit the role of measuring participants' interpretation of the shape-color background meaning (separate from the sign). The shape-color code was assessed relative to the following:
• Mandatory: round shape, a white symbol on blue background.
• Prohibition: round shape, a black symbol on white background, red edging, and diagonal line.
• Emergency: square or rectangular shape, a white symbol on a green background.
• Firefighting: square or rectangular shape, a white symbol on a red background.
• Warning: equilateral triangle shape, a black band with a black symbol on yellow background.
This evaluation was performed from the answers given to the question "What do you think the sign means?" Completely correct responses should include the meaning of the symbol and the shape-color code. Critical confusion was assessed from responses attributing the opposite meaning to the shape and color components. For this purpose, participants' answers to the question "What action would you take in response to this safety sign?" were evaluated. Based on the ANSI Z535.3 recommendations 33 , the criterion for safety sign acceptance is that at least 85% of test subjects correctly interpret the icon/pictogram and no more than 5% of subjects are critically confused. ISO 3864 was also used as a similar comprehension criterion for safety signs, with a minimum correct recognition rate of 66.7% 32 .
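As a minimal sketch, the two acceptance criteria stated above can be written as a single check. The thresholds are taken from the text; the function name, the dictionary keys, and the critical-confusion percentages in the usage lines are hypothetical.

```python
def acceptance(correct_pct: float, critical_confusion_pct: float) -> dict[str, bool]:
    """Apply the acceptance thresholds described in the text.

    ANSI Z535.3: at least 85% correct and no more than 5% critical confusion.
    ISO: at least 66.7% correct.
    """
    return {
        "ansi_z535_3": correct_pct >= 85.0 and critical_confusion_pct <= 5.0,
        "iso": correct_pct >= 66.7,
    }


# Comprehension rates reported for two signs; the confusion values are made up
print(acceptance(correct_pct=89.4, critical_confusion_pct=1.0))
print(acceptance(correct_pct=71.4, critical_confusion_pct=1.0))
```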
Cognitive sign features. In the third section, the cognitive sign features test proposed by McDougall et al. 18 was used to record subjects' views of each construction safety sign. The authors reported strong validity and reliability for the original version, leading several researchers to use it thereafter 35,36 . The Persian version of this questionnaire, validated by Taheri et al. (2018), was applied in the present study 37 . The cognitive sign features sheets considered five features, namely familiarity, concreteness, simplicity, meaningfulness, and semantic closeness. Familiarity refers to the rate at which a sign has been encountered. Signs are considered concrete if they are drawn similarly to real objects. The criterion of simplicity indicates the degree to which the signs are detailed. Meaningfulness indicates how meaningful users perceive a sign to be. Semantic closeness refers to the closeness of the association between what is depicted on a sign and what it is intended to represent. Complete explanations of the meaning of the five cognitive sign features and the rating instructions were given to each participant. Participants were asked to subjectively rate the design features of each safety sign on a 0-100 point scale for familiarity (0 = very unfamiliar, 100 = very familiar), concreteness (0 = clearly abstract, 100 = clearly concrete), simplicity (0 = very complex, 100 = very simple), meaningfulness (0 = completely meaningless, 100 = completely meaningful), and semantic closeness (0 = very weakly related, 100 = very strongly related). The ratings were marked on 5-item questionnaires placed under the given sign on each page of the test booklet (described above). Completing a test booklet took about 30-45 min for each participant. The process was repeated until all safety signs were completely rated. The entire interview process was guided by a sole investigator (the second author).
The local research ethics committee approved the study protocol.
Descriptive analysis. Statistical analysis was performed with SPSS 23 (IBM Corporation, New York, NY, United States). Normality was checked using the Kolmogorov-Smirnov test for all data sets. Statistical outliers were checked using Grubbs' test, which is based on the difference between the sample mean and the most extreme value, scaled by the standard deviation 38 . Relative and absolute reliability were assessed for the comprehension performance test using the intra-class correlation coefficient (ICC) and the standard error of measurement (SEM), respectively. Basic descriptive statistics such as means, frequencies, and percentages were calculated for demographic characteristics as well as for cognitive sign features and comprehension performance scores. An analysis of covariance (ANCOVA) with Bonferroni-adjusted post-hoc tests was then performed to test the effects of the socio-demographic factors and cognitive sign features included in the study on the comprehension rate.
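The outlier check described above can be illustrated with a short sketch of the two-sided Grubbs' test. The study reports using SPSS, so this is only an illustration of the statistic, not the authors' procedure; the function name and the sample ratings are hypothetical.

```python
import numpy as np
from scipy import stats


def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier.

    G = max|x_i - mean| / s, compared against a critical value derived
    from the t distribution with n - 2 degrees of freedom.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit, bool(g > g_crit)


# Hypothetical ratings: the value 50 is far from the rest
g, g_crit, is_outlier = grubbs_test([10, 11, 9, 10, 12, 10, 11, 50])
print(f"G = {g:.3f}, critical value = {g_crit:.3f}, outlier detected: {is_outlier}")
```

The test assumes approximately normal data and flags at most one value per pass, so it is usually applied after a normality check.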
Statistical learning approach. The proposed framework for predicting safety sign comprehensibility from socio-demographic factors and cognitive sign features using the ML approach is presented in Fig. 2. The left and right sides of Fig. 2 show the offline system (training phase) and the online mode (testing phase), respectively. The implementation steps of these phases are explained in the following sections, along with details of the dataset used in this study.
Feature selection. The selection of reliable factors plays a crucial role in representing and classifying safety sign comprehensibility using machine learning (ML) techniques. Feature selection is the procedure of choosing the most pertinent features and building a sensible model with better predictive power for sign comprehensibility. Broadly, feature selection techniques are classified into two types, namely filter and wrapper methods. Filter methods measure the relevance of features by their correlation with the dependent variable, while wrapper methods attempt to find the "optimal" feature subset by iteratively selecting features based on classifier performance. In this study, we used filter methods to rank the features and select the relevant ones using several principal criteria: Information Gain (IG) 39 , Pearson's correlation coefficient (P) 40 , 1R 41 , Gain Ratio (GR) 42 , Relief-F (RF) 43 , and Symmetrical Uncertainty (SU) 44 . A correlation-based wrapper feature selection (CFS) approach was used to select the most reliable subset of components 45 . This method generates different possible subsets from the given features and then evaluates them using a specific objective function. We kept the subset of features with the highest performance and discarded all other subsets. Further, a robust rank aggregation (RRA) technique, as a hybrid approach, was also implemented and evaluated 46 .
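To make the filter idea concrete, the sketch below ranks three synthetic features with two criteria and combines the rankings. It is illustrative only: mutual information from scikit-learn stands in for Information Gain, a plain mean of ranks stands in for robust rank aggregation, and the data, feature names, and helper function are all hypothetical rather than taken from the study.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
names = ["familiarity", "simplicity", "age"]  # hypothetical feature names
X = rng.uniform(0, 100, size=(n, 3))
# Synthetic label: driven mostly by column 0, weakly by column 1, not by column 2
y = (0.8 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 5, n) > 50).astype(int)

# Filter criterion 1: mutual information (used here as a stand-in for Information Gain)
mi = mutual_info_classif(X, y, random_state=0)
# Filter criterion 2: absolute Pearson correlation with the label
pearson = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])


def ranks(scores):
    """Return ranks where 1 marks the highest score."""
    order = np.argsort(-np.asarray(scores))
    r = np.empty(len(scores), dtype=int)
    r[order] = np.arange(1, len(scores) + 1)
    return r


# Simple aggregation: average the ranks from both criteria (not the RRA algorithm)
mean_rank = (ranks(mi) + ranks(pearson)) / 2
best = names[int(np.argmin(mean_rank))]
for name, r in sorted(zip(names, mean_rank), key=lambda item: item[1]):
    print(f"{name}: mean rank {r:.1f}")
```

Different criteria can disagree on the ordering of weaker features, which is the motivation for aggregating ranks instead of relying on a single criterion.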
Classification. The final phase of any ML approach is the classifier, which maps input feature vectors x ∈ X to output class labels y ∈ {1,…, n}, where X is the feature space and n is the total number of classes. Classification techniques are broadly divided into two types, namely supervised and unsupervised. In a supervised classifier, the training samples are supplied along with their class labels. The class label of unknown cases, i.e. the test samples, is then determined based on the parameters of the trained classifier model. In this study, some of the most popular supervised classifiers, such as Binary Logistic Regression (BLR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Classification and Regression Tree (CART), Support Vector Machine (SVM), Random Forest (RF), Bootstrap Aggregating (also known as Bagging), K-Nearest Neighbor (K-NN), and Adaptive Boosting, were used to predict which of the socio-demographic factors and cognitive sign features (i.e. independent variables) are important for safety sign comprehensibility (i.e. dependent variable). We chose these classifiers because, according to the literature, they have been used successfully in previous Computer-Aided Diagnosis (CAD) studies 47-51 . The overall machine learning analysis was programmed using Scikit-Learn 0.20.3, a popular Python ML library 52 .
Performance evaluation metrics and methods. The performance metrics used to evaluate the classifiers are sensitivity (recall), specificity, accuracy, precision, F1 score, and area under the curve (AUC) 53 . Sensitivity or recall is the ability of a classifier to correctly categorize a person with correct comprehensibility as the positive class; specificity is the ability of a classifier to correctly categorize a subject with incorrect comprehensibility as the negative class; accuracy is the fraction of individuals correctly classified as positive or negative by an ML model; precision, also known as the positive predictive value, is the fraction of true positives among the workers who were predicted as positive; and the F1 score is the harmonic mean of precision and recall 54 . Along with these performance measures, the area under the receiver operating characteristic curve (AUC) is also used to compare classifier models. The mathematical formulas used to calculate the above performance metrics are shown in Eqs. (1)-(5) 53,55 .
$$\text{Sensitivity (recall)} = \frac{TP}{TP + FN} \tag{1}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{2}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$
where TP: true positive, FP: false positive, FN: false negative, and TN: true negative.
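The metric definitions in this section can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name and the example counts are hypothetical and do not come from the study's data.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute the threshold-based metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)           # recall: share of actual positives found
    specificity = tn / (tn + fp)           # share of actual negatives found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)             # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for 200 test samples
m = binary_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

AUC is not included because it depends on the ranking of predicted scores rather than on a single confusion matrix.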
Data division protocol. K-fold cross-validation, the most popular data division approach and one widely accepted by the research community, was used in this study so that model performance could be compared with that of existing predictors. In this approach, the whole dataset is divided into 'k' groups, each consisting of an approximately equal number of samples. Of the 'k' groups, 'k − 1' groups are used for training the classifier model, while the remaining group is used for testing 56 . The process is repeated 'k' times and the average performance over the 'k' rounds is calculated. In this study, experiments were conducted with k = 10, and the average results were used to evaluate the model 57,58 .
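The tenfold protocol described above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the study's pipeline: the dataset, the hyperparameters, the feature scaling step, and the use of an RBF-kernel `SVC` as a stand-in for a Gaussian SVM are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for three selected features and a binary comprehension label
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# k = 10: each fold is used once for testing, the other nine for training
cv = KFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "K-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "RBF SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}
for name, scores in results.items():
    print(f"{name}: mean accuracy over {len(scores)} folds = {scores.mean():.3f}")
```

Placing the scaler inside the pipeline means it is refitted on each training split, so no information from the held-out fold leaks into training.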

Results
Demographic characteristics. The experiment included 2310 male construction workers ranging in age from 18 to 63 years (mean = 45.31, SD = 11.27). All of the participants had at least 5 years of construction work experience (range 6-45 years; mean = 16.45, SD = 2.13), and more than half of the subjects were in the 18-35 age range. The main demographic characteristics of the sample are reported in Table 1.
Comprehension score of signs. Table 2 shows the overall scores (mean ± SD) for comprehension of pictorial symbol meaning and shape-color code. The "Do not use this lift" (P7) and "No smoking" (P5) signs had the minimum and maximum comprehension, respectively, of both sign meaning (16.6% for P7 and 89.4% for P5) and shape-color code (−4.1% for P7 and 92.2% for P5). The American National Standards Institute (ANSI) and the International Organization for Standardization (ISO) have recommended that symbols must reach a criterion of at least 85% or 67% correct, respectively, in a comprehension test to be considered acceptable 59 . As shown in Table 2, only one safety sign reached both the ISO and ANSI criteria, "No smoking" (P5; 89.4%). Another seven safety signs achieved only the lower ISO criterion, namely "Wear eye protection" (M4; 71.4%), "No naked flames" (P6; 78.8%), "No entry" (P8; 71.8%), "First aid" (E9; 77.7%), "High voltage" (W13; 73.6%), "Fire extinguisher" (F17; 81.2%), and "Firefighting equipment" (F20; 67.6%). The overall mean sign comprehension scores across all safety signs for each of the five sign groups were: [...]
Comprehension score of shape-color code. Comprehension of the safety signs' shape-color coding was also examined (see Table 2). The 67% level (similar to the ISO sign comprehension criterion) and the 85% level (similar to the ANSI sign comprehension criterion) were used as standard acceptability criteria against which to compare the levels found in the present study. There were seven safety signs reaching both the ISO and ANSI criteria, namely "Wear eye protection" (M4; 86.7%), "No smoking" (P5; 92.2%), "No entry" (P8; 87.5%), "First aid" (E9; 89.1%), "Fire extinguisher" (F17; 88.5%), and "Firefighting equipment" (F20; 86.4%).
Only 14 of the 20 safety signs attained the 67% comprehension criterion for shape-color code in the present study (signs M1-Wear safety helmet; M2-Wear protective footwear; M4-Wear eye protection; P5-No smoking; P6-No naked flames; P8-No entry; E9-First Aid; E10-Drinking water; E11-Emergency telephone; W13-High voltage; W15-Slippery floor; F17-Firefighting extinguisher; F19-Fire emergency telephone; F20-Firefighting equipment). Table 2 shows that the shape-color coding of several signs was poorly comprehended (signs M3-Wear safety harness; P7-Do not use this lift; E12-Emergency stop; W14-Falling objects; W16-Overhead crane; F18-Fire hose reel). The mean shape-color code comprehension scores across all safety signs for each of the five sign groups were: [...] The data show that the firefighting signs attained a somewhat higher level of shape-color code comprehension than the warning, emergency, prohibition, and mandatory signs. The overall mean shape-color code comprehension across participants was 67.41% (SD = 18.27), ranging from −18.25 to 98.21. Friedman's two-way analysis of variance by ranks revealed a significant effect of sign group on shape-color comprehension (χ2(4) = 16. [...] Also, there was a significant difference between each of the mandatory and emergency signs and the prohibition and warning signs, but there were no significant differences between the mandatory and emergency signs.
Table 3 shows that 7 of the 20 signs generated at least some critical confusion (opposite answers). Scores in bold in Table 3 indicate the signs that exceeded the ANSI Z535.3 acceptability level of 5% critical confusion for comprehension of pictorial symbols and shape-color code. According to ANSI Z535.3, signs that exceed the 5% critical confusion level should be rejected.
Based on this, three safety signs would be rejected on the basis of their comprehension scores for sign meaning and shape-color code. These signs were M3-"Wear safety harness", P7-"Do not use this lift", and W14-"Falling objects". In general, the workers showed more critical confusion in comprehending the shape-color code of signs than in comprehending sign meaning.
Cognitive sign features. The safety signs' features were evaluated in five categories using a 0-100 rating scale. All of the mean ratings exceeded 60, and the highest mean rating was for meaningfulness (71.47). Table 4 shows the signs with the lowest and highest ratings on cognitive sign features. Although all the subjects were experienced workers, sign E12 (emergency stop) was rated as very unfamiliar (37.25). The most familiar sign was P5 (no smoking). Sign P6 (no naked flames) was perceived to be very simple and definite, while signs M3 (wear safety harness) and W16 (overhead crane) were identified as the most complex and somewhat vague, implying that the perceived simplicity of a sign is not only related to the number of elements in the sign but may also be affected by other factors such as sign concreteness or meaningfulness. Sign M3 (wear safety harness) had the lowest concreteness rating (39.28) and the lowest meaningfulness rating (47.08). The F17 (fire extinguisher) sign had the highest meaningfulness rating (89.21) and semantic closeness rating (82.07).
[...] and four (MS/Ph.D., bachelor's degree, high school, less than high school) categories. Two-way analysis of variance (ANOVA) was used to analyze the differences among group means, and the results are presented in Table 5. Table 5 shows that the level of workers' comprehensibility of prohibition, emergency, warning, and firefighting safety signs varies significantly with age group (p < 0.001). On the other hand, the level of workers' comprehensibility of mandatory signs is not affected by worker age (p = 0.230). To find out which age group has the highest effect on the comprehensibility of prohibition, emergency, warning, and firefighting safety signs, Bonferroni post-hoc tests were used. For prohibition, warning, and firefighting safety signs, the age group of 36-45 years had higher comprehensibility (71.3%) than the age groups of less than 25 years (53.6%) and older than 56 years (56.2%).
For emergency safety signs, the age groups of 36-45 years and 46-55 years had higher comprehensibility (65.2% and 69.4%) than the age group of less than 25 years (59%).
In examining the effect of education level on workers' comprehensibility, the workers' educational level had a significant effect only on the comprehensibility of firefighting signs (p = 0.042). Based on the Bonferroni post-hoc test results, participants with MS/Ph.D. and bachelor's degrees had higher comprehensibility (70% and 68.4%) than participants with an education level of less than high school (55.4%).
To find out whether there are any statistically significant differences in participants' comprehensibility with working experience, a two-way ANOVA test was used. Table 5 shows that construction-related working experience has a significant effect on the participants' comprehensibility of all safety signs (p-values < 0.001). To find out which level of working experience has the highest effect on safety signs' comprehensibility, Bonferroni post-hoc tests were used. It can be concluded that participants with a working experience of 16-45 years had a higher degree of comprehensibility than those with 5-15 years of working experience (p-values < 0.001), whereas no significant difference in the comprehensibility of safety signs was observed between workers with a working experience of 16-30 and 31-45 years (p-values < 0.001).

Relationships between socio-demographic factors and cognitive sign features with safety sign comprehensibility.
In this study, the scores of the cognitive sign features were normally distributed (Kolmogorov-Smirnov, P > 0.05). Pearson's correlation test was carried out in each sign category to evaluate whether there were significant correlations between the measured sign meaning and shape-color code comprehension and the users' factors and cognitive sign features (see Table 6).
Results of feature selection. Table 7 shows the results of the various feature selection techniques. Pearson's correlation test in Table 6 showed that only one feature, namely "education level", was not correlated with safety sign comprehension (P > 0.05). However, the results of the feature selection techniques in Table 7 revealed that "education level" can also be a significant feature for sign comprehensibility classification. Thus, experiments were conducted initially for all possible combinations, including top 5, top 4, top 3, top 2, etc. In filter-based methods, features are arranged in decreasing order of their rank, while in wrapper-based methods, the best subset of features is selected. The ranks assigned to the features by the different feature selection techniques differ slightly. For example, if Pearson's correlation coefficient (P) is used as the principal criterion, "Familiarity" is considered the most reliable factor. On the other hand, if RF is used as the principal criterion, "Simplicity" is considered the most reliable factor. Similarly, "Experience" is assigned the second rank if P is used as the principal criterion, while it is assigned the sixth rank if GR or IG is used. It is thus concluded that relying on one principal criterion may not always result in an optimal subset of factors: an optimal subset of factors selected using one assessment measure may not be the same as that selected using another. The performance of the various feature selection techniques is evaluated using kernel-based SVM. The corresponding results are presented and discussed in the following section.

Results of classification using kernel-based SVM. This section presents the results of different SVM classifiers with and without the feature selection step. Six performance measures (accuracy, sensitivity, specificity, precision, F1-score, and AUC) were used for evaluation under tenfold cross-validation. Table 8 shows the performance of different SVM classifiers without using the feature selection technique (i.e. all eight socio-demographic factors and cognitive sign features are supplied as input to the classifier). It is found that medium Gaussian SVM outperforms the other classifiers, achieving the highest classification accuracy of 75.660% without feature selection under tenfold cross-validation. In contrast, coarse Gaussian SVM performs worst, achieving the lowest classification accuracy of 54.681% under the tenfold data division protocol. Table 9 shows the performance of different SVM classifiers when the top 3 features selected by Pearson's correlation coefficient (P), namely familiarity, experience, and concreteness, are supplied as input to the classifier. It is found that the medium Gaussian SVM classifier under tenfold cross-validation outperforms the others, achieving a classification accuracy of 73.760%. Fine Gaussian SVM and linear SVM follow with classification accuracies of 71.008% and 70.102%, respectively. The worst performance is demonstrated by the cubic SVM classifier, which displays the lowest classification accuracy under all data division schemes, with an accuracy of 52.208%. Table 10 shows the performance of different SVM classifiers when the top 3 features selected by information gain (IG), gain ratio (GR), and symmetrical uncertainty (SU), namely familiarity, simplicity, and concreteness, are supplied as input to the classifier. It is found that the medium Gaussian SVM classifier outperforms the others under all data division schemes, with an accuracy of 75.615%.
On the other hand, the categories of test samples predicted by cubic SVM match the ground truth categories least, resulting in its lowest classification accuracy under tenfold cross-validation. Table 11 shows the performance of different classifiers when the top 3 features selected by 1R and Relief-F (RF), namely familiarity, simplicity, and meaningfulness, are supplied as input to the classifier. It is found that the medium Gaussian SVM classifier outperforms the others under all data division schemes. It achieves the highest classification accuracy of 83.210% under tenfold cross-validation. In contrast, coarse Gaussian SVM results in the lowest classification accuracy of 68.540%. It is interesting to note that, compared with all other feature combinations, the combination of familiarity, simplicity, and meaningfulness achieves the highest classification accuracy of 83.210%.
Table 8. Performance of various SVM-based classifiers without using feature selection under tenfold cross-validation. Significant values are in bold.
Table 9. Performance of various SVM-based classifiers using top 3 features selected by Pearson's correlation coefficient (P) feature selection evaluated by tenfold cross-validation. Significant values are in bold.
Table 10. Performance of various SVM-based classifiers using top 3 features selected by the gain ratio (GR), information gain (IG), and symmetrical uncertainty (SU) feature selection evaluated by tenfold cross-validation. Significant values are in bold.
Table 12 shows the performance of different classifiers when the best subset of features selected by correlation-based wrapper feature selection (CFS) and the top 3 features selected by robust rank aggregation (RRA) are supplied as input to the classifier. The feature combination evaluated in this case is simplicity, familiarity, and experience.
As in all of the previous cases, it is found that the medium Gaussian SVM classifier outperforms others under all data division schemes. It achieves the highest classification accuracy of 76.075% under tenfold cross-validation.
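The evaluation protocol described above (tenfold cross-validation scored with accuracy, sensitivity, specificity, precision, and F1-score) can be sketched in plain Python as follows. This is only an illustrative sketch, not the authors' implementation: the helper names `kfold_indices` and `binary_metrics`, the shuffling seed, and the toy labels are assumptions introduced here. AUC is omitted because it requires classifier scores rather than hard class labels.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: indices are shuffled once, then dealt into k folds
    so that every sample is used for testing exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, specificity, precision and F1 as a dict."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Guard against empty classes in small folds.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy labels (hypothetical): 1 = sign comprehended, 0 = not comprehended.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

In a full run, a classifier would be fitted on each training split, scored on the matching test split with `binary_metrics`, and the per-fold values averaged over the ten folds.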

From the results of Tables 8, 9, 10, 11 and 12, it is concluded that the feature combination of familiarity, simplicity, and meaningfulness achieves the highest classification accuracy. To study and confirm the impact of these factors on safety signs comprehensibility, some other popular classifiers such as binary logistic regression (BLR), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), classification and regression tree (CART), random forest (RF), bootstrap aggregating algorithm, k-nearest neighbor (K-NN), and adaptive boosting (AdaBoost) were also evaluated, as discussed in section "Classification" (see Table 13). It is observed that when familiarity, simplicity, and meaningfulness were used as features, the K-NN classifier achieves the highest classification accuracy of 94.369% under tenfold cross-validation. This shows that familiarity, simplicity, and meaningfulness together can have a significant impact on the prediction of safety signs comprehensibility using machine learning techniques. Other classifiers such as AdaBoost and random forest (RF) also performed satisfactorily, achieving classification accuracies of 85.260% and 83.102% under tenfold cross-validation, respectively. These results are comparable to those obtained with SVM.

To establish the statistical significance of the improvement in classifier performance from 83.210% using medium Gaussian SVM-tenfold (see Table 11) to 94.369% using K-NN-tenfold (see Table 13), a z-statistic was calculated at the 95% confidence level using the approach explained in the Isaac (2015) study for the test concerning two proportions 60 . The z-statistic is found to be -2.204 with a p-value of less than 0.05 at the 95% confidence level. This confirms that the improvement in classification accuracy of the K-NN classifier over the medium Gaussian SVM classifier is statistically significant.

Analyzing the results of Tables 8, 9, 10, 11, 12 and 13, it was found that the best combination of sensitivity, specificity, precision, F1-score, and AUC is achieved by the K-NN classifier under tenfold cross-validation when familiarity, simplicity, and meaningfulness were supplied as input to the classifier model. The remaining performance metrics, namely sensitivity, specificity, precision, F1-score, and AUC, were 95.511%, 94.276%, 95.432%, 0.950, and 0.991, respectively. It is also observed that, for most of the feature combinations, sensitivity is high while specificity is low.

Table 11. Performance of various SVM-based classifiers using top 3 features selected by 1R and Relief-F (RF) feature selection evaluated by tenfold cross-validation. Significant values are in bold.
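A test concerning two proportions, as used above to compare the two accuracies, can be sketched as follows. This is a standard pooled two-proportion z-test written in plain Python and may differ in detail from the approach in the cited reference. The function name `two_proportion_z_test` is introduced here for illustration, and the sample sizes `n1` and `n2` are hypothetical, since the text does not state how many test samples entered the comparison, so the sketch is not expected to reproduce the reported z-statistic.

```python
import math


def two_proportion_z_test(p1, n1, p2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Accuracies reported in the text; n1 and n2 are hypothetical sample sizes.
svm_accuracy = 0.83210
knn_accuracy = 0.94369
z, p_value = two_proportion_z_test(svm_accuracy, 100, knn_accuracy, 100)
significant = p_value < 0.05
```

A negative `z` here simply reflects the order of subtraction (SVM accuracy minus K-NN accuracy), which is consistent with the negative sign of the statistic reported above.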

Discussions
In the past few decades, a large body of safety signs research has examined how sign characteristics (such as symbol, shape, color, and incongruent information), socio-demographic factors (such as gender, age, culture, education level, work experience), and cognitive sign features impact safety signs comprehensibility 8,16,32 . These studies provide basic principles and guidelines for the design of more effective safety signs; however, the present study takes a step further by using general-purpose learning algorithms to find patterns in often rich and unwieldy data that affect sign comprehension. This study assesses the comprehensibility of safety signs, which are used to reduce or eliminate hazards in the working environment within the hierarchy of risk controls as part of engineering/administrative controls 61,62 . This is the first study, to our knowledge, to examine the effects of socio-demographic factors and cognitive sign features on the comprehensibility performance of safety signs among construction workers using eight different feature selection techniques and various popular machine learning (ML) classifiers. In addition, the supervised machine learning models presented in this study can reduce the bias existing in the workforce when making careful decisions on the safety signs' comprehensibility 63,64 . In this study, a database of socio-demographic factors and cognitive sign feature measurements was captured and utilized for safety sign comprehension prediction.
Effects of user factors and cognitive sign features. As expected, sign comprehensibility depended on age, education level, and work experience. The present study showed that adult and middle-aged construction workers have much better perceptual performance than their older colleagues. The lower comprehensibility score in older adults (> 55 years) could be attributed to reduced attention and information-processing abilities 65 . Our results supported the previous work of Akple et al., indicating that people with a university or higher education level possess better sign comprehension than participants with an education level of less than high school 66 . Work experience, as another attribute, was also related to safety signs comprehensibility. Investigations into construction safety signs and road warning signs are consistent with our findings, suggesting that work experience can improve comprehension performance by increasing the frequency of encountering, and familiarity with, safety signs 6,67 .
In this study, the average scores of the five cognitive features were relatively close to each other but varied greatly from sign to sign. In line with this finding, the studies by Saremi et al. and Ahmadi et al. on pharmaceutical pictogram comprehensibility showed that the cognitive sign features differ widely from sign to sign 36,68 . For the "familiarity" feature, sign P5 (no smoking) was the most familiar sign and sign E12 (emergency stop) was rated as the least familiar sign, probably because the P5 sign is commonly seen in workplaces and public areas. For the "concreteness" rating, sign M3 (wear safety harness) and sign P6 (no naked flames) were assessed as the least and most concrete, respectively. These results were consistent with previous studies showing that concrete signs have obvious connections with the real world, while abstract signs consist mainly of shapes, arrows, and lines, and do not have such obvious connections 69,70 . Regarding sign "simplicity", P6 (no naked flames) was perceived as the simplest sign while sign W16 (overhead crane) was perceived as the most complex, implying that the perceived simplicity of a sign was related to the number of elements in the sign 71 . For sign "meaningfulness", sign E17 (fire extinguisher) and sign M3 (wear safety harness) were the most and least meaningful signs, respectively.
Determining relevant components for prediction of safety signs comprehension using machine learning paradigm. Initially, all eight features were used for classification. It was found that the top three features, i.e., familiarity, simplicity, and meaningfulness, selected by 1R and Relief-F (RF) achieved the highest classification accuracy among all the possible combinations. Thus, for a fair comparison between different feature selection techniques, the top three features selected by each of them were used for classification. It was also observed that when only the top 2 features were considered, there was a drop in classification accuracy. Hence, the top 3 features were selected for each feature selection algorithm. Results indicate that when these three features were used for classification, the accuracy of the classifier reaches 94.369% under the hold-out data division protocol, which is even higher than that obtained using all eight features. This further indicates that insignificant and irrelevant features may misguide the classifier model, thereby deteriorating its overall performance. Among the different classifiers, the K-NN classifier outperforms the others under different data division protocols, followed by medium Gaussian SVM. In line with the present study, Cahigas et al. stated that symbol familiarity was positively related to safety sign comprehension 72 . Saunders et al. suggest that safety management systems should use familiar signs as much as possible 3 . Also, the safety management unit should take responsibility for the appropriate placement of safety signs in different sections of construction sites and provide sign training to workers, with emphasis on the adverse consequences of not paying attention to the hazards that are represented by safety signs. Regarding sign simplicity, simple signs led to a higher comprehensibility score than complex signs. This finding suggests that the extraneous decorative parts of a safety sign may confound user comprehension 67 . Sign design should be simple and clear, especially when perceived at a distance 73 . Concerning sign meaningfulness, the comprehensibility scores were high for meaningful signs and low for meaningless signs, probably because meaningful stimuli are related to associated imagery and easily elicit meaning in one's mind 74 .
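To make the classification step discussed above more concrete, the following plain-Python sketch shows how a K-NN classifier could assign a comprehension label from the three selected features. It is an illustration under stated assumptions rather than the model used in the study: the function name `knn_predict`, the choice of k = 3, the Euclidean distance, and all rating values are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    # Pair each Euclidean distance with its label, then sort by distance.
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical ratings on the 0-100 scale: (familiarity, simplicity, meaningfulness).
train_X = [
    (90, 85, 88),
    (80, 75, 92),
    (85, 90, 80),
    (20, 25, 15),
    (30, 15, 25),
    (25, 30, 20),
]
# Hypothetical labels: 1 = sign comprehended, 0 = not comprehended.
train_y = [1, 1, 1, 0, 0, 0]

high_rating_prediction = knn_predict(train_X, train_y, (82, 80, 85))
low_rating_prediction = knn_predict(train_X, train_y, (22, 20, 18))
```

Because all three features share the same 0–100 rating scale, the sketch applies no feature scaling; adding features on other scales, such as age or years of experience, would normally call for normalization before computing distances.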
Using the ML approach, we have shown for the first time that the comprehension of construction safety signs can be classified and assessed regardless of the prejudice that usually exists in workforces based on exposure and previous experiences. The authors intend to extend the current study by using deep learning semantic approaches in AI to quantify subjective feedback on the comprehensibility of construction safety signs. The hope is to make the signs as general and understandable as possible to wide audiences, without bias. This study has several strengths. First, it used the standard protocols for assessing safety signs comprehensibility and cognitive sign features, as well as conventional ML algorithms, to maximize the performance improvements in terms of results and predictions. Second, to the best of the authors' knowledge, no assessment has previously been carried out to quantify safety signs comprehensibility along with an evaluation of the accuracy of different ML algorithms in predicting safety signs comprehensibility and determining its most important predictors. However, the current investigation has a few limitations to note. The most significant one is the lack of transparency that inherently characterizes black-box ML models 75 . This means that the internal logic and inner workings of these algorithms are hidden from the user, leaving a human (expert or non-expert) unable to verify, interpret, and understand the reasoning of the system 76 . To mitigate this, the current study used a series of general ML algorithms with easy-to-understand structures and a limited number of parameters that are intrinsically transparent and can be interpreted without requiring additional explanation. As Occam's razor 77 suggests, the simpler a model is, the more likely it is to work and provide a reliable outcome.

Conclusions
In this study, we used user factors and cognitive sign features to predict safety signs comprehensibility in the construction industry using 14 machine learning models. In theoretical terms, we developed ML algorithms from three different supervised machine learning categories, namely ensemble, neural network, and classical models. Various components of the ML paradigm, such as feature selection, cross-validation, classification, and performance evaluation, were also implemented and examined. This study showed the role played by familiarity, simplicity, and meaningfulness in enhancing safety sign comprehensibility. In practical terms, preventive training interventions could focus on the redesign of the actual working strategies and the adoption of engaging training methods, such as behavioral modeling in the use of machinery, to optimize the learning of safety practices and safe behaviors. However, more study is required to confirm these findings on a larger and multi-centric database of cognitive design features covering more safety signs. Large open-source databases of cognitive abilities, industrial conditions, and design components are needed in the future to evaluate the performance of machine learning techniques in guiding the comprehensibility of other safety signs. In the future, with a larger database, the performance of the techniques used in this study can be compared with that of advanced classification techniques such as deep neural networks. Generally, the use of a machine learning approach can be encouraged to determine which socio-demographic factors and cognitive sign features are important for predicting safety signs comprehension in the construction industry. This would allow designers and practitioners to design construction safety signs based on the mental models approach, so that the signs convey their meaning clearly and help prevent the occurrence of construction incidents.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.