Introduction

Cheiloscopy is a field of forensic odontology dedicated to the technical analysis of the human lips1. Dating from the 1930s, this procedure is carried out in the context of human identification2. More specifically, furrows on the vermillion of the lips are assessed based on their alleged distinctive pattern3. In practice, there is speculation about the uniqueness of lip print patterns4, ethnic variability5 and sexual dimorphism6.

Human identification methods must rely on scientifically acceptable tools7, such as fingerprint, dental and genetic analyses8. Authors of cheiloscopy studies suggest that the analysis of lip prints can support the identification process by narrowing down potential victims based on sex9. The contemporary scientific literature on cheiloscopy is vast and growing over time10,11,12,13,14,15. One of the so-called advantages of lip prints relies on the alleged unique patterns of furrows that will not repeat between different persons9. Authors also claim that lip prints can be found in crime scenes, especially on cigarettes, napkins and glasses9. Additionally, the literature points out that most criminals are currently aware of fingerprint analysis and how to avoid leaving such traces in a crime scene; their attention and concern, however, are not the same when it comes to lip prints9. Clear-cut furrows that run partially or completely across the lips seem to compose the most prevalent patterns of lip prints, but most of the prevalence studies are restricted to samples that are not even locally representative4. Reliable estimates of the presence of lip prints in crime scenes do not exist, but authors progressively endorse this biological trace as “frequent”8. Soon, studies on cheiloscopy will populate the scientific literature in forensic science and eventually this technique will be presented in Court as a means to collect and analyse evidence. It is the role of science to carry out the scrutiny to (I) test the technique, (II) expose it to peer review, (III) calculate error rates, (IV) promote standardization, and (V) present it to the scientific community to verify whether the technique is acceptable, all steps inherent to the Daubert standard.

Considering the existing gap reflected by the uncertainty that surrounds the usefulness of lip print patterns and the urgent need to promote evidence-based science, this study was designed to screen the scientific literature with a systematic approach to find out the real value of cheiloscopy for sexual dimorphism. Prevalence rates of lip print patterns and diagnostic accuracy were the targeted qualitative and quantitative outcomes of interest.

Materials and methods

Protocol and registration

This systematic review was performed according to (1) the PRISMA guidelines (Preferred Reporting Items for Systematic Reviews and Meta-Analyses)16, (2) the PRISMA standards for Diagnostic Test Accuracy17 and (3) the JBI Manual for Evidence Synthesis18. The research protocol was submitted for registration at the PROSPERO database.

Focused question

The systematic review followed the acronym PIRD, which stands for population (P), index test (I), reference test (R) and diagnosis of interest (D). The guiding research question was: “Is there evidence to determine the biological sex (diagnosis of interest/reference test) of patients free of pathological and/or genetic changes of the lips (population) using cheiloscopy (index test)?”.

Eligibility criteria

Only observational (cohort, case–control and cross-sectional) and diagnostic test accuracy studies were included. No restriction was applied regarding the year or language of publication. The exclusion criteria consisted of studies lacking evident information about the technique used for cheiloscopy, cadaver studies and studies with individuals that had genetic/pathologic alterations of the lip.

Data source and search

The systematic search was performed in August 2020. The primary data sources were Embase, LILACS, PubMed (including MEDLINE), SciELO, Scopus and Web of Science. To reduce publication bias, OpenThesis, OpenGrey and Open Access Theses and Dissertations (OATD) were used as data sources to partially retrieve the grey literature.

Medical Subject Headings (MeSH), Descriptors in Health Sciences (DeCS) and Emtree (Embase Subject Headings) terms were combined by the Boolean operators AND/OR to build search strings (Table 1). Search terms were adapted for each database.

Table 1 Strategies for database search.

Study selection

Initially, studies were identified through a literature search in each of the databases and imported into EndNote Web (Thomson Reuters, Toronto, Canada) (https://www.myendnoteweb.com) to remove duplicates. The remaining studies were listed in Microsoft Word 2016 (Microsoft Ltd, Washington, USA) to manually remove any remaining duplicates. Next, a training exercise was carried out so that the reviewers could achieve proper agreement in the following phases. The reviewers analyzed 20% of the studies based on the eligibility criteria. The target agreement rate was at least 81% (Kappa ≥ 0.81). After training, the reviewers performed study selection based on title reading (they were not blinded to the authorship and year of publication). The next phase consisted of abstract reading and systematic selection. Studies without abstracts available were not excluded in this phase. Finally, the selected studies underwent full-text reading. Studies excluded in this phase had their reason for exclusion registered separately. Throughout the study selection process, a third reviewer was consulted to resolve any disagreement between the two reviewers.
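The agreement threshold used in the training exercise can be illustrated with a minimal sketch of Cohen's kappa for two reviewers. The function name `cohen_kappa` and the include/exclude decisions below are hypothetical and are not data from this review; the review does not state which software was used to compute Kappa.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same items (categorical labels)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must judge the same non-empty set of studies")
    n = len(rater_a)
    # Observed agreement: share of items with identical decisions.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal frequencies per category.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical title-screening decisions for ten studies (illustration only).
reviewer_1 = ["include", "exclude", "exclude", "include", "exclude",
              "exclude", "include", "exclude", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "include", "exclude",
              "exclude", "exclude", "exclude", "exclude", "exclude"]

kappa = cohen_kappa(reviewer_1, reviewer_2)
training_passed = kappa >= 0.81  # threshold stated in the text
print(f"kappa = {kappa:.3f}, training passed: {training_passed}")
```

In this hypothetical example a single disagreement in ten decisions gives 90% raw agreement but a kappa below 0.81, which shows why a chance-corrected threshold is stricter than a raw agreement rate.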

For studies whose full text could not be retrieved, the authors were contacted by e-mail with a request for the full text. Additional support was obtained from the Brazilian Program of Bibliographic Commutation (COMUT) and from the Brazilian Institute of Information on Science and Technology (IBICT). For studies published in languages other than English, Portuguese and Spanish, the full text was translated.

Data extraction

Data extraction was performed by two examiners independently. A template Microsoft Office Excel (Microsoft Ltd, Washington, USA) sheet was used to assure standardized data extraction. The following data were extracted: (I) identifying information—authorship, year and country of publication of the eligible studies; (II) sample profile—size, age interval, sex distribution and geographic region of origin; (III) cheiloscopy-related data—technique used for analysis, general and sex-related lip print patterns, and sensitivity and specificity of cheiloscopy for sexual dimorphism. Data extraction was supervised by a third reviewer and a forensic odontologist.

The corresponding authors were contacted by email (up to three times over two weeks) to obtain relevant information in case of missing or unclear data.

Risk of bias

The risk of bias and the individual methodological quality of the eligible studies were assessed by means of the JBI Critical Appraisal tools for observational cross-sectional19 or diagnostic test accuracy20 studies. Following PRISMA16, two reviewers assessed the risk of bias. Disagreement between reviewers on any of the questions within the JBI tool was resolved by a third examiner.

The percentage of positive answers to the questions led to the final score of the studies. Studies that scored up to 49% of positive answers were classified as “high risk of bias”. Studies with positive answers between 50 and 69% were classified as “moderate risk of bias”, while studies that scored positive answers above 70% were classified as “low risk of bias”.
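The scoring rule above can be written as a short sketch. The function name `risk_of_bias` is hypothetical. The text does not say how scores between 69% and 70%, or exactly 70%, were handled; the sketch assumes that anything from 70% upward counts as low risk and anything below 50% as high risk, and it assumes that only applicable questions enter the denominator.

```python
def risk_of_bias(positive_answers, applicable_questions):
    """Classify a study from its share of positive answers on a JBI checklist.

    Assumptions (not stated in the text): scores >= 70% are treated as low risk,
    scores < 50% as high risk, and non-applicable questions are excluded from
    the denominator.
    """
    if applicable_questions <= 0:
        raise ValueError("at least one applicable question is required")
    if not 0 <= positive_answers <= applicable_questions:
        raise ValueError("positive answers must lie between 0 and the number of questions")
    percentage = 100.0 * positive_answers / applicable_questions
    if percentage < 50.0:
        return "high risk of bias"
    if percentage < 70.0:
        return "moderate risk of bias"
    return "low risk of bias"


# Hypothetical study with 5 positive answers out of 8 applicable questions (62.5%).
print(risk_of_bias(5, 8))
```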

Summary measures

The outcomes were explored by means of descriptive analysis and were presented in narrative tables. The prevalence of lip print patterns was reported according to sex and compared between males and females. More specifically, this analysis was performed using a meta-analytical approach of proportions, in which combined prevalence estimates for males and females were obtained using random effects and the Freeman-Tukey double arcsine transformation to stabilize the variances21. The heterogeneity between groups was estimated to assess the differences of lip print patterns between males and females. A separate meta-analysis was fitted for each combination of lip print pattern, lip side (right/left) and lip position (upper/lower). Studies with missing information about lip print pattern, lip side and lip position were not included in the meta-analysis. The meta-analysis was performed separately for the two predominant techniques found in the systematic literature review: Suzuki & Tsuchihashi (1970) and Renaud (1973).
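As an illustration of the variance-stabilizing step, the sketch below applies the Freeman-Tukey double arcsine transformation to a single proportion. The function names and the example counts are hypothetical, and the back-transformation shown is the simple sine-squared approximation rather than whatever inversion the statistical software applied, so the sketch should not be read as a reproduction of the authors' analysis.

```python
import math


def freeman_tukey(events, n):
    """Freeman-Tukey double arcsine transform of a proportion events/n.

    Returns the transformed value and its approximate variance 1 / (n + 0.5).
    """
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("require n > 0 and 0 <= events <= n")
    t = (math.asin(math.sqrt(events / (n + 1)))
         + math.asin(math.sqrt((events + 1) / (n + 1))))
    variance = 1.0 / (n + 0.5)
    return t, variance


def back_transform_simple(t):
    """Approximate inverse of the transform (assumption: simple sin^2 form)."""
    return math.sin(t / 2.0) ** 2


# Hypothetical example: 30 of 100 participants show a given lip print pattern.
t_value, t_variance = freeman_tukey(30, 100)
approx_prevalence = back_transform_simple(t_value)
print(f"t = {t_value:.4f}, variance = {t_variance:.5f}, prevalence ~ {approx_prevalence:.3f}")
```

The transformation keeps the variance dependent on the sample size only, which is why patterns with a prevalence near 0% can still be pooled.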

The diagnostic accuracy of the cheiloscopy technique for sexual dimorphism was tested separately for males and females. The absolute number of correct matches and mismatches between reference and target lips was extracted from each eligible study and a random-effects meta-analysis was fitted. To avoid the exclusion of studies that reported zero matches or mismatches, a continuity correction of 0.5 was applied in these cases. Studies that provided the number of hits and errors for males and females separately were included in a meta-analysis evaluating whether the accuracy of cheiloscopy differed between males and females. To assess that, the odds ratio for identifying males compared to females was calculated, indicating whether the method was more or less accurate for sexual dimorphism among males compared to females.
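The odds ratio and the continuity correction described above can be sketched for a single study as follows. The function name and the counts are hypothetical. The sketch assumes that the 0.5 correction is added to all four cells whenever any cell is zero and that a Wald interval on the log scale is used; the text does not specify either detail.

```python
import math


def odds_ratio_male_vs_female(hits_m, errors_m, hits_f, errors_f, correction=0.5):
    """Odds of a correct sex classification in males relative to females.

    Assumption: if any cell is zero, `correction` is added to every cell.
    Returns (odds ratio, lower 95% limit, upper 95% limit).
    """
    cells = [hits_m, errors_m, hits_f, errors_f]
    if any(c < 0 for c in cells):
        raise ValueError("counts must be non-negative")
    if any(c == 0 for c in cells):
        cells = [c + correction for c in cells]
    a, b, c, d = cells
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = 1.959964  # two-sided 95% normal quantile
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical study: 40 hits and 10 errors in males, 30 hits and 20 errors in females.
or_value, ci_low, ci_high = odds_ratio_male_vs_female(40, 10, 30, 20)
print(f"OR = {or_value:.2f} (95% CI {ci_low:.2f}; {ci_high:.2f})")
```

An odds ratio above 1 would indicate that males are classified correctly more often than females, and a value below 1 the opposite.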

For meta-analyses that included at least 10 studies, publication bias was investigated through Egger’s test by a linear regression of the effect measure on the size of the study22. Statistical analyses were performed with Stata version 16.1 (StataCorp LLC, College Station, TX, USA) software. Significance level was set at 5%.
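A minimal sketch of an Egger-type regression is given below. It uses the classical formulation, in which the standardized effect (effect divided by its standard error) is regressed on precision (the reciprocal of the standard error) and the intercept measures funnel-plot asymmetry. This formulation, the function name and the input values are assumptions for illustration; the statistical software may parameterize the test differently, and the sketch stops at the t statistic rather than computing a p-value.

```python
import math


def egger_intercept(effects, standard_errors):
    """Ordinary least squares of effect/SE on 1/SE.

    Returns (intercept, standard error of the intercept, degrees of freedom).
    An intercept far from zero suggests small-study effects.
    """
    k = len(effects)
    if k != len(standard_errors) or k < 3:
        raise ValueError("need matching inputs from at least three studies")
    x = [1.0 / se for se in standard_errors]
    y = [e / se for e, se in zip(effects, standard_errors)]
    x_bar = sum(x) / k
    y_bar = sum(y) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    residual_var = residual_ss / (k - 2)
    se_intercept = math.sqrt(residual_var * (1.0 / k + x_bar ** 2 / sxx))
    return intercept, se_intercept, k - 2


# Hypothetical effects in which smaller studies (larger SE) report larger values.
ses = [0.1, 0.2, 0.25, 0.5]
biased_effects = [0.5 + 2.0 * se for se in ses]
intercept, se_intercept, df = egger_intercept(biased_effects, ses)
print(f"Egger intercept = {intercept:.3f} with {df} degrees of freedom")
```

The example uses only four hypothetical studies to keep the arithmetic visible; the review applied the test only to meta-analyses with at least 10 studies.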

Certainty of evidence (GRADE approach)

Certainty of evidence and strength of recommendation were assessed with the Grading of Recommendation, Assessment, Development, and Evaluation (GRADE) approach. According to this system, diagnostic accuracy studies start at a high level of certainty and can be downgraded based on risk of bias, inconsistency, indirect evidence, imprecision, and publication bias. The level of certainty among the identified evidence was characterized as high, moderate, low, or very low23.

Results

Study selection

The first phase of study selection resulted in 3,977 studies throughout the nine electronic databases. After removing duplicates, the remaining number of studies was 2,956. Exclusions based on title and abstract reading reduced the sample to 98 studies eligible for full-text reading. Six studies did not fulfill the inclusion criteria (Appendix 1), and full texts were not found for twenty studies, even after trying to contact the authors or libraries. Finally, a total of 72 studies were selected for qualitative analysis1,2,4,5,6,10,11,12,13,14,15,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84. Quantitative analysis of the accuracy of cheiloscopy for sexual dimorphism included seven studies1,5,25,48,49,54,80, and 17 studies10,11,14,28,30,32,34,36,38,42,43,51,56,60,61,63,82 were considered in the analyses of the prevalence of lip print patterns (Fig. 1).

Figure 1 Flowchart diagram, following PRISMA, describing the quantity of studies filtered from identification to the final inclusion in the qualitative and quantitative (meta-) analyses.

Characteristics of eligible studies

The studies were published between 1982 and 2019, and were from India (n = 52)1,4,5,6,10,11,12,13,14,15,25,27,30,31,35,36,37,38,39,40,43,44,45,46,48,49,50,53,54,55,56,57,59,60,61,62,64,66,67,68,69,70,71,72,73,74,75,78,79,81,83,84, Egypt (n = 3)2,42,58, Brazil (n = 3)26,34,76, Portugal (n = 2)32,51, Pakistan (n = 2)47,77, Colombia (n = 2)29,52, Nepal (n = 2)33,82, France (n = 1)24, Iran (n = 1)63, Romania (n = 1)41, Croatia (n = 1)65, Saudi Arabia (n = 1)28 and Poland (n = 1)80. The total sample of participants across studies was 22,965. The age of the participants ranged from 1 to 83 years (Table 2). Fourteen studies did not describe the ethical aspects adopted in the study. None of the cross-sectional studies reported the STROBE checklist as the guideline of choice.

Table 2 Main characteristics of eligible studies.

Sixty-four studies1,2,4,5,6,10,11,12,13,14,15,25,26,27,30,31,32,33,34,35,36,37,38,39,40,41,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,61,62,63,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,81,82,84 used the technique of Suzuki and Tsuchihashi (1970), four studies28,29,42,60 used Renaud’s (1973) technique, one study24 used Fauvel’s (1985) technique, one study64 used Nagasupriya’s (2011) technique, and one study80 combined the techniques of Suzuki and Tsuchihashi (1970), Renaud (1973), and Vahanwala (2000). One study83 did not report which technique was used. In general, twenty-four studies (33%)12,14,24,26,28,31,35,42,43,44,47,51,53,54,55,57,63,66,67,74,75,76,82 did not find evidence of difference of lip print patterns between males and females, while 67%1,2,4,5,6,10,11,13,15,25,27,29,31,32,33,34,36,37,38,39,40,41,45,46,48,49,50,52,56,58,59,60,61,62,64,65,68,69,70,71,72,73,77,78,79,80,81,83,84 detected differences.

Individual risk of bias

Fifty eligible studies2,4,5,10,14,25,26,28,29,30,32,35,38,39,40,42,44,45,46,49,50,51,52,53,54,55,56,57,58,60,61,62,63,66,67,68,69,70,71,72,74,76,77,78,79,81,82,83,84 had low risk of bias, while 22 studies1,6,11,12,13,15,24,27,31,33,36,37,41,43,47,48,59,64,65,73,75,80 had moderate risk of bias (Tables 3 and 4). All the questions in JBI tool for cross-sectional studies were applicable, while three questions were not applicable in the JBI tool for diagnostic test accuracy studies.

Table 3 Risk of bias assessed by the Joanna Briggs Institute Critical Appraisal Tools for use in JBI Critical Appraisal Checklist for Diagnostic Accuracy Studies.
Table 4 Risk of bias assessed by the Joanna Briggs Institute Critical Appraisal Tools for use in JBI Critical Appraisal Checklist for Analytical Cross Sectional Studies.

Regarding cross-sectional studies, questions #5 and #6 had a negative answer in 25 studies6,11,12,13,15,24,26,27,30,31,33,35,36,39,41,43,47,50,55,57,64,66,67,68,71,73,83. These questions verify whether the study identified and avoided confounding factors, since studies should minimize the risk of bias by describing factors that could influence the process of collecting lip print evidence. In 28 studies2,6,10,11,13,15,24,26,27,30,31,33,35,36,39,41,43,47,50,55,57,64,66,67,68,71,73,83 question #7 had a negative answer. This question has a direct impact on the quality of the evidence because it verifies whether the outcomes were obtained in a reliable way. One example of a practice that supports a positive answer is minimizing bias by describing the process of intra- and inter-examiner training.

Concerning diagnostic test accuracy studies, questions #1 and #2 were marked as ‘unclear’ or ‘no’ for all studies1,5,25,48,49,54,80. The first question checked whether the sample was selected consecutively or randomly. The second question was related to the methodological design of the studies; all studies recruited participants that were already known, by other means, to have the diagnosis of interest and investigated whether the test of interest correctly identified them as such. Moreover, question #4 was marked as 'unclear' for three studies1,48,80 that did not provide details regarding blinding of the index test.

Synthesis of results

Primary outcome—accuracy for sexual dimorphism

Seven studies1,4,25,48,49,54,80 were included in the meta-analysis of the accuracy of lip prints for sexual dimorphism. These seven studies contributed nine accuracy assessments to the meta-analysis, since the study by Topczyłko et al.80 evaluated three different methods. The overall accuracy was 76.8% (95% CI = 65.8; 87.7, I² = 97%) (Fig. 2). Individual accuracy rates ranged from 52.7 to 93.5%.
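To make the pooled accuracy and the heterogeneity statistic more concrete, the sketch below pools accuracy proportions with a DerSimonian-Laird random-effects model and reports I². The choice of the DerSimonian-Laird estimator on untransformed proportions, the function names and the counts are all assumptions for illustration; they do not reproduce the data or the exact model used in this review.

```python
import math


def accuracy_and_variance(correct, total):
    """Raw accuracy proportion and its binomial variance p(1 - p) / n."""
    p = correct / total
    return p, p * (1.0 - p) / total


def pool_random_effects(estimates, variances):
    """DerSimonian-Laird pooling. Returns (pooled, lower, upper, I2 in percent)."""
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need matching inputs from at least two studies")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive (accuracy of 0 or 1 needs a correction)")
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))  # Cochran's Q
    df = len(estimates) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, pooled - 1.959964 * se, pooled + 1.959964 * se, i2


# Hypothetical studies as (correct classifications, participants); illustration only.
studies = [(53, 100), (93, 100), (77, 100)]
props, variances = zip(*(accuracy_and_variance(c, n) for c, n in studies))
pooled, lower, upper, i2 = pool_random_effects(list(props), list(variances))
print(f"pooled accuracy = {pooled:.3f} (95% CI {lower:.3f}; {upper:.3f}), I2 = {i2:.0f}%")
```

When the individual accuracies are as far apart as in this hypothetical input, I² rises well above 75%, the same order of heterogeneity as the value reported above.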

Figure 2 Overall compilation of accuracy rates across seven eligible studies that reported sufficient data for quantitative analysis.

Six out of the seven studies included in the accuracy meta-analysis provided the number of hits and errors according to the sex of the participant and were included in a meta-analysis that assessed whether the odds of correctly identifying males differed from the odds of correctly identifying females. Overall, there was no difference between males and females (OR = 0.71; 95% CI = 0.26; 1.99, I² = 85%). Only specific studies, such as Kaul et al. (2015)53 and Nagalaxmi et al. (2014)48, described differences for sexual dimorphism (Fig. 3). The first showed 77% lower odds of identifying males compared to females (OR = 0.23; 95% CI = 0.27; 0.31), while the second showed sixfold higher odds of identifying males compared to females (OR = 6.00; 95% CI = 1.17; 30.72). One study80 did not report samples divided by sex and was not included in the analysis.

Figure 3 Odds ratio depicting the accuracy of cheiloscopy for distinguishing males from females. Random-effects model applied within six eligible studies.

Secondary outcome—prevalence of lip prints

According to the technique of Suzuki and Tsuchihashi (1970), lip print pattern type 2 was the most prevalent (> 30%), while type 5 was the rarest pattern (< 3%) (Table 5). Sex differences based on prevalence rates were not detected. Publication bias was identified for studies analyzing lip print type 1’ for the upper and lower dental arches on the right side, for lip print type 4 for the upper arch on the left and right sides, and for lip print type 4 for the lower arch on the right side.

Table 5 Lip pattern prevalence according to sex and dental arch for Suzuki and Tsuchihashi’s method for cheiloscopy classification.

Sex differences were not observed using Renaud’s (1973) technique. According to this technique, the most prevalent pattern was type C (> 12%), while type I was the least prevalent (< 1%) (Table 6).

Table 6 Lip pattern prevalence according to sex and dental arch for Renaud’s method for cheiloscopy classification.

Certainty of evidence

The GRADE approach showed low certainty of evidence. The limiting aspects were the lack of consistency between the estimated effects and the lack of overlap of confidence intervals, evidenced by the high heterogeneity between the included studies (Table 7).

Table 7 Grading of Recommendations Assessment, Development, and Evaluation (GRADE) summary of findings table for the outcomes of the systematic review and meta-analysis.

Discussion

Dental analysis, within forensic dentistry, figures as an alternative for human identification especially because of the resistance of human teeth to high temperature and cadaveric alterations85. Over time, several forensic applications were studied for the use of dental/oral evidence. Apart from human identification, bite mark analysis86; anthropological estimation of age87, sex88, stature89 and ancestry90; rugoscopy91 and cheiloscopy92 currently represent fields of forensic odontology. While some fields developed with strong scientific basis and broad legal acceptance (i.e. human identification), other fields remained controversial and lacked high-level evidence-based confirmation; this is the case of cheiloscopy. From the perspective of forensic practice, the alleged contribution of cheiloscopy relies on the possibility of retrieving identifying information (such as sex) about a suspect from visible or latent lip prints left in a crime scene93. Two main controversies might arise from cheiloscopy: (I) in crime scene investigations, the existing lip print left on objects or other surfaces could provide stronger evidence toward human identification through DNA extraction than through comparative analysis of furrows; (II) studies on cheiloscopy are generally observational, cross-sectional and with questionable settings that include different techniques, underlying surfaces and registration materials (e.g. lipsticks and powdered metals). In this scenario, several questions are pertinent: Why is the scientific literature so rich in studies on cheiloscopy for sexual dimorphism? How often is cheiloscopy used by forensic dentists in practice? But especially (as claimed in many studies): Is cheiloscopy really useful to distinguish males and females in forensic dentistry?

To date, there is no antemortem database of lip patterns worldwide (even in clinical dentistry). Moreover, registering the lips with photographs or other tools is rare, so the application of cheiloscopy for human identification is limited from the beginning. Striving for sexual dimorphism could be an interesting asset to the armamentarium of forensic dentists, but again the application in practice is relative, especially because dental human identification is mainly necessary in challenging cases that involve charred bodies and skeletal remains94, in which lips are usually destroyed. Additionally, sexual dimorphism should be accomplished from body structures scientifically known for their anthropological reliability, namely the pelvic bones and skull95.

The evidence brought through the present systematic review was extracted from 72 studies that sampled 22,965 individuals. Out of the studies, 70% (n = 52)1,4,5,6,10,11,12,13,14,15,25,27,30,31,35,36,37,38,39,40,43,44,45,46,48,49,50,53,54,55,56,57,59,60,61,62,64,66,67,68,69,70,71,72,73,74,75,78,79,81,83,84 were from India. At first sight, the quality of studies was not bad when it comes to assessment of the risk of bias (nearly 70% had low risk of bias). These outcomes combined with the general quantification of the studies that detected sex differences based on lip pattern (67%) could lead to dangerous interpretations from readers that are not familiar with systematic reviews. A deeper look at the quantified outcomes of the most prevalent techniques (Suzuki & Tsuchihashi, 1970, n = 64, 88%; Renaud et al., 1973, n = 4, 5%), however, depicts a lack of statistical significance (p > 0.05) for each lip pattern between males and females. The analysis performed per pattern clarifies the scenario, as most of the studies in the field only test sexual dimorphism by comparing generalized (combined) patterns within sex groups (males vs. females). Further on, the limitations of cheiloscopy for sexual dimorphism are corroborated by GRADE assessment outcomes, which pooled seven studies (10% of selected studies) and 1,547 participants to clearly point out high heterogeneity (> 75%). The heterogeneity might be justified mainly because none of the 72 observational eligible studies reported data using scientifically established guidelines, namely STROBE. The resulting analysis via GRADE suggested low level of general quality and critical level of importance. Considering the diagnostic accuracy of cheiloscopy, mean outcomes point to 76%, which indicates that roughly one in every four analyses of sexual dimorphism through lip patterns will have a wrong classification. Stronger outcomes would necessarily require a higher level of accuracy and a lower level of heterogeneity across studies. In summary, the eligible studies screened and assessed in the present systematic review showed a good performance of cheiloscopy when analyzed separately; in deeper analyses, however, especially per lip pattern within the techniques, no evident differences were detected between males and females. The limitation of cheiloscopy is, therefore, corroborated by the final quantitative assessment via GRADE.

To date, the alleged contribution of cheiloscopy in forensic dentistry is merely superficial and highly relative. The potential error within the diagnostic accuracy of cheiloscopy would be close to 25%; in other words, nearly 386 of the 1,547 participants sampled in the quantitative part of this review would have their sex wrongly classified. Forensic dentistry itself is already a relative tool for human identification (not necessarily applicable in every single autopsy). In general, charred victims and skeletal remains are the main scenarios for a forensic odontologist. Authors might claim lip print applications to narrow disaster victim identification lists by sex, but in most of these cases bodies are not intact. If cheiloscopy studies are to improve in the future, authors are encouraged to design more advanced analyses of the morphology of the human lips to the point of having enough evidence to support the development of clinical databases and protocols for lip recording. From the perspective of forensic practice, this systematic review does not encourage the use of cheiloscopy as the sole tool for sexual dimorphism.
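The figure of nearly 386 participants follows from applying the rounded 25% error rate to the 1,547 pooled participants. The arithmetic can be sketched as below; the function name is hypothetical. For comparison, the same calculation is repeated with the pooled accuracy of 76.8% reported in the Results, which gives a somewhat lower count.

```python
def expected_misclassified(sample_size, accuracy):
    """Expected number of wrong sex classifications for a given accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a proportion between 0 and 1")
    return sample_size * (1.0 - accuracy)


pooled_participants = 1547  # participants in the accuracy meta-analysis

# Rounded error rate of 25% used in the Discussion.
with_rounded_rate = expected_misclassified(pooled_participants, 0.75)

# Pooled accuracy of 76.8% reported in the Results.
with_pooled_accuracy = expected_misclassified(pooled_participants, 0.768)

print(f"25% error rate: {with_rounded_rate:.2f} participants")
print(f"76.8% accuracy: {with_pooled_accuracy:.2f} participants")
```

Either way, the expected number of misclassified participants remains in the hundreds for a sample of this size.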

Conclusion

After revisiting 72 eligible studies with a pooled sample of 22,965 individuals, this systematic review revealed weak foundations for the use of lip print analysis for sexual dimorphism in forensic dentistry. The reduced pooled sample included in the meta-analysis showed an average rate of wrong sex classification of nearly 25%. The studies were highly heterogeneous, as none of them followed proper EQUATOR guidelines for structuring methods and reporting data. GRADE analysis confirmed the low certainty of evidence, suggesting that cheiloscopy is not a reliable tool in practice when it comes to sexual dimorphism.