Introduction

Safety of clients’ money and data (e.g. transactions) is at the heart of banking culture and reputation. As one of the instruments to safeguard clients, banks use polygraph screenings (PS). These are performed on hiring candidates to avoid hiring untrustworthy people. To detect an infringement early, employees with sensitive roles are screened regularly. The PS topics include drug abuse, gambling addiction, insider trading, disclosure of confidential information, bribery, corruption, and misappropriation and fraud (sample screening questions are in Suppl. Table 2). The finance industry is not alone in applying PS; other examples include critical sectors such as aviation, manufacturing companies, and federal law enforcement agencies throughout the world1,2.

The classical polygraph is a device that records cardiovascular activity (such as heart rate), thoracic and abdominal respirations, galvanic skin response (a.k.a. electrodermal activity, or EDA), and tremor. An examiner asks questions of, and accepts «yes» or «no» answers from, the person being screened (examinee). There are many good overviews of classical polygraph and questioning methods3,4,5.

Unorthodox lie detection studies analyze video and audio6 (including facial expressions7,8, pupil reaction9, and delays between question and answer10), electromyography (EMG)11, electroencephalogram (EEG)12, magnetic resonance tomography (MRT)13,14, or writing pattern (keystroke dynamics)15 in addition to or instead of classical polygraph data. Some of these studies have even been piloted in new fields, such as the iBorderCtrl lie detector pilot in EU airports16,17 or the VeriPol deception detection pilot by Spanish police on written insurance claims18,19. Yet, in the traditional fields, we are unaware of any cases where classical polygraphs have been substituted with unorthodox systems. The classical polygraph remains the instrument of choice in traditional areas such as hiring screening and criminal and internal investigations.

The polygraph has a long history of drawing criticism from psychology20,21 and law scientists22, as well as from the public and the state1,23. A major concern is that the method does not reliably distinguish lies from truth. And yet, “paradoxically, although Congress expressed deep concerns about the efficacy of the technology, the EPPA permits the use of lie detectors in circumstances in which the accuracy of the results is of paramount importance: national defense, security, and legitimate ongoing investigations”22.

Critical related work provides many arguments for why polygraph screening may fail to detect a lie or may mark a truthful answer as a lie. For example: “Polygraph tests do not assess deceptiveness, but rather are situations designed to elicit and assess fear”24. A truthful junior manager may fear being called a corruptor more than a cold-blooded, corrupt senior manager fears being caught lying by a polygraph examiner. Another example of constructive critique is a grounded call for standardization of polygraph screening procedures and examiner education25. Of all these concerns, in this paper we tackle only one: the need for quality assessment (QA) of examiner work. Examiner errors happen, for example, when an examiner is inexperienced, exhausted or distracted, or biased26.

A simple QA solution exists: always have another examiner review the screening and confirm or disprove the conclusions of the original examiner27. To QA a polygraph examiner’s report, another examiner needs to review the recording of the screening, including the polygram (a graphical representation of recorded sensor data coupled with the examiner’s questions and the examinee’s answers) and sometimes the audio and video recording, and to compare their conclusions with the original report. In our experience, QA takes at least half the time it took to perform the screening, and an average screening takes at least two hours. Thus, QAs are costly in terms of both time and money. For this reason, and to the best of our knowledge, industrial internal security departments QA screenings infrequently or not at all. We also note that having another examiner QA all screenings is not a bulletproof solution. Some examiner mistakes come not from the examiner’s bias or fatigue, but from the fact that the case is hard. In hard cases, the second examiner may make the same mistake the original examiner did.

The overview of our experimental framework is as follows. Our main approach is to train a binary classifier model and apply it to each of the 2094 real screenings in our possession to see if the model score contradicts the examiner conclusion. To avoid applying the model to screenings that it saw during training, we use standard stratified fivefold cross-validation. We hypothesize that the model will not learn to reproduce the errors that human examiners make, because the share of examiner errors is minor. We also decided to deviate from related work by not implementing polygraph examiner rules as features. We did this to avoid our model being trapped in the same way that human examiners are trapped, since some rules are disputable and have exceptions. Our secondary experimental goals are as follows:

  1. Ideate and test novel features that would uplift the AUC of the models. In particular, consider features of a novel (not physiological) nature, such as job description and magnetic storms.

  2. Build models for each screening topic individually, to see if this uplifts the quality and whether the AUCs of the models differ from topic to topic.

The primary benchmark of our experiments is whether we find real examiner errors in the screenings flagged by the models we trained. We implement this benchmark by handing the screenings flagged by the model as candidate examiner errors to human examiners for verification. The secondary benchmark, which we used during the modelling process, is the AUC of the models. This secondary benchmark reflects only indirectly how good the models are at catching examiners’ errors, because AUC is calculated on noisy targets containing these unknown errors. The third benchmark is the best AUC of most related work (0.85 by Slavkovic4). This AUC can be considered an upper bound, because it is obtained on criminal investigation polygraph data, known to have much more predictive power than job screenings3.

Here we report on devising and field-testing an ML tool to QA examiner reports for PS performed on a classical polygraph. A small number of reports, marked suspicious by this tool, will be handed to another examiner for QA. Such a tool would allow for semi-automatic double-checking of all new and historical reports, without hiring additional examiners. An additional advantage of such a tool is that if examiners know that all their work will be QA-ed, they will make decisions more carefully.

Our results neither justify nor solidify the practice of classical polygraph screenings. Rather, we consider our results as a temporary and partial patch that helps to eliminate a specific type of error of this method, until better methods are devised and put into practice. More broadly, we believe we make a step towards rethinking classical polygraph practices.

Below we describe the steps, from a basic model to a validation of the final model, that succeeded at exposing real examiner errors in historical field screenings.

Results

Basic second-opinion model on examiner conclusions

We built a basic second-opinion model by training it on the historical data of 2094 field polygraph screening recordings (PSRs), including the Deception Indicated (DI) attributes set by the examiners who conducted the screenings. The intended use is to raise a red flag whenever an examiner conclusion contradicts the conclusion inferred by the model.
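As an illustration of this intended use, below is a minimal Python sketch of the red-flag logic, assuming out-of-fold model scores are already available; the column names, example values, and threshold are hypothetical, not our production code.

```python
import pandas as pd

# Hypothetical table: one row per (screening, topic) conclusion, with the examiner's
# label (1 = DI, 0 = NDI) and the model's out-of-fold probability of DI.
conclusions = pd.DataFrame({
    "screening_id": [101, 101, 102, 103],
    "topic": ["drug abuse", "corruption", "drug abuse", "corruption"],
    "examiner_di": [0, 0, 1, 0],
    "model_score": [0.91, 0.12, 0.88, 0.74],
})

THRESHOLD = 0.8  # illustrative; in the pilot the threshold depends on examiner workload

# Red flag: the examiner concluded NDI while the model votes strongly for DI.
red_flags = conclusions[(conclusions["examiner_di"] == 0)
                        & (conclusions["model_score"] >= THRESHOLD)]
print(red_flags[["screening_id", "topic", "model_score"]])
```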

We present the quality metrics of the basic model in Table 1a, in the column «all topics». Our major quality metrics are ROC AUC and TPR at an FPR of 0.05. We note that we are forced to use these indirect quality metrics, because they only measure how well the model mimics the conclusions of the examiners. In fact, counter-intuitively, and contrary to the goals of related work, we do not want a perfect model predicting examiner conclusions in up to 100% of screenings, because then we would flag no candidates for examiner errors. In practice, we are interested in validating that the model detects erroneous examiner conclusions, of which at this stage we had absolutely no knowledge. Tables 2 and 3 present the importance of each of the features based on 10 raw polygraph signals and the age and sex of the examinees; the details of second-level feature construction are described in “Methods”. Figure 1 depicts where the model and an examiner’s conclusions agree and disagree, depending on the model score.
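For reference, TPR at a fixed FPR can be read off the ROC curve; the following sketch shows one way to compute both metrics with scikit-learn, using random stand-in labels and scores rather than our screening data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)            # stand-in DI/NDI labels
y_score = 0.3 * y_true + 0.7 * rng.random(1000)   # stand-in model scores

auc = roc_auc_score(y_true, y_score)
fpr, tpr, _ = roc_curve(y_true, y_score)
tpr_at_fpr_005 = np.interp(0.05, fpr, tpr)        # TPR at FPR = 0.05
print(f"ROC AUC = {auc:.3f}, TPR@FPR=0.05 = {tpr_at_fpr_005:.3f}")
```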

Table 1 Three models inference each of seven screening topics, measured using stratified K-fold cross-validation.
Table 2 Feature importance of the first-level basic model.
Table 3 Feature importance of the second-level basic model.
Figure 1 Distribution of basic model scores for topics: (a) “Drug abuse”; (b) “Corruption”.

We also decided to measure the quality of the model applied to each of seven screening topics (Table 1a). To the best of our knowledge, we are the first to report the model quality on separate topics within a standard employee screening, and this approach will help us as shown below.

Using alternative data to improve the model quality

Measuring the performance of a model costs approximately 100 examiner hours, because it requires a real examiner to thoroughly QA dozens of PSRs flagged by the model. In case of failure, i.e. finding no examiner errors in the flagged screenings, no second chance would be given to spend another 100 h of this limited resource. We tried to maximize our one chance by doing our best to increase the quality of the basic model before flagging suspicious conclusions for manual QA.

We hypothesized that information about geomagnetic storms on Earth28 and the weather conditions in the city on the date of the screening may help the model to predict. The intuition behind this assumption is that, during storms and under different weather conditions, humans might behave slightly differently, resulting in slightly different raw physiological measurements, or the sensors might provide slightly shifted measurements, or both.

We also collected examiner ID, hoping that these data may be of help to the model, because different examiners might provoke slightly different physiological reactions in examinees, or the polygraph devices assigned to each examiner might have slightly different signal measurement deviations. We also collected roles (e.g. job positions) of the examinees, because people of different education and training may tell the truth and lie differently.
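A minimal sketch of how such alternative data can be joined to screening-level rows before feature construction is given below; the table and column names (city, kp_index, etc.) are hypothetical and serve only to illustrate the join keys (screening date and city).

```python
import pandas as pd

# Hypothetical screening-level rows and external sources keyed by date and city.
screenings = pd.DataFrame({
    "screening_id": [1, 2],
    "date": pd.to_datetime(["2021-03-01", "2021-03-02"]),
    "city": ["A", "A"],
    "examiner_id": ["E07", "E12"],
    "job_role": ["cashier", "trader"],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-02"]),
    "city": ["A", "A"],
    "temperature": [-3.0, 1.5],
    "pressure": [748, 752],
})
storms = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-02"]),
    "kp_index": [2, 5],  # planetary geomagnetic activity index
})

# Left-join the alternative data onto the screening rows.
alt = (screenings
       .merge(weather, on=["date", "city"], how="left")
       .merge(storms, on="date", how="left"))
print(alt)
```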

Table 1b presents the basic model re-trained with these alternative data, and Table 4 shows the importance of these features (a full description of all features is in Suppl. Table 6). The uplift each data source provides over the model based only on physiological signals is shown in Table 5.

Table 4 Feature importance of the second-level basic model with alt data.
Table 5 Uplifts of AUC from each alternative data source for entire dataset, and for each topic.

All alternative data types uplifted the quality of the model; however, we decided to keep only age, sex and job roles for production (Table 1c). Weather showed an anomalously high uplift and importance, and we feared that this was because, for technical reasons, our dataset is highly unbalanced in the percentage of DI labels per city. To exclude city bias, we restricted the dataset to one city, but weather still ranked high in feature importance. Thus, we believe weather is significant alternative data. However, on the full dataset, weather could leak city information, and the model could pick up the city bias from the unbalanced dataset. While examiner ID provided a moderate uplift, the nature of this feature requires further investigation before relying on it in production. For example, if it is not the examiner ID but the examiner’s polygraph device ID that helps inferencing, then scoring will be wrong when an examiner changes devices.
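The single-city check can be sketched as follows; the file name, column names, and model choice are hypothetical, and all feature columns are assumed to be numeric.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical second-level training table: one row per (screening, topic) with
# aggregated first-level scores, weather features, a city column and the DI label.
df = pd.read_csv("second_level_features.csv")

one_city = df[df["city"] == df["city"].mode()[0]]   # keep only the largest city
X = one_city.drop(columns=["label", "city"])        # numeric features assumed
y = one_city["label"]

model = GradientBoostingClassifier().fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(10))  # do weather features still rank high?
```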

A model built for one topic performs marginally better

Here we investigate whether training a separate model for each topic results in even better quality as compared to the basic model with job position. We had enough DI labels to train on one topic only, namely drug abuse (137 DI labels). Table 6 shows that we gain +2% ROC AUC (a 6% relative uplift) if we train a model for the drug abuse topic only. This allows us to speculate that people may lie differently on different topics, and thus separating the topics makes it easier for the model to learn and to inference. More data and research are needed to confirm this hypothesis.

Table 6 A model built for “Drug abuse” topic inferences each of seven screening topics, measured using stratified K-fold cross-validation.

A model built on one topic can handle other topics with varying quality

We applied the Drug Abuse model from the previous paragraph to inferencing the other six topics (Table 6). Compared to the universal model, the performance of the Drug Abuse model varies from topic to topic. We conclude that a model trained on one topic can handle other topics, albeit with insignificant quality degradation for some topics.

Vague questions are hard not only for people and examiners, but for models too

In Table 1 we observed that the basic model performs significantly better on some topics (such as drug abuse and criminal history) than on others (such as unreported income or IRD violation). This observation is in line with a long-standing issue in the screenings: people simply cannot confidently answer questions when they are not sure about the answer. At the bank, we have hundreds of IRDs, dozens of pages each, not to mention the versioning, so some people are not sure whether they have never violated a single IRD. Similarly with unreported income: some people start asking themselves questions like «if I got cash from a relative, is this an income?», etc. As with any common knowledge lacking quantitative proof, there were heated debates about whether topics like «IRD violation» are effective or need to be more specific. Our finding helped to settle this long-running discussion in our organization.

We also tried to use this observation to improve the quality of the basic model. If a couple of topics confuse people, the basic model must also be confused by training on them. We tried to remove these confusing labels from the trainset altogether. However, and counter-intuitively, Table 7 shows that this idea did not improve the quality of the model significantly, and the inferencing quality for the confusing topics dropped or did not change. We presume we did not observe a significant positive effect because the share of confusing-topic DI labels in the trainset is insignificant.

Table 7 Basic model without “IRD violation” and “Unreported income” topics in the train set.

Topic as alternative data

The universal model we built and described above does not use topics as features for training and inference. The reason is that we sought a model that can score any screening topic, not just the seven topics we have training data for. In Suppl. Table 5 we report how using topics as additional data for building a universal model affects its quality. We can see that knowing the topic helps the model better score the «confusing» IRD topic, whereas the quality on other topics remains unchanged.

Before adding topic labels as alternative data, we balanced the dataset with regard to topics. This balancing cut the number of drug abuse DI labels fourfold. We observed in Suppl. Table 5 that this cut decreased the inferencing quality for the DA topic, which had always been an unexplained leader before. The quality on the DA topic became on par with a couple of the forerunning topics, such as corruption and criminal history. This observation explains the previous domination of the DA topic: it benefited from a significantly larger minority class (DI labels) than other topics.

Ensembling and extra data

We measured the uplift from adding 189 fresh DIs and also experimented with various model ensembling architectures. The results are displayed in Table 8. Cumulatively, ensembling and extra data lifted AUC by 5% on all topics, and by up to 11% on selected topics. Ensembling is explained in “Methods”.

Table 8 The impact of ensembling and extra data on AUC.

Validating ML-based second-opinion in the field

We now have two advanced models: a Universal model (an ensemble with alternative data) and a Drug Abuse model (a one-topic model with alternative data). Here we report the summary of the test to find examiner errors among the 2094 historical field screenings. We selected screenings where the examiner concluded NDI but a model voted strongly for DI.

Based on Drug Abuse model top scores, we selected 15 NDI examiner conclusions as candidates for examiner errors on drug abuse topic. Similarly, based on Universal model top scores, we selected 15/5/5 NDI conclusions on corruption/confidential information leak/criminal history topics. Thus, we ended up with 40 conclusions (candidates for examiner errors) in 36 screenings.
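The selection can be sketched as a per-topic quota over the highest-scoring NDI conclusions; the file and column names below are hypothetical, and the quotas mirror the numbers above.

```python
import pandas as pd

# Per-topic quotas from the text: 15 drug abuse, 15 corruption,
# 5 confidential information leak, 5 criminal history.
QUOTAS = {"drug abuse": 15, "corruption": 15,
          "confidential information leak": 5, "criminal history": 5}

# Hypothetical table: screening_id, topic, examiner_di (0 = NDI), model_score.
scored = pd.read_csv("scored_conclusions.csv")

candidates = []
for topic, n in QUOTAS.items():
    ndi = scored[(scored["topic"] == topic) & (scored["examiner_di"] == 0)]
    candidates.append(ndi.nlargest(n, "model_score"))  # highest DI scores among NDI conclusions

candidates = pd.concat(candidates)
print(len(candidates), "conclusions in",
      candidates["screening_id"].nunique(), "screenings to hand over for blind QA")
```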

We handed these 36 screenings for thorough blind QA to two examiners. The examiners did not know the results of the screenings and did not share their QA results with each other. The reason for performing two QAs is that, had there been a discrepancy between the original conclusion and a single QA, we would have had the word of one examiner against the word of another, which would not constitute an original examiner error per se. One screening (one conclusion on the corruption topic) was later removed from the QA procedure for technical reasons.

We have extremely experienced examiners, and there is a common assumption at the bank that the examiner error rate might be anywhere between 0.0 and 1.0% of all screenings. An examiner error is an extraordinary and critical situation that nobody remembers happening even once during the QAs performed from time to time over the years. Thus, our test success criterion was to find at least one examiner error in the 39 conclusions within 35 screenings.

The summary of the two QAs is presented in Table 9. The distributions of scores for two relevant topics are shown in Fig. 2.

Table 9 Results of two QAs of the top candidates for examiner error (EE) in 2094 screenings.
Figure 2 Distribution of basic model scores for topics: (a) “Drug abuse”; (b) “Corruption”.

By QA-ing 39 examiner conclusions in 35 screenings (out of 2094 screenings) we identified 30 problematic conclusions, where either plain examiner errors are confirmed by both QAs (13 conclusions) or the QAs do not agree (17 conclusions). The remaining 9 conclusions are model errors, where the original examiner did not make a mistake, as confirmed by both QAs. We expected that there would be some cases of QAs not concurring, because some examiner mistakes can be hard calls where a decision is not obvious. For such hard cases, usually a concilium is called where examiners discuss their conflicting conclusions and come to an agreement. In this context, we are satisfied that a significant portion of the test (17 of 39 conclusions) ended up in a concilium. It is dangerous to rely on hard-call conclusions that require a concilium, without a concilium. We do not publish the results of the concilium since they do not contribute to the results of the paper.

We note that in several problematic screenings (DI set by one or both manual QAs), the examiners who conducted the QA made side notes that an examinee practiced counter-measures. Thus we conclude that our models catch some counter-measures. Missing a counter-measure is an examiner error by definition, but we were not sure we would catch anything beyond trivial errors.

We conclude that our models are fit for a one-year pilot, in which 100% of the inflow of new screenings (approx. one hundred a day) will be scored. Manual QAs will be mandated in case of conflict between examiner conclusions and model scores on the topics. The exact model threshold will vary during the pilot, in part depending on the current load of the examiner team. The pilot will start at the end of 2022, after interfacing with the production polygraph report system is completed.

Discussion

Slavkovic’s work on analyzing raw polygraph data twenty years ago remains the most relevant to ours. The similarities are that: (i) both studies work with raw classical polygraph data, (ii) both datasets were collected in the field as opposed to being collected from volunteers instructed to lie, and (iii) the accuracies of our models are roughly equivalent.

We differ in the following:

  i. Slavkovic warned that data may contain examiner errors; we aimed at finding such errors and succeeded at catching examiner mistakes in historical records.

  ii. We experimentally drew the lower bound of the examiner error rate in the field (≈1.5%). This bound did not previously exist, to the best of our knowledge. We believe this finding will provide factual motivation for QA.

  iii. We showed the promise of novel data sources for the accuracy of lie detection models, including examiner ID, examinee job role, wind, atmospheric pressure, and geomagnetic storms.

  iv. We make part of the data accessible by reasonable request to facilitate academic research.

  v. The data are of a very different nature. Our data are hiring and regular screenings of civil personnel as opposed to Slavkovic’s data from army criminal investigations.

  vi. Distributions of classes (lie detected/not detected) differ significantly; our number of records is an order of magnitude higher; the classical polygraphs are of different manufacturers and decades; data sampling rates are 31 Hz vs 60 Hz (ours vs Slavkovic’s).

  vii. Neither Slavkovic nor any other work we know of looked into differentiating topics at training and at inferencing times. We showed that one can profit from both.

Differences v and vi make it hard to compare model accuracies; the nature of the data is very different. Even so, we were surprised that we did not achieve significantly better accuracy twenty years later. We agree with related work that criminal investigations are easier to classify than routine civil personnel screenings3. Thus, we may have a significantly better model whose AUC is nevertheless on par with Slavkovic’s, because the emotions in our dataset are harder to classify and because civil screening questions are much broader than criminal investigation questions.

Honts and Amato recruited 80 volunteers to mock lies and truthful answers in screenings29. In half of the screenings, volunteers watched videotaped questions instead of an examiner asking questions, and a special algorithm (RI Score) scored the answers instead of an examiner evaluation. Honts and Amato conclude that the automated screening scenario was more accurate than the one carried out by a human polygraph examiner. They neither aimed at finding, nor found, any examiner errors. We differ in that Honts and Amato automate the screenings while we automate examiner conclusion verification. We do not substitute the examiner with automated scoring. Moreover, in our setup, we do not show the examiner the scores of our tool (to exclude the possibility of the tool’s results influencing the examiner’s conclusion). The RI Score is rule-based and apparently requires additional markup by an examiner to calculate, whereas our ML models need no markup beyond the NCCA ASCII standard.

Mambreyan et al. show that artificial bias in data with regard to sex leads to overestimating the quality of deception detection models running on video30. They refer to a work that built a model on a dataset of videos where 65% of the women and only 27% of the men lied. A model can learn sex from video and use it to infer the truth/deception label, disregarding any other data. We made sure that we do not have artificial bias in the alternative data we use (sex, age, roles). Particularly for sex data, we demonstrate the balance in Suppl. Table 4.

Abouelenien et al. measure the effectiveness of physiological, linguistic, and thermal features in deception detection on a laboratory dataset of size 149 with three synthetic topics (mock crime, attitude towards abortion, and best friend)31. We explore other alternative data sources using a field dataset and real topics, and, in addition, our main goal is a tool to hunt for examiner errors.

Limitations

This is the first detailed disclosure of building and testing a second-opinion tool for the classical polygraph. Yet the subject is immense, and we may have just scratched the surface. To start with, the manual field validation (double QA-ing the candidates for examiner errors) covered only 39 conclusions, but, as we explained, even this tiny test required approximately 80 examiner hours (not including a concilium to sort out discrepancies between the two QAs). We hope to grow these experimental statistics after putting the models into a live pilot.

Our trainset is contaminated with a small, unknown number of examiner errors and, at least until the field validation, we ran the risk of finding no examiner errors because the models might learn to make all the same mistakes examiners do. Making a gold standard trainset involves double QA-ing hundreds of screenings. While the field validation has proven that our second-opinion tool catches some examiner errors, we still cannot exclude the risk that the models are confused by the most common examiner mistakes. Running our tool in production will slowly but surely grow a gold-standard dataset of screenings QA-ed by three examiners (and a concilium in some cases), thus producing a first-ever gold standard accessible to academics.

With manual QAs we validated errors where an examiner set NDI erroneously, but we did not investigate erroneous DI labels because we lack DI labels. Less than 7% of the screenings in the archive contain DI labels. Applying our work to detect this second type of errors is a direction for future work.

We decided not to implement examiner textbooks because in the examiner community we hear many discussions about exceptions to almost any textbook rule. To avoid being dragged into these heated, undocumented discussions, we decided to use features that do not depend on examiner textbooks or scoring methods. We started with plain and simple raw signal features (min, mean, max on a window). We also started with gradient boosting models. The plan was to up our feature and model game after having the baseline research pipeline built. When we obtained 0.85+ AUC examiner conclusion inference quality, and keeping in mind that we should avoid a perfect model as explained above, we decided that this was enough for a pilot. We believe that developing sophisticated raw signal features and employing neural networks better suited to time series (such as LSTMs) is a promising avenue for future work.

We tested several unorthodox data sources for uplift to conclusion prediction models. While there are some preliminary and promising results, most of these are inconclusive and need more data and investigation.

Methods

Ethics information

All methods were performed in accordance with relevant guidelines and regulations. This study neither required nor used any human participants. The study analyzes legacy polygraph screening data that is collected as part of a standard screening process of hiring candidates and employees with critical roles. The hiring candidates and employees sign a written agreement to be screened, including an informed consent for the Bank to store and to utilize the screening data. Internal Security of the Bank anonymized the data before handing it over to the authors of this study.

Dataset description

We possess an archive of 2094 field polygraph screening recordings (PSRs) including Deception Indicated (DI) attributes set by the examiners who conducted the screenings. These polygraph screenings (PS) were performed on bank personnel with critical roles before hiring, before promotion, or every year, depending on their role. A PS includes a subset of 14 topics, including drug abuse and corruption.

PSRs store the physiological signals of the examinee, audio, and the questions as strings. Each question record includes three time-stamps for each repetition: the start of the question by the polygraph examiner (PE), the end of the question, and the moment of the answer. Each question has a type assigned to it (see Suppl. Table 3 for the list of question types). In addition to the physiological signals (listed in Suppl. Table 1), the examinee’s sex, age, and job position are recorded.
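A simplified view of one question repetition inside a PSR can be sketched as the following data structure; the field names are illustrative and do not correspond to the NCCA ASCII field names.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class QuestionRepetition:
    """One repetition of a question inside a test, as stored in a PSR (illustrative)."""
    text: str                        # question wording
    question_type: str               # e.g. relevant or comparison (see Suppl. Table 3)
    topic: str                       # e.g. "drug abuse"
    t_question_start: float          # examiner starts reading the question
    t_question_end: float            # examiner finishes the question
    t_answer: float                  # moment of the «yes»/«no» answer
    signals: Dict[str, List[float]]  # channel name -> samples at 31 Hz
```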

The screenings were performed on Polyconius polygraph, model 7.

Feature engineering

The basic task was to infer examiner conclusions (DI or NDI) for a certain topic in a screening.

To build a model, we presented the data in the following format: each row in the dataset is a record of a PS for a certain topic in a particular test for an examinee. Targets are DI or NDI. There may be a bias due to such target-setting, because an examinee may not lie in all tests on a topic during a screening.

Physiological signals were extracted from a time window defined by the time-stamps of the question data. Thus, initially a row is a time series of the physiological signals for a given repetition on a given topic.

For every repetition of the relevant and comparison questions, we generate the basic statistics: minimum, maximum, mean, amplitude, and standard deviation. Further, we used minimum, maximum, mean, and standard deviation as aggregate functions at each step. The repetition data were grouped by question. An additional feature characterizes the difference between the first repetition and the subsequent ones. Similarly, the question data were grouped by topic for each test. In the end, each row in the dataset comprises 600 features extracted from a PSR for a certain topic in a particular test (Suppl. Fig. 1) for an examinee, with the label (DI/NDI).
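A minimal sketch of the first aggregation steps is given below, assuming hypothetical channel and question identifiers and random stand-in samples; the real pipeline repeats such aggregation up to the topic level and yields 600 features per row.

```python
import numpy as np
import pandas as pd

def window_stats(signal: np.ndarray, prefix: str) -> dict:
    """Basic statistics of one physiological channel inside a question window."""
    return {
        f"{prefix}_min": signal.min(),
        f"{prefix}_max": signal.max(),
        f"{prefix}_mean": signal.mean(),
        f"{prefix}_amplitude": signal.max() - signal.min(),
        f"{prefix}_std": signal.std(),
    }

# Hypothetical repetition-level rows: one question ("R1") repeated three times,
# with 62 samples (about 2 s at 31 Hz) of a single channel per repetition.
rng = np.random.default_rng(0)
reps = pd.DataFrame([
    {"question": "R1", "repetition": r, **window_stats(rng.random(62), "eda")}
    for r in range(3)
])

# Aggregate repetitions to the question level with min/max/mean/std, as in the text.
question_level = reps.drop(columns="repetition").groupby("question").agg(
    ["min", "max", "mean", "std"])
question_level.columns = ["_".join(c) for c in question_level.columns]
print(question_level.shape)
```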

Models

We used gradient boosting with a two-level stacking ensemble to avoid the curse of dimensionality. The first-level model was trained on the 600 physiological features for a topic, inferencing DI/NDI for each test inside a screening. The second-level model aggregated the output of the first-level model over all tests for a topic. At the second level we have the following features:

  • pred_proba_max—the maximal probability of DI among the tests;

  • pred_proba_mean—the mean probability of DI;

  • pred_proba_min—the minimal probability of DI among the tests;

  • pred_proba_diff—difference between the maximal and mean values.

These probabilities are concatenated with the alternative data (biographical data, weather data, geomagnetic storm data). The resulting dataset is fed to the second-level model, which gives the probability of DI for a topic within a screening.
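The second-level features can be sketched as a per-(screening, topic) aggregation of first-level probabilities; the frame below is a hypothetical example, not our data.

```python
import pandas as pd

# Hypothetical first-level output: DI probability per test within one screening and topic.
first_level = pd.DataFrame({
    "screening_id": [1, 1, 1, 2, 2],
    "topic": ["corruption"] * 5,
    "test": [1, 2, 3, 1, 2],
    "pred_proba": [0.20, 0.70, 0.40, 0.10, 0.15],
})

agg = first_level.groupby(["screening_id", "topic"])["pred_proba"].agg(
    pred_proba_max="max", pred_proba_mean="mean", pred_proba_min="min").reset_index()
agg["pred_proba_diff"] = agg["pred_proba_max"] - agg["pred_proba_mean"]

# These four features are then concatenated with the alternative data
# (biographical, weather, geomagnetic storm) before the second-level model.
print(agg)
```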

Basic model

This model does not receive information about topics during training and inference (Fig. 3). Information about the topic type is saved for further aggregation by screening. For example, the drug abuse (DA) questions of each test are aggregated within a screening. The result of the second-level model is an estimate of the probability of DI for any given topic of a screening.

Figure 3 Universal model scheme. Information that is not received by the model is marked red.

One-topic model

The logic of model construction and feature generation is the same as in the basic model. The difference is that for training, we used features of one topic only. The data is filtered by a single topic before the first-level model is applied. Figure 4 shows how the ensemble is trained on the drug abuse topic, while other topics are filtered out.

Figure 4 One-topic model scheme.

Universal model

We decided to combine the strengths of the models described above, so we built an ensemble of the existing architectures. After a series of experiments, the best result was shown by the architecture that averages the confidences of the following models (Fig. 5):

  • a basic model built on boosting, using alternative data;

  • a model of a single topic, built on boosting, using alternative data;

  • a basic model built on a random forest.

Figure 5 Universal model scheme.

This ensemble was applied to all topics except drug abuse. Aside from the traditional advantages of ensembles, the rationale for using models of different architectures (e.g. gradient boosting and random forest) together is that this will hopefully eliminate some pure model errors and highlight label (target) errors, which constitute examiner errors.
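A minimal sketch of the averaging step, assuming three already-fitted scikit-learn-style classifiers; which feature blocks each member consumes is our assumption for illustration only.

```python
import numpy as np

def universal_score(x_physio, x_alt, basic_gb, one_topic_gb, basic_rf):
    """Average the DI probabilities of the three ensemble members (illustrative).

    Assumption: the two boosting models consume physiological + alternative
    features, while the random forest consumes physiological features only.
    """
    x_full = np.hstack([x_physio, x_alt])
    scores = [
        basic_gb.predict_proba(x_full)[:, 1],
        one_topic_gb.predict_proba(x_full)[:, 1],
        basic_rf.predict_proba(x_physio)[:, 1],
    ]
    return np.mean(scores, axis=0)
```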

Training

Since we did not have much data (2094 files), we used stratified group K-fold validation to evaluate each of the historical screenings. We set K to 5 and, using the developed framework of stacking standard models, obtained 5 models, each trained on 80% of the data.
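A minimal sketch of this scheme with scikit-learn, using random stand-in data; grouping by screening ID keeps all rows of one screening in the same fold, and every screening receives an out-of-fold score.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.default_rng(0)
X = rng.random((500, 20))                # stand-in features
y = rng.integers(0, 2, size=500)         # stand-in DI/NDI labels
groups = rng.integers(0, 100, size=500)  # stand-in screening IDs

oof_scores = np.zeros(len(y))
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y, groups):
    model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
    oof_scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]
# Each of the 5 models is trained on roughly 80% of the data and scores the held-out 20%.
```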

The custom values that we used as hyperparameters for the standard classifiers of the open source libraries are presented in Table 10.

Table 10 Hyperparameters of standard models used in the framework.

Standard hyperparameters can be found in the documentation of the open source ML frameworks; links to the documentation are in Suppl. Table 7.

Validating

We evaluated the quality of the model using the test set of each validation step described in the Training subsection above. Thus, at this stage, we evaluated the success of the model in the classical machine learning sense, i.e. as improvements in the main metrics (ROC-AUC, TPR, FPR).

Testing

Our main focus was not to build a high-quality ML model for polygram classification, but to use an ML model to detect type I errors (FP) of the model in polygraph screenings. Since a type I error from the model’s perspective is equivalent to a type II error (FN) from the examiner’s point of view, this procedure allowed us to find potential labeling errors in our historical sample, where an examiner did not indicate deception when deception should have been indicated. After we got the desired results at the validation stage, we used an expensive resource, the examiners, to re-check the polygrams that were likely to contain examiner errors, as explained in the “Validating ML-based second-opinion in the field” subsection of “Results”.