Boosting people’s ability to detect microtargeted advertising

Online platforms’ data give advertisers the ability to “microtarget” recipients’ personal vulnerabilities by tailoring different messages for the same thing, such as a product or political candidate. One possible response is to raise awareness for and resilience against such manipulative strategies through psychological inoculation. Two online experiments (total N = 828) demonstrated that a short, simple intervention prompting participants to reflect on an attribute of their own personality (by completing a short personality questionnaire) boosted their ability to accurately identify ads that were targeted at them by up to 26 percentage points. Accuracy increased even without personalized feedback, but merely providing a description of the targeted personality dimension did not improve accuracy. We argue that such a “boosting approach,” which here aims to improve people’s competence to detect manipulative strategies themselves, should be part of a policy mix aiming to increase platforms’ transparency and user autonomy.


1.1.1 Personality norms
Questions and distributional information for the raw personality scores were adopted from (2) for extraversion and from (3) for ATI. The authors of (2) provide the mean and standard deviation (SD) of the raw scores for each age year between 21 and 60 based on a large Internet study (N = 132,515; 91% of participants are from the United States and 9% from Canada; no gender-specific norms were available); we were thus able to provide age-matched feedback for extra-/introversion (for participants aged 18-20 years, we used the norms for age 21 years). For ATI, we used the mean and SD of the sample "S5-full" reported in (3) (i.e., no age- or gender-specific norms were available; this sample is a mix of German and US American Mechanical Turk respondents). To achieve consistency across questionnaires, we presented both questionnaires on a 5-point Likert scale. Because the ATI norm study (3) [...]

Figure S1 shows the distributions of the extraversion and ATI percentiles calculated based on the respective norms. The results show that our female UK participants are somewhat more introverted than their age-matched US counterparts and slightly more technology affine than the ATI population. For the following three reasons, we argue that this difference in the level of extraversion does not pose a problem to the validity of our results and that using local norms (i.e., participants' empirical position in the distribution of raw mean scores in our study, or, in short, in-sample percentiles) is a worse and not a better approach.
First, several studies showed that the US and UK populations are similar in terms of extraversion; using a US sample to calculate a UK respondent's extraversion percentile should therefore, in principle, yield similar extraversion percentiles compared with using a UK sample. From this it follows that the extraversion percentiles we used to give extraversion feedback (Experiment 1) and to evaluate the targeting decisions (Experiments 1 and 2) should roughly align with the respondents' placement within their overall population (i.e., all UK residents). Here we highlight three empirical patterns to support this claim: (i) The mean and standard deviation of the distributions of extraversion raw scores of US and UK samples are highly similar (see Table 5 in (5)). (ii) Both our US norm sample (2) and a British household panel study (6) show that extraversion decreases slightly from 18 to 40 years. (iii) In both US and UK samples women reported slightly higher extraversion than men (7). In principle, it would be desirable to use UK- and gender-specific extraversion norms. However, even though there are studies reporting British data (e.g., (5; 6; 7)), we have not yet been able to locate a study (or supplemental material or data) that actually reports the means and standard deviations of the raw scores for the British population, which are necessary to compute global percentiles based on an observed raw score.

[Figure S1; axis labels: "Percentile based on norm data" and "Empirical cumulative density function (ECDF)"]
Figure S1: Distributions of the extraversion and ATI percentiles in Experiment 1. The top row shows histograms and the bottom row shows empirical cumulative distribution functions (ECDF) of the percentiles for extraversion and ATI, respectively (based on the respective norm data). If participants in Experiment 1 were to completely align with the norm data (2; 3), then the histograms would be uniformly distributed and the ECDFs would lie on the main diagonal. Results show that our female UK participants are somewhat more introverted than their age-matched US counterparts and slightly more technology affine than the ATI population. See text for a discussion of these results.

[...] very introverted. Then parts of the more extraverted half of the resulting sample of participants would still be less extraverted than some of the more introverted people in the global population. In other words, even though these participants will have higher in-sample extraversion percentiles compared to their full-population percentiles (e.g., derived based on data such as those reported in (2)), there is no reason to expect them to now respond more to extraverted ads just because they are part of a study that oversampled introverted people. Because our study uses validated stimuli from (1), we aimed to recruit from a population as close as possible to the one microtargeted in that study (i.e., female UK residents aged 18-40; see Methods in the main text for more details). This subpopulation, and the resulting two samples in our study, may not be (but also do not need to be) representative of the population of all UK residents in terms of extraversion. As argued above, our goal is to assess a participant's level of extraversion relative to the global population; to achieve this, we should use estimates of the mean and standard deviation for that global population.
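The distinction between global (norm-based) and in-sample percentiles can be illustrated with a minimal Python sketch. It assumes that a norm-based percentile is obtained from the normal cumulative distribution function given the norm mean and SD, and that a percentile of 0.5 separates introverted from extraverted participants; the source does not spell out either detail. The function names, the norm values, and the sample scores are hypothetical and chosen only for illustration.

```python
import math


def norm_percentile(raw_score, norm_mean, norm_sd):
    """Global percentile of a raw score, ASSUMING normally distributed norms."""
    z = (raw_score - norm_mean) / norm_sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def in_sample_percentile(raw_score, sample_scores):
    """Local percentile: share of sample scores at or below the raw score."""
    return sum(s <= raw_score for s in sample_scores) / len(sample_scores)


# Hypothetical norm values on a 5-point scale (not taken from the norm study).
NORM_MEAN, NORM_SD = 3.3, 0.8

# Hypothetical sample that is more introverted than the norm population.
sample = [2.0, 2.4, 2.6, 2.8, 3.0, 3.1, 3.6, 4.2]

raw = 3.1
global_pct = norm_percentile(raw, NORM_MEAN, NORM_SD)
local_pct = in_sample_percentile(raw, sample)

# Assumed cutoff: percentile above 0.5 means "extraverted".
print(f"global percentile: {global_pct:.2f}, in-sample percentile: {local_pct:.2f}")
```

In this made-up example the same raw score falls below the median of the norm population but above the median of the introverted sample, which is the situation the argument above describes.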
Third, there are also empirical reasons why using global norms in our study seems preferable to using local norms. Let us assume now that, for whatever reason, in-sample percentiles were actually more representative or relevant for the participants in our two experiments and that using those in-sample percentiles would allow us to more truthfully classify participants as extraverted or introverted. If so, then one would expect that re-evaluating participants' targeting decisions according to their in-sample personality category (i.e., calculating their in-sample extraversion percentile, categorizing them as extra- or introverted, and then re-scoring the accuracy of their targeting decisions) should not deteriorate participants' apparent detection performance. If anything, one could expect their apparent performance to improve if in-sample extraversion percentiles were indeed more valid (assuming that people, on average, are more likely than not to correctly assess themselves as extra- or introverted). To empirically assess this conjecture, we conducted a re-analysis of our data. We only included participants from the three control conditions (across both experiments); we did not include participants from the boosting conditions to avoid confounding our analysis with the extraversion feedback (Experiment 1) or exposure to the extraversion definition or questionnaire (Experiment 2). Of the total of 422 control participants, 73 (17%) participants previously categorized as introverted would now be re-categorized as extraverted; given that each participant's in-sample percentile is higher than their global percentile (see Figures S1 & S2), none of the participants previously categorized as extraverted changed their assignment.
After switching the personality category of a participant, the new accuracy (i.e., proportion of correct detection decisions) will, by necessity, be the complement of the original accuracy because the ground truth of each ad for that participant has switched, while the detection decisions themselves have not (e.g., if accuracy previously was 40%, it will now be 60%; if it previously was 80%, it will now be 20%). Results showed that of the 73 control participants who were re-classified as extraverted, only 6 (8%) improved their accuracy. Thirteen (18%) participants with an original accuracy of 50% stayed at 50%, by necessity, because the complement of 50% is again 50%. Notably, fifty-four (74%) participants had a lower accuracy when using their in-sample extraversion percentile. Across all 73 participants, the median participant's accuracy dropped by 40 percentage points. In sum, this re-analysis suggests that the in-sample extraversion percentiles are not better aligned with participants' personality because control participants performed better if they were scored using their global extraversion percentiles based on (2). This suggests to us that using the global percentiles, as we did in this study, is the preferable approach.
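The complement property used in this re-analysis can be checked with a short Python sketch. The ad ordering and the decisions below are hypothetical; the only point is that flipping a participant's personality category flips the ground truth of every ad, so the re-scored accuracy equals one minus the original accuracy.

```python
def accuracy(decisions, ground_truth):
    """Proportion of detection decisions that match the ground truth."""
    correct = sum(d == t for d, t in zip(decisions, ground_truth))
    return correct / len(decisions)


def recategorize(ground_truth):
    """Switching the personality category flips which ads count as targeted."""
    return [not t for t in ground_truth]


# Hypothetical participant originally classified as introverted:
# the first five ads target extraverts, the last five target introverts.
targeted_at_me = [False] * 5 + [True] * 5

# Hypothetical "this ad is targeted at me" decisions for the 10 ads.
decisions = [True, True, True, False, False, True, True, False, False, False]

original = accuracy(decisions, targeted_at_me)
rescored = accuracy(decisions, recategorize(targeted_at_me))
print(f"original accuracy: {original:.0%}, re-scored accuracy: {rescored:.0%}")
```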

Statistical analysis
We used a Bayesian mixed-level logistic regression model implemented in the R package brms (8; 9) and its default, vague priors (see code for exact specifications). The preregistered model's syntax is correct ~ 1 + condition + (1 | id) + (1 + condition | stimuli), where correct is 1 for correct and 0 for incorrect classification decisions, condition is a deviation-coded factor variable for the boosting vs. control condition, id is a unique identifier for participants, and stimuli is a unique identifier for ads. Note that (1 + condition | stimuli) allows the treatment effect to differ in size by ad. Four Markov chain Monte Carlo (MCMC) chains, each with 8,000 samples, were run; the first 4,000 samples of each chain were discarded as warm-up. The MCMC diagnostics indicated good convergence (see section 3.1.2 below).
Posterior distributions were summarized using the median (point estimate) and 95% credible interval (uncertainty interval). Based on the model parameters (see section 3.1.2 below for a summary table), we derived posterior distributions for several key statistics of interest: (a) the probability of a correct detection decision in both conditions, (b) the percentage point difference, and (c) effect sizes between the two conditions. We express effect sizes using the "common language effect size" (CL; 10), which indicates the probability that a randomly selected participant from one condition has a higher value than a randomly selected participant from another condition; a value of 0.5 implies no difference and 1 would imply perfect separation between conditions. CL is well suited to compare conditions in a mixed-level logistic regression model because, unlike the commonly used measures of effect size based on standardized mean differences, CL is invariant to monotonic transformations. That is, its value does not depend on the arbitrary decision about whether to look at the results in log-odds or probability space. We derive the posterior distribution of a CL comparison based on the model's posterior distributions for the participant-population mean and standard deviation in each condition (setting the item effects to zero, that is, considering the average item).
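The derived quantities described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the posterior draws are simulated stand-ins (hypothetical values), the deviation coding is assumed to be +0.5 (boosting) and -0.5 (control), the condition probabilities are evaluated for the average participant and average item, and CL is computed with the closed form for two independent normal distributions on the log-odds scale. All function and variable names are hypothetical.

```python
import math
import random
import statistics


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def summarize(draws):
    """Median and 95% credible interval (2.5th and 97.5th percentiles)."""
    cuts = statistics.quantiles(draws, n=40, method="inclusive")
    return statistics.median(draws), cuts[0], cuts[-1]


def common_language_es(mean_a, sd_a, mean_b, sd_b):
    """P(A > B) for independent normal A and B."""
    return normal_cdf((mean_a - mean_b) / math.sqrt(sd_a ** 2 + sd_b ** 2))


# Hypothetical posterior draws standing in for the fitted model's output:
# intercept, condition effect (log-odds), and participant-level SD.
rng = random.Random(2021)
draws = [
    (rng.gauss(0.6, 0.15), rng.gauss(0.9, 0.20), abs(rng.gauss(0.8, 0.10)))
    for _ in range(4000)
]

diff_pp, cl = [], []
for b0, b_cond, sd_id in draws:
    mu_boost = b0 + 0.5 * b_cond      # assumed deviation coding: +0.5
    mu_control = b0 - 0.5 * b_cond    # assumed deviation coding: -0.5
    diff_pp.append(100.0 * (inv_logit(mu_boost) - inv_logit(mu_control)))
    cl.append(common_language_es(mu_boost, sd_id, mu_control, sd_id))

print("difference (percentage points):", summarize(diff_pp))
print("common language effect size:   ", summarize(cl))


def simulated_cl(transform, n=20000, seed=1):
    """Monte Carlo CL for one fixed parameter set on a transformed scale."""
    sim = random.Random(seed)
    wins = sum(
        transform(sim.gauss(1.05, 0.8)) > transform(sim.gauss(0.15, 0.8))
        for _ in range(n)
    )
    return wins / n


# Same value in log-odds and probability space (monotonic invariance).
cl_logit = simulated_cl(lambda x: x)
cl_prob = simulated_cl(inv_logit)
```

Because the inverse logit is strictly increasing, the ordering of any two simulated participants is identical on both scales, which is why the two Monte Carlo estimates coincide.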

Experiment 2

1.2.1 Personality norms
Experiment 2 followed exactly the same procedure as Experiment 1 (see section 1.1.1). Figure S2 shows the distributions of the extraversion and ATI percentiles calculated based on the respective norms. The results show that our female UK participants are somewhat more introverted than their age-matched US counterparts and similarly technology affine as the ATI population. See section 1.1.1 for an in-depth discussion of why this difference in the level of extraversion does not pose a problem to the validity of our results.

Statistical analysis
The preregistered model's syntax is correct ~ relevance * questionnaire + (1 | id) + (1 + relevance * questionnaire | stimuli), where correct is 1 for correct and 0 for incorrect classification decisions, relevance is a deviation-coded factor variable for the boosting vs. control conditions (i.e., relevant vs. unrelated personality dimension, respectively), questionnaire is a deviation-coded factor variable indicating whether or not participants were administered a questionnaire, id is a unique identifier for participants, and stimuli is a unique identifier for ads. relevance * questionnaire indicates that the model includes the two main effects as well as the interaction relevance:questionnaire.

[Figure S2; axis labels: "Percentile based on norm data" and "Empirical cumulative density function (ECDF)"]
Figure S2: Distributions of the extraversion and ATI percentiles in Experiment 2. The top row shows histograms and the bottom row shows empirical cumulative distribution functions (ECDF) of the percentiles for extraversion and ATI, respectively (based on the respective norm data). If participants in Experiment 2 were to completely align with the norm data (2; 3), then the histograms would be uniformly distributed and the ECDFs would lie on the main diagonal. Results show that our female UK participants are somewhat more introverted than their age-matched US counterparts and similarly technology affine as the ATI population. See text for a discussion of these results. Figure produced using R version 4.1.0 (4).

Figure S5: Personality feedback and description used in the boosting condition in Experiment 1 (i.e., the relevant personality dimension: extraversion). This screenshot is an example for a participant classified as extravert; for participants classified as introverts, the feedback is reframed in terms of introversion (i.e., the title reads "You are introverted" and the text below reads "You are more introverted than [...]

Personality feedback screens
[...]). This screenshot is an example for a participant classified as technology affine; for participants classified as not technology affine, the feedback is reframed in terms of technology aversion (i.e., the title reads "You are technology averse" and the text below reads "You are more averse than [XX] out of 100 people" and "You are less averse than [100 − XX] out of 100 people", where [XX] is the respective percentile).

Figure S9: Comprehension check used in Experiments 1 and 2 prior to starting the detection task. If a participant did not choose the correct answer ("targeted towards my personality type"), the question was shown again up to two more times, alongside the note "The last answer was not correct, please try again:" (i.e., a total maximum of three attempts). The response options were sorted differently after each incorrect response. Only participants who passed the comprehension check within three attempts were included in the analysis (see 1.1 in the main text and the preregistrations).

Table S1: Stimuli: The 10 ads used in Experiments 1 and 2. The ads in the left column are tailored to extraverts and the ads in the right column to introverts. Images and text were adopted from (1).

Figure S11: Detection performance (in terms of the area under the Receiver Operating Characteristics curve, AUC, based on participants' confidence rating), boosting intervention, and level of extraversion (Experiment 1). Detection accuracy is quantified using the AUC based on participants' confidence rating, using the trapezoid method (i.e., no kernel- or model-based smoothing; 11). In particular, this calculation uses a participant's confidence that the ad is targeted towards them (implied by the participant's binary categorization decision and the corresponding rating of how confident the respondent is in the correctness of her decision). An AUC value can be interpreted as the probability that a participant's confidence (in the sense described above) is higher for a randomly selected ad that actually targets this participant compared to a randomly selected ad that does not actually target this participant. A: Scatterplot of participants' detection performance (i.e., AUC; y-axis) and their extraversion percentile (from 0 most introverted to 1 most extraverted; x-axis) for boosting vs. control group (color coded). Dots are slightly jittered vertically to avoid overplotting. Curves and confidence bands show robust LOESS curves (locally estimated scatterplot smoothing using a re-descending M estimator with Tukey's biweight function) and their 95% confidence band. B: Detection performance (i.e., AUC; y-axis) by extraversion quartiles (x-axis) for boosting vs. control group (color coded). Dots show individual participants (jittered horizontally to avoid overplotting). In the boxplots, the box shows the first, second (median), and third quartiles (the 25th, 50th, and 75th percentiles). The lower and upper whiskers extend from the respective end of the box to the most extreme value no further than 1.5 × IQR from the box (where IQR is the inter-quartile range, or distance between the first and third quartiles); outliers are not displayed.

Figure S12: Detection performance, boosting intervention, and education (Experiment 1). A: Detection accuracy (i.e., proportion of correct decisions; y-axis) by education (x-axis) for boosting vs. control group (color coded). The area of the dots and their numbers denote the within-education-and-condition percentage of participants for each of the 11 possible values of a participant's proportion of correct decisions (given the 10 ads). n denotes the number of participants for each combination of education level and condition. B: Detection performance in terms of AUC (y-axis); see Fig. S13 for more details on AUC. Dots show individual participants (jittered horizontally to avoid overplotting). In the boxplots, the box shows the first, second (median), and third quartiles (the 25th, 50th, and 75th percentiles). The lower and upper whiskers extend from the respective end of the box to the most extreme value no further than 1.5 × IQR from the box (where IQR is the inter-quartile range, or distance between the first and third quartiles); outliers are not displayed. Figure produced using R version 4.1.0 (4).
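The AUC measure described in the captions above can be sketched in Python. The sketch assumes that "confidence that the ad is targeted towards them" is formed by signing the confidence rating with the binary decision (positive for "targeted at me", negative otherwise); the source describes this quantity only verbally, so the exact mapping and the rating scale used here are assumptions, and the data are hypothetical. The trapezoid AUC is compared with the pairwise probability interpretation (ties counted as one half).

```python
def signed_confidence(says_targeted, confidence):
    """Assumed mapping: confidence that the ad IS targeted at the participant."""
    return confidence if says_targeted else -confidence


def auc_trapezoid(scores, labels):
    """Area under the empirical ROC curve via the trapezoid rule (no smoothing)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / n_neg, tp / n_pos))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def auc_pairwise(scores, labels):
    """P(score of a targeted ad > score of a non-targeted ad), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical participant: 5 ads that target them, 5 that do not.
labels = [True] * 5 + [False] * 5
says_targeted = [True, True, True, False, True, False, False, True, False, False]
confidence = [4, 3, 2, 1, 4, 3, 2, 1, 4, 2]  # hypothetical rating scale

scores = [signed_confidence(d, c) for d, c in zip(says_targeted, confidence)]
auc_trap = auc_trapezoid(scores, labels)
auc_pair = auc_pairwise(scores, labels)
print(f"trapezoid AUC: {auc_trap:.2f}, pairwise AUC: {auc_pair:.2f}")
```

The two functions agree by construction, which is what licenses reading the trapezoid AUC as the probability stated in the caption.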

Summary of mixed-level logistic regression model
The text below shows the model summary of the brms Bayesian mixed-level logistic regression model (8; 9) reported for Experiment 1. See section 1.1.2 above for more information on the coding of the variables. Estimate shows the median and l-95% and u-95% show the 95% posterior credibility interval (i.e., the 2.5% and 97.5% percentile, respectively) of the respective marginal posterior distribution. For more details see the R help file ?brms::summary.brmsfit.

 Family: bernoulli
  Links: mu = logit
Formula: dec_correct ~ 1 + condition + (1 | id) + (1 + condition | stimuli)
   Data: tbl_targeting_1 (Number of observations: 2840)
Samples: 4 chains, each with iter = 8000; warmup = 4000; thin = 1;
         total post-warmup samples = 16000

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS and Tail_ESS are effective sample size measures, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).

Figure S14: Detection performance (in terms of the area under the Receiver Operating Characteristics curve, AUC, based on participants' confidence rating), boosting intervention, and level of extraversion (Experiment 2). Detection accuracy is quantified using the AUC based on participants' confidence rating, using the trapezoid method (i.e., no kernel- or model-based smoothing; 11). In particular, this calculation uses a participant's confidence that the ad is targeted towards them (implied by the participant's binary categorization decision and the corresponding rating of how confident the respondent is in the correctness of her decision). A: Scatterplot of participants' detection performance (i.e., AUC; y-axis) and their extraversion percentile (from 0 most introverted to 1 most extraverted; x-axis) for boosting vs. control group (color coded) and without and with questionnaire (left & right subplot, respectively). B: Detection performance (i.e., AUC; y-axis) by extraversion quartiles (x-axis) for boosting vs. control group (color coded) and without and with questionnaire (left & right subplot, respectively). See [...]

Summary of mixed-level logistic regression model
The text below shows the model summary of the brms Bayesian mixed-level logistic regression model (8; 9) reported for Experiment 2. See section 1.1.2 above for more information on the coding of the variables. Estimate shows the median and l-95% and u-95% show the 95% posterior credibility interval (i.e., the 2.5% and 97.5% percentile, respectively) of the respective marginal posterior distribution. For more details see the R help file ?brms::summary.brmsfit.

 Family: bernoulli
  Links: mu = logit
Formula: dec_correct ~ relevance + questionnaire + (1 | id) + (1 + relevance * questionnaire | stimuli) + relevance:questionnaire
   Data: tbl_targeting_2 (Number of observations: 5440)
Samples: 4 chains, each with iter = 8000; warmup = 4000; thin = 1;
         total post-warmup samples = 16000

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS and Tail_ESS are effective sample size measures, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
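To make the Rhat diagnostic mentioned in the model summaries more concrete, the following Python sketch computes the classic split-chain potential scale reduction factor. Note the assumption: this is the textbook (non-rank-normalized) split-Rhat, which may differ from the exact variant reported by brms; the simulated chains are hypothetical stand-ins for real posterior draws.

```python
import random
import statistics


def split_rhat(chains):
    """Classic split-Rhat for a list of equal-length chains of draws."""
    halves = []
    for chain in chains:
        half = len(chain) // 2
        halves.append(chain[:half])
        halves.append(chain[half:2 * half])
    n = len(halves[0])                                   # draws per half-chain
    means = [statistics.fmean(h) for h in halves]
    within = statistics.fmean([statistics.variance(h) for h in halves])
    between = n * statistics.variance(means)             # between-chain variance
    var_plus = (n - 1) / n * within + between / n        # pooled variance estimate
    return (var_plus / within) ** 0.5


rng = random.Random(42)

# Four hypothetical chains of 4,000 post-warmup draws from the same target.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(4000)] for _ in range(4)]

# Four hypothetical chains where two are stuck around a different value.
stuck = [[rng.gauss(mu, 1.0) for _ in range(4000)] for mu in (0.0, 0.0, 1.0, 1.0)]

rhat_mixed = split_rhat(mixed)
rhat_stuck = split_rhat(stuck)
print(f"well-mixed chains: {rhat_mixed:.3f}, poorly mixed chains: {rhat_stuck:.3f}")
```

Values close to 1 indicate that the half-chains agree with each other; chains exploring different regions inflate the between-chain variance and push the value above 1.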