Discriminative machine learning for maximal representative subsampling

Biased population samples pose a prevalent problem in the social sciences. To mitigate such bias, we present two novel methods based on positive-unlabeled learning. Both methods leverage auxiliary information from a representative data set and train machine learning classifiers to determine the sample weights. The first method, named maximum representative subsampling (MRS), uses a classifier to iteratively remove instances from the biased data set, by assigning them a sample weight of 0, until it aligns with the representative one. The second method is a variant of MRS – Soft-MRS – that iteratively adapts sample weights instead of removing samples completely. To assess the effectiveness of our approach, we induced artificial bias in a public census data set and examined the corrected estimates. We compare the performance of our methods against existing techniques, evaluating the ability of sample weights created with Soft-MRS or MRS to minimize distributional differences and improve downstream classification tasks. Lastly, we demonstrate the applicability of the proposed methods in a real-world study of resilience research, exploring the influence of resilience on voting behavior. Through our work, we address the issue of bias in the social sciences, among other fields, and provide a versatile, machine-learning-based methodology for bias reduction. Based on our experiments, we recommend using MRS for downstream classification tasks and Soft-MRS for downstream tasks where the relative bias of the dependent variable is relevant.
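To make the MRS idea above concrete, the following is a minimal, hypothetical Python sketch of the iterative loop. The choice of a random forest as the discriminative classifier, the fixed number of removals per iteration, and the AUROC-based stopping threshold are assumptions for illustration, not the exact settings used in our experiments; purely numeric feature columns are assumed.

```python
# Hypothetical sketch of the MRS loop; not the authors' exact implementation.
# Assumes pandas DataFrames `biased` and `representative` with identical,
# purely numeric feature columns.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

def mrs(biased: pd.DataFrame, representative: pd.DataFrame,
        drop_per_iter: int = 5, auroc_threshold: float = 0.52) -> pd.DataFrame:
    """Iteratively drop the most recognisably biased samples until the
    classifier can no longer separate the two data sets (AUROC near 0.5)."""
    remaining = biased.copy()
    while len(remaining) > drop_per_iter:
        X = pd.concat([remaining, representative], ignore_index=True)
        y = np.r_[np.ones(len(remaining)), np.zeros(len(representative))]
        # Out-of-fold probabilities that a sample stems from the biased set.
        proba = cross_val_predict(RandomForestClassifier(n_estimators=100),
                                  X, y, cv=5, method="predict_proba")[:, 1]
        if roc_auc_score(y, proba) <= auroc_threshold:
            break  # distributions are no longer distinguishable
        # Remove the biased samples the classifier is most certain about.
        scores = proba[: len(remaining)]
        keep = np.argsort(scores)[: len(remaining) - drop_per_iter]
        remaining = remaining.iloc[np.sort(keep)]
    return remaining
```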


A Ablation Study
An ablation study was carried out to better understand the individual components. In it, different components of the method were omitted, and the resulting changes were compared. We validated our experiments using the area under the receiver operating characteristic curve (AUROC) and the maximum mean discrepancy (MMD) as metrics. The AUROC ranges from 0 to 1, where values near 0.5 indicate a classifier that performs no better than random guessing, and values closer to 1 indicate an ideal classifier.
To additionally measure the distance between the distributions, we used the MMD to verify that the subsample has a smaller distance to the representative data set. Since the MMD is a distance measure, its range varies between 0 and ∞, and low values indicate similarity between the two sets.
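For concreteness, the MMD between two samples can be estimated with a kernel two-sample statistic; the sketch below uses an RBF kernel, where the kernel choice and bandwidth are assumptions and not necessarily those used in our experiments.

```python
# Minimal (biased) estimator of the squared MMD with an RBF kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd_rbf(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """Squared maximum mean discrepancy between samples X and Y."""
    k_xx = rbf_kernel(X, X, gamma=gamma).mean()
    k_yy = rbf_kernel(Y, Y, gamma=gamma).mean()
    k_xy = rbf_kernel(X, Y, gamma=gamma).mean()
    return k_xx + k_yy - 2.0 * k_xy
```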

Random drop
First, we tested whether MRS produces better results than a naive baseline without any probability estimation, in which samples are dropped at random (Figure 1). This baseline had no information about the representative data set, and each element had the same probability of removal.
Compared to MRS, the AUROC of the random drop decreases at a much lower rate and reaches the random line only in a later iteration (Figure 1a). Differences in the MMD values (Figure 1b) further support our hypothesis that MRS decreases the differences between the distributions. The random-drop MMD values never decrease and are consistently higher than the corresponding MRS values. This confirms that the method actively reduces the differences between the distributions and that the effects are not solely due to the reduced data set size.
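A minimal sketch of such a random-drop baseline, under the same illustrative assumptions as the MRS sketch above, could look as follows.

```python
# Hypothetical random-drop baseline: samples are removed uniformly at random,
# without consulting the representative data set (in contrast to MRS).
import numpy as np
import pandas as pd

def random_drop_curve(biased: pd.DataFrame, drop_per_iter: int = 5,
                      n_iter: int = 50, seed: int = 0):
    rng = np.random.default_rng(seed)
    remaining = biased.copy()
    for _ in range(n_iter):
        if len(remaining) <= drop_per_iter:
            break
        drop = rng.choice(len(remaining), size=drop_per_iter, replace=False)
        remaining = remaining.drop(remaining.index[drop])
        # AUROC and MMD against the representative set are evaluated outside
        # this loop, exactly as for MRS, so the curves are directly comparable.
        yield remaining
```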

Cross-Validation
In the second test, we investigated the influence of cross-validation by comparing MRS with a variant without cross-validation. As shown in Figure 2, the algorithm performs slightly better with cross-validation. This is not apparent in the AUROC values, as both variants perform similarly (Figure 2a), but it is more apparent in the comparison of the MMD values, where MRS performed slightly better overall (Figure 2b). We hypothesize that this is due to the relatively small size of GBS, because cross-validation ensures that all instances are chosen equally often and that every instance is guaranteed to appear in a training set, enabling more confident estimates.
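The difference between the two variants amounts to how the origin probabilities are estimated. The small sketch below contrasts the two; the classifier and the stand-in data are assumptions for illustration only.

```python
# Contrast of in-sample estimates (no cross-validation) with out-of-fold
# estimates; X/y stand in for the concatenated features and origin labels
# (1 = biased set, 0 = representative set).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # stand-in data
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Without cross-validation: the model scores samples it was trained on,
# which tends to yield overconfident probabilities.
proba_plain = clf.fit(X, y).predict_proba(X)[:, 1]

# With cross-validation: every instance is scored by a model that never saw it,
# and every instance is guaranteed to appear in some training fold.
proba_cv = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
```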

B Further experiments on US Census Income
B.1 Less positive class
In contrast to the aforementioned experiment, the non-representative data set contains fewer high-income samples. Figure 3 shows results similar to those of the experiment with an increased fraction of low-income samples. The AUROC decreases until it bounces around the 0.5 line (Figure 3a), while the MMD and the relative bias decrease until the data set becomes too small to maintain similarity and then increase again (Figures 3b and 3c).
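As an illustration of how such a biased subset can be constructed, the sketch below undersamples the positive class; the column name "income", the label value ">50K", and the kept fraction are assumptions, not the exact experimental settings.

```python
# One possible way to induce the studied bias: keep only a fraction of the
# high-income rows so the non-representative subset under-represents them.
import pandas as pd

def undersample_high_income(census: pd.DataFrame, keep_frac: float = 0.25,
                            seed: int = 0) -> pd.DataFrame:
    """Return a non-representative subset with fewer high-income samples."""
    high = census[census["income"] == ">50K"].sample(frac=keep_frac, random_state=seed)
    low = census[census["income"] != ">50K"]
    return pd.concat([high, low]).sample(frac=1.0, random_state=seed)  # shuffle rows
```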

B.2 Representative
To examine how the method behaves on an unbiased data set, we performed an experiment with two representative data sets (Figure 4). The AUROC in the first iterations was already small enough that MRS would have stopped after removing only a few samples (Figures 4a and 4b), indicating that the method detects data sets with similar distributions. For this experiment, however, the implementation was changed to ignore the stopping criterion and run continuously, so that we could inspect the subsequent behavior. The AUROC (Figure 4a) bounces around 0.49 but does not deviate much. The relative bias changes only slightly at the end (Figure 4c). With few exceptions, the representative ratio is retained until the data set becomes too small.
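A sketch of this sanity check, reusing the hypothetical helpers from the earlier sketches (the `mrs` function and the `census` frame are illustrative objects defined above, not the authors' implementation), could look as follows.

```python
# Split one data set into two representative halves and run the MRS sketch
# between them with the stopping criterion effectively disabled.
from sklearn.model_selection import train_test_split

half_a, half_b = train_test_split(census, test_size=0.5, random_state=0)

# With two samples from the same distribution the classifier should hover
# around AUROC 0.5, so MRS would normally stop almost immediately; a threshold
# of 0 disables that early stop and lets the loop run until the data set is small.
subsample = mrs(half_a, half_b, auroc_threshold=0.0)
```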
Supplementary Figure 1: AUROC and MMD of the statistical correction of GBS with the auxiliary information of GESIS. The MRS metrics are compared with a random-drop variant. The experiment was repeated ten times. The dark lines represent the means, and the surrounding shaded areas denote the standard deviations.
Supplementary Figure 2: AUROC and MMD of the statistical correction of GBS with the auxiliary information of GESIS. This figure compares the MRS metrics with a variant without cross-validation. The experiment was repeated ten times. The dark lines represent the means, and the surrounding shaded areas denote the standard deviations.
Supplementary Figure 3: AUROC, ROC curve, MMD, and relative bias for income of the statistical correction of the US Census Income subset containing fewer positive samples (high income), using the auxiliary information of the representative subset. The vertical lines indicate the iteration in which the remaining samples were declared to be maximal representative subsamples, and the numbers at the top give the number of remaining samples.