Statistical modeling, estimation, and remediation of sample index hopping in multiplexed droplet-based single-cell RNA-seq data

Sample index hopping can substantially confound the analysis of multiplexed sequencing data due to the resulting erroneous assignment of some, or even all, of the sequencing reads generated by a cDNA fragment in a given sample to other samples. In those target samples, the data cross-contamination artifact takes the form of “phantom molecules”, molecules that exist only in the data by virtue of read misassignment. The presence of phantom molecules in droplet-based single-cell RNA-seq data should be a cause of great concern since they can introduce both phantom cells and artifactual differentially-expressed genes in downstream analyses. More importantly, even when the index hopping rate is very small, the fraction of phantom molecules in the entire dataset can be high due to the distributional properties of sequencing reads across samples. To our knowledge, current computational methods are unable to accurately estimate the underlying rate of index hopping nor adequately correct for the resultant misassignment in droplet-based single-cell RNA-seq data. Here, we introduce a probabilistic model that formalizes the phenomenon of index hopping and allows the accurate estimation of its rate. Application of the proposed model to several multiplexed datasets suggests that the sample index hopping probability for a given read ranges between 0.003 and 0.009, arguably low numbers, even though, counter-intuitively, they can give rise to a large fraction of phantom molecules (as high as 85%) in any given sample. We also present a model-based approach for inferring the true sample of origin of the reads that are affected by index hopping, thus allowing the purging of the majority of phantom molecules in the data. Using empirical and simulated data, we show that we can reassign reads to their true sample of origin and remove predicted phantom molecules through a principled probabilistic procedure that optimally minimizes the false positive rate.
Thus, even though sample index hopping often substantially compromises single-cell RNA-seq data, it is possible to accurately quantify, detect, and reassign the affected reads and remove the phantom molecules generated by index hopping. Code and reproducible analysis notebooks are available at https://github.com/csglab/phantom_purge.


Précis
Due to the increasing capacity of modern sequencing platforms, sample multiplexing, the pooling of barcoded DNA from multiple samples in the same lane of a high-throughput sequencer, is rapidly becoming the default option in single-cell RNA-seq (scRNA-seq) experiments. However, as several studies have recently shown (Sinha et al., 2017; Costello et al., 2018), multiplexing leads to incorrect sample assignment of a significant fraction of demultiplexed sequencing reads. Out of several mechanisms that can introduce sample index misassignment (MacConaill et al., 2018), the presence of free-floating indexing primers that attach to the pooled cDNA fragments just before the exclusion amplification step in patterned sequencing flowcells has been shown to be the main culprit (Illumina, 2017). This phenomenon is known as sample index hopping and results in a data cross-contamination artifact that takes the form of phantom molecules, molecules that exist only in the data by virtue of read misassignment (Figure 1a). The presence of phantom molecules in droplet-based scRNA-seq data should be a cause of great concern since they can introduce both phantom cells and artificial differentially expressed genes in downstream analyses. Importantly, it is conceivable that even when the index-hopping rate is very small, the fraction of phantom molecules can still be high due to the distributional properties of sequencing reads across samples (Figure 1b).
Despite recent attempts to computationally estimate the rate of sample index hopping in plate-based scRNA-seq data (Larsson et al., 2018; Griffiths et al., 2018), no statistical model of index hopping for droplet-based scRNA-seq data has yet been proposed. Consequently, current computational methods can neither accurately estimate the underlying rate of index hopping nor adequately remove the resulting phantom molecules in droplet-based scRNA-seq data. This has been a challenging problem since droplet-based libraries are tagged with a single sample index rather than a unique combinatorial pair of sample indices such as those used in plate-based approaches. As a solution to this problem, we here propose a statistical framework that provides (i) a generative probabilistic model that formalizes in a mathematically rigorous manner the phenomenon of index hopping, (ii) a statistical approach for inferring the sample index hopping rate (SIHR) in droplet-based scRNA-seq data at the level of individual reads, (iii) a non-heuristic, model-based approach for inferring the true sample of origin of hopped sequencing reads, and (iv) a data decontamination procedure for purging phantom molecules that optimally minimizes the false positive rate of molecule re-assignments.
The generative probabilistic model we propose starts with the observation that each cDNA fragment, in addition to its sample barcode index, has a cell barcode and a unique molecular identifier (UMI), and maps to a specific gene. As has been suggested previously (Griffiths et al., 2018), we make the assumption that any particular cell-UMI-gene combination (hereafter referred to as CUG) is so unlikely that it cannot arise independently in any two different samples (Section 1). Accordingly, each CUG would represent one unique molecule and all sequencing reads with the same combination would correspond to PCR amplification products of that original molecule. A second assumption we make is that the probability of index hopping is the same for all reads, regardless of the source or target sample of the read (Section 2). We were able to validate both assumptions (Figure 1e and Section 4) by analyzing data from an experiment in which two 10X Genomics scRNA-seq sample libraries were sequenced in two conditions. In the first condition, the samples were multiplexed on the same lane. In the second condition, the two sample libraries were sequenced on two separate lanes of a HiSeq 4000 (this non-multiplexed condition provides a ground truth for the true sample of origin of each CUG). The validation dataset has been deposited on the open-access data repository Zenodo (https://doi.org/10.5281/zenodo.3267922). Building on these two assumptions, we derived a mixture-of-multinomials model whose likelihood is governed by a single "index hopping" parameter that determines the observed distribution of read counts across samples at each PCR amplification level (Methods Section 3.1.2). We used this model to further derive a closed-form expression of the probability distribution of "chimeric" observations, namely CUGs whose corresponding reads are assigned to multiple samples (Figure 1c).
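The construction of CUGs from annotated reads can be illustrated with a minimal pandas sketch. The column names and toy reads below are hypothetical; the actual implementation in the phantom_purge repository may differ:

```python
import pandas as pd

# Hypothetical demultiplexed read table: one row per mapped read,
# annotated with cell barcode, UMI, gene, and (possibly hopped) sample index.
reads = pd.DataFrame({
    "cell":   ["ACGT", "ACGT", "ACGT", "TTGC", "TTGC"],
    "umi":    ["AAAC", "AAAC", "AAAC", "GGTA", "GGTA"],
    "gene":   ["Actb", "Actb", "Actb", "Gapdh", "Gapdh"],
    "sample": ["S1",   "S1",   "S2",   "S1",   "S1"],
})

# Collapse reads into CUGs (cell-UMI-gene combinations) and count reads
# per sample; each CUG is assumed to represent one original molecule.
counts = (reads.groupby(["cell", "umi", "gene", "sample"])
               .size()
               .unstack(fill_value=0))

# A CUG is chimeric if its reads are split across more than one sample.
chimeric = (counts > 0).sum(axis=1) > 1
print(chimeric.sum())
```

In this toy table the first CUG has reads in both S1 and S2 and is therefore chimeric; the second is non-chimeric.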
We were then able to estimate the index hopping rate by fitting the resulting generalized linear model to the empirically observed distribution of chimeric CUGs across PCR amplification levels. We observed close agreement between the observed and fitted non-chimeric CUGs across multiple datasets, including two eight-sample HiSeq datasets from mouse epithelial cells (Bach et al., 2017; Griffiths et al., 2018) and two 16-sample NovaSeq 6000 datasets from Tabula Muris (Consortium, 2018) (Figure 1d and Figure 8). We also observed excellent agreement between our SIHR estimate and the empirical estimate for the dataset with known ground truth (Figure 1e and Figure 6). Overall, we found that the SIHR ranges between 0.3% and 0.9% in the datasets we analyzed (Table 5).

Figure 2: (a) The simplex containing all 10 possible read count outcomes at PCR amplification level r = 3 for the case of 3 samples. The pie charts depict the posterior probabilities of the true sample of origin for each outcome. π_3 represents the proportion of molecules with r = 3 that originate from each sample in this toy example. (b) Effect of purging and discarding on the number of false positive and false negative counts. Points closer to the origin are more optimal. FP: false positive; FN: false negative. (c) The molecular library complexity profile for eight samples in a previously published HiSeq 4000 dataset (Bach et al., 2017). (d) The proportion of real and phantom molecules in each of the eight samples; the fraction of molecules and mapped reads in the entire dataset that belong to each sample; the read-to-molecule ratio (RMR) per sample; and the corresponding FDR statistics (i.e. the within-sample proportion of molecules that we misclassify as real after purging). (e) The effect of purging on gene expression profiling in the validation dataset with known ground truth. Each dot represents one gene expression measurement in one cell. Dots are colored based on cell-sample assignment in the ground truth, with red and blue representing cell barcodes that are found only in Sample 1 or Sample 2, respectively. Green represents cell barcodes that are found in both samples (barcode collision). Note that non-zero Sample 1 UMI counts for blue dots and non-zero Sample 2 UMI counts for red dots represent phantom molecules. (f) Same as panel (e), but with the counts aggregated per cell (each dot represents one cell). Blue dots with high counts in Sample 1 and red dots with high counts in Sample 2 represent phantom cells. Green dots represent potential transcriptome mixing.
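The estimation step described in the text can be illustrated with a simplified maximum-likelihood sketch. Assuming each read keeps its true index with probability 1 − p and hops uniformly to each of the other S − 1 samples, a molecule with r reads is non-chimeric with probability (1 − p)^r + (S − 1)(p/(S − 1))^r, and p can then be fit to observed chimera tallies. The tallies below are made up for illustration; this is not the paper's GLM implementation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

S = 8  # number of multiplexed samples

def p_nonchimeric(p, r):
    # Under uniform hopping (keep the true index w.p. 1-p, hop to each of
    # the other S-1 samples w.p. p/(S-1)), all r reads of a molecule land
    # in the same sample with this probability.
    return (1 - p) ** r + (S - 1) * (p / (S - 1)) ** r

# Hypothetical tallies: at each PCR amplification level r, the number of
# observed CUGs and how many of them are chimeric.
r = np.array([2, 3, 4, 5])
n_total = np.array([50000, 30000, 15000, 5000])
n_chim = np.array([495, 440, 295, 123])

def neg_loglik(p):
    q = p_nonchimeric(p, r)  # P(non-chimeric | r, p)
    # Binomial log-likelihood of the chimeric/non-chimeric split at each r.
    return -np.sum((n_total - n_chim) * np.log(q) + n_chim * np.log1p(-q))

fit = minimize_scalar(neg_loglik, bounds=(1e-6, 0.5), method="bounded")
print(f"estimated SIHR: {fit.x:.4f}")
```

With these toy tallies the estimate comes out near 0.005, inside the 0.003-0.009 range the paper reports for real datasets.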
Furthermore, the proposed framework also allows the calculation of the posterior distribution of the true sample of origin for each CUG, given its read counts across samples, the SIHR, and the molecular proportions complexity profile of the samples (Figure 1a). It then assigns each CUG to the most likely sample of origin, and further decontaminates the data by removing assignments that have low posterior probability. We have devised an approach for estimating the number of false positives (FP) and false negatives (FN) for distinguishing true molecules from phantom molecules after decontamination at different posterior probability cutoffs, enabling the selection of a cutoff based on a user-specified marginal trade-off parameter that represents the number of real molecules one is willing to discard in order to correctly purge one phantom molecule (see Methods Section 3.4 and Table 6). These model-based FP and FN estimates are in excellent agreement with empirical estimates based on ground truth (Figure 2b). We observed that we can achieve a sensitivity of 0.999 (down from 1 in the non-purged data) and a specificity of 0.979 (up from 0 in the non-purged data) in distinguishing true molecules from phantom molecules across all samples. Furthermore, our model-based purging of phantom molecules outperforms, on both validated and simulated data, a previous heuristic approach (Griffiths et al., 2018) that is based on retaining CUGs with >80% (default setting) of reads assigned to one sample (Figure 2b, Figure 4.2.3, Figure 12, Table 11, Table 4.3).
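Under the same uniform-hopping assumption, the posterior over the true sample of origin of a molecule is proportional to the prior molecular proportion times the multinomial likelihood of its read counts. A minimal sketch with toy numbers (not the paper's implementation):

```python
import numpy as np

def posterior_origin(y, pi, p):
    """Posterior over the true sample of origin of a single molecule,
    given its read counts y across S samples, the molecular proportions
    pi at this PCR amplification level, and the hopping probability p."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    S = len(y)
    hop = p / (S - 1)
    post = np.empty(S)
    for s in range(S):
        probs = np.full(S, hop)
        probs[s] = 1 - p  # reads keep the true index w.p. 1-p
        post[s] = pi[s] * np.prod(probs ** y)  # prior x multinomial likelihood
    return post / post.sum()

# Toy example: 3 samples, molecule observed with read counts (2, 1, 0).
q = posterior_origin([2, 1, 0], pi=[0.5, 0.2, 0.3], p=0.01)
print(q.round(3))  # posterior mass should concentrate on the first sample
```

A cutoff on the maximum posterior probability then decides whether the molecule is reassigned or discarded, which is the knob the FP/FN trade-off described above tunes.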
Surprisingly, we found that the proportion of phantom molecules (PPM) varies widely depending on the molecular proportions complexity profile of the samples (Figure 2c and Figure 5), and can reach as high as 86% in low-complexity samples (i.e. samples in which a relatively small number of unique molecules contribute to the majority of sequencing reads, Figure 2d). We believe that high PPM values might in fact be common in multiplexed datasets that contain both low- and high-complexity samples, since hopping of even a small fraction of reads from high- to low-complexity samples can create more phantom molecules than the unique real molecules native to the low-complexity sample. Overall, these results indicate that even a small sample index hopping rate can be a substantial confounding factor in multiplexed scRNA-seq experiments by overwhelming the data with phantom molecules. However, the effects of index hopping on the integrity of the data (Figure 11 and Tables 7, 8, 9, and 10), whether it leads to mixing of transcriptomes, phantom cells, or the misclassification of RNA-containing cells, can be almost completely remedied by model-based purging of phantom molecules (Figure 2e,f).

Background
This section describes the phenomenon of sample index hopping and provides a brief overview of existing published research that aims to quantify its rate and correct for its data-corrupting effects. The section also highlights the limitations of existing approaches and argues that index hopping is a significant problem in the field, one for which no satisfactory solution yet exists.

Overview of Sample Index Hopping on Illumina's Sequencers
Patterned flowcells. Ever since the emergence of next-generation sequencing technologies in 2006, the speed and throughput at which we can perform whole-genome DNA and RNA sequencing have been steadily increasing. For example, in recent years Illumina has introduced a family of sequencing platforms (i.e. the HiSeq 4000 and NovaSeq 6000) characterized by substantially increased capacity, faster run times, and lower sequencing cost. The significant boost in sequencing throughput (as measured by the total number of nucleotide sequence reads generated in a single run) first came about in 2015 when Illumina introduced the patterned flow cell technology on HiSeq 4000 sequencers. In particular, two key innovations led to the significantly improved performance. (1) A flow cell surface design that optimizes the prearranged spacing of nanowells, thus allowing the accurate imaging of billions of clusters of amplified cDNA fragments. (2) The Exclusion Amplification (ExAmp) chemistry that instantaneously amplifies a single cDNA fragment, effectively excluding and preventing other cDNA fragments from forming a cluster within the same nanowell. In 2017, Illumina launched the NovaSeq 6000 instrument with an updated flow cell design that further decreased the spacing between the nanowells, thus significantly increasing the density of generated clusters and, consequently, the amount of data that can be generated. Currently, a total of 2.5B single reads can be generated on each of the four lanes that comprise a single NovaSeq S4 flow cell, compared to 350M reads on each of the eight lanes of a HiSeq 4000.
Sample multiplex sequencing. However, given that for most sequencing studies the number of reads required to achieve adequate transcriptome coverage tends to be in the tens of millions, it becomes necessary to resort to sample multiplex sequencing strategies to fully utilize the massive capacity and cost efficiency of patterned flow cells with ExAmp chemistry. The process of sample multiplexing can be briefly summarized in the following few steps. First, a sample barcode index is added to each cDNA fragment during library preparation (to one end in single-indexing or to both ends in dual-indexing). Then, cDNA fragments from multiple samples are pooled together in the same lane to be cluster-amplified and sequenced. Finally, the generated sequencing reads are demultiplexed into their respective source samples using the sample barcode indices. Note that with current technology, in a single run of a droplet-based RNA-seq experiment, a maximum of 384 samples can be multiplexed (Chromium i7 Multiplex Kit) using a 96-plex on each of the four lanes of a NovaSeq S4 flow cell.
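The demultiplexing step above can be sketched as a simple exact-match lookup on the sample index (the i7 index sequences below are hypothetical; production tools such as bcl2fastq additionally tolerate index mismatches):

```python
from collections import defaultdict

# Hypothetical mapping from i7 sample index sequence to sample name.
index_to_sample = {"ATTACTCG": "S1", "TCCGGAGA": "S2"}

def demultiplex(reads):
    # Route each (index, sequence) read to its sample bin; reads whose
    # index matches no known sample go to an "undetermined" bin. A read
    # whose index hopped is silently routed to the WRONG sample bin,
    # which is how phantom molecules arise downstream.
    bins = defaultdict(list)
    for index, sequence in reads:
        bins[index_to_sample.get(index, "undetermined")].append(sequence)
    return bins

reads = [("ATTACTCG", "ACGT..."), ("TCCGGAGA", "GGCA..."), ("NNNNNNNN", "TTAG...")]
bins = demultiplex(reads)
print({k: len(v) for k, v in bins.items()})
```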
Sample read misassignment. Unfortunately, sample multiplexing can cause a significant percentage of the demultiplexed sequenced reads to be misassigned to an incorrect sample barcode. Although sample read misassignments can arise due to several factors (MacConaill et al., 2018), one specific mechanism termed sample index hopping is the primary cause of read misassignments in patterned flow cells. Index hopping is believed to result from the presence of free-floating indexing primers that attach to the pooled cDNA fragments just before the exclusion amplification step that generates clusters on the flow cell. In addition, if during library preparation, free adapters or primers are not properly removed, the resulting purified library would show higher levels of index hopping that increases linearly with the molar concentration of free adaptors relative to DNA input (Illumina, 2017).

Discovery and Quantification of Index Hopping
The problem of index hopping was first identified as early as December 2016 in a blog post reporting read sample misassignment on HiSeq 4000 and HiSeqX platforms (Hadfield, 2016). Soon afterwards, Illumina (2017) released a white paper acknowledging that index switching does indeed occur and tends to be higher in machines that use a patterned flow cell, but maintaining that the phenomenon affects only <2% of reads. However, around the same time, Sinha et al. (2017) reported that 5-10% of sequencing reads were incorrectly assigned a sample index in a multiplexed pool of plate-based scRNA-seq samples. More recently, using two plate-based scRNA-seq datasets, Griffiths et al. (2018) provided a lower estimate of index swapping on the HiSeq 4000 at approximately 2.5%. Another study (Vodák et al., 2018), conducted in the context of exome sequencing, also reported contamination (an index hopping rate) ranging from 2% to 8%. Using unique antigen receptor expression, Yao et al. (2018) estimated the index hopping rate in plate-based single cell RNA-seq data (i.e. spread-of-signal across wells) to be approximately 3.9%. They also found no evidence that particular sample barcode indices are more prone to index swapping than others. With the aim of reconciling the conflicting hopping rate estimates that had been reported, Costello et al. (2018) used non-redundant dual-indexing adapters developed in-house and performed an exhaustive study across multiple libraries (whole genomes, exome, and stranded RNA) and sequencer models (HiSeq 4000, NovaSeq 6000, and HiSeqX) to determine the rate of index hopping. They observed a rate of 0.2% to 6% in all sequencing runs. More importantly, they showed that even in bulk RNA-seq libraries where the index hopping rate was as low as 0.32%, spurious results can be generated in downstream analysis. This finding mirrors the conclusions reached by the other five papers referenced in this paragraph, including Illumina's white paper.

Impact and Significance of Index Hopping.
Given the large amount of data that are being increasingly generated using multiplexed sequencing, index hopping should be a great cause of concern due to the potential signal artifacts and spurious results it can generate. Here we list the three main unwanted consequences that sample index hopping can bring about in scRNA-seq experiments through the introduction of phantom molecules.
1. Mixing of transcriptomes: when a given cell-barcode is observed in both the donor and target samples, index hopping would result in phantom molecules being introduced in the cell with the corresponding cell-barcode in the target sample. When this happens, the rates of false positives and even false negatives can increase in downstream statistical analyses such as differential gene expression.

2. Phantom cells: when a given cell-barcode is observed only in the donor sample, but not the target, phantom molecules would also lead to phantom cells emerging in the target sample.
3. Cell-barcode misclassification: an abundance of phantom molecules associated with a given cell-barcode would lead cell-barcoding algorithms (used to computationally determine which cell-barcodes originate from proper cells) to incorrectly classify an empty droplet as a cell and, conversely, to classify a proper cell as an empty droplet.
Mitigating the Effects of Sample Index Hopping.

Experimental Strategies.
Given that index hopping can occur on all sequencing platforms that use patterned flow cells, one can resort to any of the following three experimental strategies to prevent sample index hopping from compromising the integrity of experimental data.
1. One sample per lane. By avoiding multiplexing altogether and running one sample per lane, one can confine the sample indices in a given lane to one given sample only, thus minimizing the risk of index hopping occurring in the first place. However, sample read misassignments can still occur due to other causes.
2. Unique-at-both-ends dual indexing library. By utilizing two unique barcodes for each sample, sample misassignment would occur only when both sample indices hop. Given the low probability of such an event, index hopping can be reduced and computationally mitigated by discarding reads with unexpected index combinations post hoc.
3. Post-library prep treatment. By using Illumina's Free Adapter Blocking Reagent, the 3' ends of the free adapters become blocked, preventing their extension and thus reducing the rate of index hopping (Illumina, 2017).
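The post-hoc filtering mentioned under strategy 2 can be sketched as follows (the index sequences and the set of expected pairs are hypothetical):

```python
# In a unique-at-both-ends dual-indexed pool, each sample has its own
# (i7, i5) pair. A read whose observed pair is not an expected combination
# indicates that at least one of its two indices hopped, so it is discarded.
expected = {("ATTACTCG", "AGGCTATA"), ("TCCGGAGA", "GCCTCTAT")}

def keep(i7, i5):
    return (i7, i5) in expected

reads = [("ATTACTCG", "AGGCTATA"),  # expected pair: kept
         ("ATTACTCG", "GCCTCTAT")]  # swapped combination: discarded
kept = [r for r in reads if keep(*r)]
print(len(kept))
```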
Limitations of Experimental Strategies. Although running one sample per lane can be feasible on HiSeq platforms for some single cell RNA-seq libraries, it would be financially prohibitive on the newer, higher-capacity NovaSeq platforms. As for unique-at-both-ends indexing, the strategy is currently incompatible with single-index droplet-based protocols such as the widely used 10x Genomics single cell protocol. Lastly, the blocking reagent Illumina provides is only compatible with a few bulk DNA and RNA protocols, but not with single cell protocols. Whether it can be incorporated into single cell droplet-based protocols is yet to be determined.

Computational Strategies.
Given the limitations of existing experimental strategies to eliminate index hopping, the urgency to develop computational strategies instead has only intensified since the phenomenon was first discovered. Indeed, the first attempt to computationally correct for index hopping in multiplexed sequencing libraries was only recently published (Larsson et al., 2018). However, the linear regression method the authors propose is applicable only to plate-based scRNA-seq samples, where a unique combination of a pair of sample barcodes in each well determines the identity of cells and can give rise to a "crosshair" pattern in the data when index hopping occurs. It is this pattern that allows the estimation of the fraction of hopped reads and upon which their proposed approach relies. Another limitation of their approach is that it uses only a subset of the data, corresponding to genes whose cell-specific expression is above a given threshold. Griffiths et al. (2018) subsequently developed a more accurate method for plate-based scRNA-seq that makes fewer assumptions and uses data from all the genes. In the same paper, the authors also proposed a heuristic computational strategy for droplet-based 10x Genomics scRNA-seq experiments that excludes hopped reads without removing the corresponding cell libraries. They did not, however, attempt to propose a modeling framework for estimating sample index hopping in droplet-based data: unlike plate-based assays, droplet-based assays do not use a unique pair of sample barcodes, but rather use one mate of the paired read for quantification and the second mate to carry the UMI and cell barcode tags, rendering the "crosshair" pattern approach inapplicable.
Limitations of existing computational strategies. The computational strategy proposed in Griffiths et al. (2018) is the only attempt we are aware of that is aimed at estimating the rate and mitigating the effects of index hopping in droplet-based scRNA-seq data. Nonetheless, the strategy the authors propose, similar to the other computational approaches we mentioned in the previous paragraph, does not in fact estimate the index hopping rate of individual reads, but rather computes a proxy measure, the swapped fraction in the case of Griffiths et al. (2018), which they define as the fraction of molecules (i.e. those with an identical UMI, cell barcode, and gene label) that are observed in more than one sample. As we show in this paper, the probability that an individual sequencing read swaps its sample index is not the same as the probability that one out of all the PCR-duplicated reads from the same molecule swaps the sample index. Even when the sample index hopping probability at the level of individual reads is very low, the fraction of phantom molecules, molecules that we observe to have swapped sample indices, can vary greatly, even to the point of comprising the vast majority of molecules in a given sample. As for correcting for the effects of index hopping, the molecular exclusion heuristic the authors propose suffers from high false positive and false negative rates, resulting in unnecessarily discarding molecules that should have been retained and, conversely, retaining molecules that should have been discarded. In particular, by overlooking the possibility that a given fraction of observations consists entirely of phantom molecules, the molecule exclusion strategy they propose can retain a high proportion of false positive molecules.
Furthermore, by assigning the true sample of origin as the sample with the largest read fraction while discarding all molecules that have a read fraction below a given threshold (default ≤ 0.8), their strategy would further result in a high false positive rate, since neither the information implicit in the absolute read count distribution nor the highly variable distribution of molecular proportions (i.e. library complexity) across samples and PCR duplication levels is considered. For example, even though an observation such as y = (1, 4, 0, 0) has its highest read fraction at s = 2, that assignment would still be an unlikely event if, for instance, we have 70 times more molecules in Sample 1 than in Sample 2, that is, π_5 = (0.7, 0.01, 0.19, 0.1).
Contrasts with Proposed Approach. The probabilistic model we propose in this paper aims at capturing the distributional patterns of sequencing read counts in order both to estimate the fundamental quantity of interest, the sample barcode index hopping rate at the level of individual reads, and to quantify its manifestation as a data contamination artifact, as measured by the fraction of phantom molecules. As we show (see Equation 3), the fraction of phantom molecules is determined not only by the sample index hopping rate, but also by the number of multiplexed samples, as well as by their library size, coverage, and molecular complexity profiles (i.e. the distribution of molecular proportions conditional on the PCR duplication levels). As such, this complex relationship, which we have attempted to capture in our proposed model, can potentially provide an explanation for the large variance in the index hopping rate estimates (0.2%-8%) that have been reported in the literature (Larsson et al., 2018; Griffiths et al., 2018; Costello et al., 2018; Vodák et al., 2018). Furthermore, we use the probabilistic modeling approach to discard molecules not based on their observed read fraction, but rather on a corresponding posterior probability that is a function not only of the distribution of the read counts, but also of the model's estimated sample index hopping rate, the library sizes, and the expression profiles of the multiplexed samples, as the following paragraph illustrates.
Illustrative Examples from Empirical Data. Consider the following scenarios encountered in the data. For example, in the HiSeq 4000 dataset, q (the maximum posterior probability of the true sample of origin) is higher for outcome y^(opt) = (0, 0, 0, 1, 1, 0, 0, 1) than it is for y^(below) = (0, 1, 2, 0, 0, 0, 1, 0), even though in the latter the inferred true sample of origin s = 3 (Sample B1) has two reads, whereas the three molecules in the former have one read each, including the inferred true sample of origin s = 5. This might seem counter-intuitive at first sight, but once we consider the proportion of molecules, it becomes apparent why it makes sense to retain the molecule corresponding to s = 5 but discard the one corresponding to s = 3. For this observation, we have π_r = (0.075, 0.136, 0.004, 0.005, 0.169, 0.517, 0.082, 0.011) for r = 3. Given that the lowest proportions are for s = 3, 4, 8, in that order, and the highest for s = 5, it does make sense to expect that the two reads in y^(opt) originated from s = 5. In contrast, for y^(below), it is not really clear that the hopped reads originated from s = 3, given that the sample has the lowest proportion of molecules. For the HiSeq 2500 data, we have π_r = (0.096, 0.173, 0.001, 0.001, 0.216, 0.409, 0.093, 0.011) for r = 3, which explains why we end up discarding the molecule s = 8 in y^(below) but not in y^(opt), since there are half as many molecules in s = 5 as in s = 6, making it more likely that the read indeed hopped from s = 8. We have a similar situation for the NovaSeq 6000 datasets, where the sample with the lowest proportion of molecules (i.e. 0.005) corresponds to s = 15 (i.e. Sample P7_8), making it more likely that the two hopped reads in y^(opt) of L1, the one hopped read in y^(below) of L1, and the one hopped read in y^(opt) of L2 actually originated from s = 9.
This also makes y^(below) of L2 more likely to be a fugue non-chimeric observation (see definitions, Section 3.1.3), such that the one read in s = 15 would in reality have hopped from a sample with a larger number of molecules.

Methods
Although much of what we propose in this section can be ported, modified, and applied to other protocols, we limit the scope of the proposed approach to droplet-based scRNA-seq data, with libraries prepared using the 10x Genomics Single Cell Protocol and subsequently multiplexed for sequencing on Illumina machines.

Generative Model Specification
Consider a droplet-based single-cell sequencing experiment where a total of S (ranging from 2 to 96) libraries are pooled together and multiplexed on the same lane of a patterned flow cell. Here we simply define a sequencing library as a collection of index-barcoded and PCR-amplified cDNA fragments which are purified from the mRNA of a particular sample tissue. In a single sequencing run, millions of short sequencing reads are generated (from 350M reads on a single lane of a HiSeq 4000 to 2.5B reads on a single lane of a NovaSeq 6000 S4 flowcell), each of which is annotated with barcodes representing the sample, cell, and molecule from which the read originated. If the reads are aligned to the genome, a read would also be annotated with the genomic location to which it mapped. For example, each mapped read in a 10x Genomics Single Cell 3' v2 Gene Expression Library is thus annotated by four labels: (1) a sample barcode, (2) a cell barcode, (3) a unique molecular identifier (UMI), and (4) a gene ID. In the case of a 10x Genomics single-indexed library, there are four indices for each sample; in what follows, we collapse all four indices into a single sample index for computational, mathematical, and practical considerations. However, as we show in Section 3.2.2, the model we propose allows us to recover the sample index hopping rate at the level of the barcodes. That is, after sample demultiplexing and transcriptome alignment, each read gets associated with a cell-umi-gene-sample label. We note here that by discarding unmapped reads and reads that mapped to genomic regions other than exon bodies, we are further making the implicit assumptions that the large portion of data that we end up retaining contains enough information to determine the index hopping rate and that the data we discard is not characterized by a different mechanism underlying the sample barcode index hopping phenomenon.
In scRNA-seq data, a cell is identified by a unique cell barcode. However, only a fraction (1,000-10,000) of the 100K+ observed cell barcodes correspond to actual cells. In each of these cells, we typically observe anywhere from a minimum of 1,000 (a threshold usually specified manually) up to approximately 100,000 unique molecules, or more, depending on the average read coverage per cell. Furthermore, given that the UMI index is 10bp long, the number of possible UMIs ($4^{10} \approx 10^6$) is one order of magnitude larger than the number of UMIs typically observed in any one cell. Accordingly, a UMI collision in a single cell, when it does occur, would result in two reads from different molecules being considered as originating from the same cDNA fragment (i.e. PCR duplicates), thus reducing the number of real molecules that can be observed in any one cell. To further reduce the chance of collisions, gene ID labels can be considered in order to limit the space of collisions to those UMIs that are mapped to the same region of the genome. Although these genomic regions, each consisting of the union of exons belonging to a given gene, vary in length, the localization of possible collisions to a small region of the genome greatly reduces the probability of UMI collisions.
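As a rough back-of-the-envelope check on the collision argument above (a sketch, not part of the model; the molecule counts are hypothetical), the expected number of colliding UMI pairs under uniform UMI usage follows from the birthday problem:

```python
from math import comb

M = 4 ** 10       # number of possible 10 bp UMIs (1,048,576)

# Hypothetical molecule counts: one heavily sequenced cell vs. the set of
# UMIs mapped to a single gene within that cell.
n_cell = 100_000
n_gene = 500

# Expected number of colliding UMI pairs (birthday-problem approximation,
# assuming UMIs are drawn uniformly at random).
collisions_cell = comb(n_cell, 2) / M   # thousands of potential collisions
collisions_gene = comb(n_gene, 2) / M   # well below one expected collision
```

Conditioning on the gene label shrinks the collision space by orders of magnitude, which is precisely the point made above.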

Modeling Sequencing Reads with a Mixture of Categorical Distributions
We start by observing that after cDNA molecules of a given transcript are amplified, they are subsequently fragmented into 300-400 bp fragments, each containing both a cell barcode and a UMI. During the sample index PCR amplification step, an Illumina Read 2 adapter is ligated to the fragments before they get amplified, resulting in a given number of sequencing reads $n_l \in \mathbb{N}^+$ sharing the same cell-umi-gene-sample label $l = 1, \dots, L$, where $L \in \mathbb{N}^+$ denotes the total number of unique combinations of the labels. If none of the sequencing reads $n = 1, \dots, n_l$ belonging to label $l$ are misassigned, then each read's observed sample index corresponds to the true sample index. For the case when index hopping occurs, we can model the read misassignment process as a mixture of S categorical distributions where each observed read sample index belongs to one sample. See Table 1 for an example toy dataset in which each row corresponds to a single sequencing read with label $l$.
In particular, for the $n$'th sequencing read with label $l$, let $s_{ln} \in \{1, \dots, S\}$ denote a molecule's true sample index and $d_{ln} \in \{1, \dots, S\}$ a molecule's observed sample index. We can formulate the mixture model formally as a two-stage hierarchical sampling process, where we first draw a molecule with an unobserved true sample index $s_{ln}$ that gets assigned to read $n$ with label $l$ from a categorical distribution governed by a label-specific probability parameter vector $\phi_l$. Next we draw the observed sample index from another categorical distribution with a probability parameter vector $p_{s_{ln}}$ whose value is conditional on the value of $s_{ln}$. That is,

$$ s_{ln} \sim \text{Categorical}(\phi_l), \qquad d_{ln} \mid s_{ln} \sim \text{Categorical}(p_{s_{ln}}). \tag{Model 1} $$

In what follows, $s_{ln}$ also denotes the index of the element in the vector $s_{ln}$ for which we observe a one (i.e. $[s_{ln} = 1]$).
The probability parameter vectors of all the S distributions can be stacked together in $P$, the sample hopping probability matrix:

$$ P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1S} \\ p_{21} & p_{22} & \cdots & p_{2S} \\ \vdots & \vdots & \ddots & \vdots \\ p_{S1} & p_{S2} & \cdots & p_{SS} \end{pmatrix}. $$
When P = I, then the probability that a read keeps its sample index of origin is one, and consequently, reads do not hop over to other samples. That is, d ln = s ln for all l and n.
Here $p_{ij}$ denotes the probability that a read from sample $i$ hops to sample $j$. The number of free parameters in $P$ equals $S \times (S - 1)$.
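Model 1 can be sketched as a two-stage categorical simulation. The numbers below (three samples, a 0.9 retention probability, uniform hopping across targets) are illustrative choices, not estimates from the paper:

```python
import random

random.seed(7)

S = 3
p = 0.9                       # probability a read keeps its sample index
p_h = (1 - p) / (S - 1)       # probability of hopping into one given target

# Rows of the sample hopping probability matrix P: row i is the categorical
# distribution of the observed index d for a read with true sample index i.
P = [[p if j == i else p_h for j in range(S)] for i in range(S)]

def observed_index(s: int) -> int:
    """Second sampling stage: draw d | s ~ Categorical(P[s])."""
    return random.choices(range(S), weights=P[s])[0]

# Simulate 10,000 reads whose true sample of origin is sample 0.
reads = [observed_index(0) for _ in range(10_000)]
kept_fraction = sum(d == 0 for d in reads) / len(reads)   # close to p
```

Each row of `P` sums to one, and the fraction of reads retaining index 0 concentrates around `p`, mirroring the read-level misassignment process described above.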

Modeling Read Counts with a Mixture of Multinomial Distributions
In Model 1 there are a total of L parameter vectors $\phi_l$, each taking values on the probability simplex. Note that the $n_l$ reads associated with label $l$ can come from different source samples. This could happen, for example, when reads from two samples are assigned the same label simply by chance. However, such an outcome is extremely unlikely, and by ruling it out we can greatly simplify the modeling framework: we constrain each of the L parameter vectors $\phi_l$ to be an indicator (one-hot) vector. That is, we assume that all the reads with a given cell-umi-gene (CUG) label (i.e. PCR read duplicates of a unique cDNA molecule) can be indexed only with the same sample barcode. Accordingly, each CUG represents one unique molecule, and all sequencing reads with the same label combination correspond to PCR amplification products of that original molecule. More formally, we formulate the assumption as follows.
That is, we restrict $\phi_l$ to belong to the discrete S-simplex, a set with cardinality S whose elements are vectors each consisting of a single 1 and zeros elsewhere.

Table 1: A toy dataset for three multiplexed samples (i.e. S = 3) in which each observation is a sequencing read with label $l$ (i.e. a unique combination of c, u, and g, which denote the cell barcode, UMI, and gene, respectively), a true source sample $s$, and an observed sample $d_i$, $i = 1, 2, 3$. In a typical experiment, a single lane can generate anywhere from 350 million to 2.5 billion sequencing reads.
Distribution of sample read counts. As a result of Assumption 1, the sample index $s_{ln}$ becomes a constant random variable drawn from one of S degenerate categorical distributions, such that all the $n_l$ reads with label $l$ are assigned the same sample index. That is, for each $\phi_l$, the $p_{s_{ln}}$ vectors become identical for all the $n_l$ reads (i.e. $p_{s_l}$). Consequently, we can sum over the $n_l$ categorical random variables to obtain a mixture of multinomials model for data $\{y_l\}$, where $y_l = (y_{l1}, \dots, y_{ls}, \dots, y_{lS})$ denotes the vector of read counts across S samples corresponding to an observed label $l$ and an unobserved source (i.e. donor) sample $s_l \in \mathcal{S} := \{1, \dots, S\}$. Note here that the total number of mapped reads across all samples is given by $N = \mathbf{n}^\top \mathbf{1}$, where $\mathbf{n} = (n_1, \dots, n_L)$ is the L-dimensional vector of read count sums. In what follows, we partition $\mathbf{n}$ by the unique values (i.e. read count sums) that $n_l$ can take, namely $r \in \mathcal{R} := \{1, \dots, R\}$, where $R$ is the maximum value in $\mathcal{R}$. For each $r \in \mathcal{R}$, let $m_r \in \mathcal{M} := \{m_1, \dots, m_R \mid m_r \in \mathbb{N}^+\}$ denote the number of times $r$ is observed in $\mathbf{n}$, such that $\sum_{r=1}^{R} m_r = L$. We denote the empirical distribution of $r$ by $M(\cdot)$. Since the variable $r$ counts the number of PCR duplicated reads, whether hopped or not, that a given unique molecule generates, we refer to $r$ as the PCR duplication level. Accordingly, we can write the mixture of multinomials model conditional on the PCR duplication level $r$ as follows.
$$ s_l \sim \text{Categorical}(\pi_r), \qquad y_l \mid s_l \sim \text{Multinomial}(r,\ p_{s_l}), \tag{2} $$

where $l$ now indexes the $m_r$ observations with PCR duplication level $r$; the vector $p_{s_l}$ denotes the $s_l$'th row of the $S \times S$ matrix $P$ of sample read index hopping probabilities; and the probability vector $\pi_r$ represents the proportion of molecules across the S samples at PCR duplication level $r$.
The mixture model can be viewed more intuitively as a generative process in which we first sample a molecule with source sample $s_l$ according to the categorical model, and then amplify the molecule by generating $r$ PCR read duplicates according to the multinomial model. The number of molecules originating from a given sample with a given number of PCR read duplicates $r$ is determined by the parameter vector $\pi_r$, whereas the number of PCR duplicated reads that end up hopping to other samples is determined by the parameter vector $p_{s_l}$.
The molecular proportions complexity profile. Let $v_r = \sum_{l=1}^{m_r} s_l$ denote the molecule count across the S samples at level $r$ (treating $s_l$ as a one-hot vector) and $v = \sum_{r=1}^{R} v_r = \sum_{r=1}^{R}\sum_{l=1}^{m_r} s_l$ the molecule count across the S samples for the entire data. Since $\mathbb{E}(v_r) = m_r \times \pi_r$, the expected total molecule count is therefore $\mathbb{E}(v) = \sum_{r=1}^{R} m_r \times \pi_r$. If $\pi_r = \pi$ for all $r = 1, \dots, R$, then $\mathbb{E}(v) = L \times \pi$ would correspond to the library complexities of the S samples, where library complexity is defined as the expected number of unique molecules sampled with a finite number of sequencing reads generated in a given high-throughput sequencing run. It is important to note here that in a multiplexed experiment, several samples that usually vary in their molecular complexity are sequenced together. These samples differ in the total number of unique transcripts each one has due to a host of factors, ranging from the presence of varying amounts of RNA that characterize different cell types (e.g. neuronal cells have low RNA content) to accidental errors in library preparation that could cause many cells to break up and lose their endogenous mRNA. That is, even if the total number of available sequencing reads were budgeted evenly over the multiplexed samples, the number of unique molecules detected in the sequencing run could vary widely across the samples. We propose that a more informative picture of the root cause of a sample's low library complexity can be gained by considering the molecule counts conditional on the PCR duplication level $r$, or rather the set of molecular proportions $\Pi := \{\pi_r\}_{r=1}^{R}$, which we term the molecular proportions complexity profile, in order to assess and identify potential problems across all the samples in a sequencing run. In Section 3.3.1 we derive Equation 6 for estimating $\Pi$ from the observed proportions of total reads, and in Section 4 we use molecular proportions complexity profile plots to identify problematic samples in empirical datasets.
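The bookkeeping above, partitioning the L labels by their PCR duplication level $r$ and tallying the counts $m_r$, is a one-liner in code; the label read-count sums below are made up for illustration:

```python
from collections import Counter

# Hypothetical read count sums n_l for L = 10 cell-umi-gene labels
n = [1, 1, 2, 1, 3, 2, 1, 1, 5, 2]

L = len(n)
N = sum(n)          # total number of mapped reads across all samples
m = Counter(n)      # m[r] = number of labels observed at PCR duplication level r

# The m_r values partition the L labels: sum_r m_r = L
assert sum(m.values()) == L
```

Here `m[1] = 5`, reflecting the typical situation in which labels with only one or two reads make up a large share of the data.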
The cardinality of the joint sample space. The conditional model's sample space is a joint space constructed by the Cartesian product of the sample space of the read count variable y and the sample space of the sample-of-origin latent variable s.
The total number of possible read count outcomes conditional on PCR duplication level $r$ is given by the cardinality of the sample space of $Y_r$. Since these outcomes correspond to the coefficients of the $r$'th component of Pascal's S-simplex $\wedge^S_r$, the number of outcomes (i.e. elementary events) is given by the multiset coefficient $\binom{r+S-1}{S-1}$, which for $S = 3$ equals the $(r+1)$'th triangular number $t_{(r+1)}$. Now, since the cardinality of $\mathcal{S}$ is determined by the number of samples S, the cardinality of $(\mathcal{S} \times Y_r)$, which we denote by $e_r$, can be obtained as such:

$$ e_r = S \times \binom{r+S-1}{S-1}. $$
The cardinality $e_r$ increases very rapidly as S or $r$ increases. Note that the number of possible outcomes across all possible values of $r$ (i.e. the cardinality of the joint sample space $(\mathcal{S} \times Y_r \times \mathcal{R})$, which we denote by $e$) is much larger, since it is the sum of all outcomes across a range of $r$ values (i.e. $e = \sum_r e_r$). For example, for $S = 8$ and $r = 1, \dots, 100$, $e$ is approximately 3 trillion! When we consider the multinomial mixture model across a range of $r$ values, it can be helpful to view the model in terms of how it divides and allocates the unit probability measure to a set of $e$ outcomes, as determined by the parameter values that govern the distribution. To illustrate, we list in Table 2 the $e_1 = 9$ possible outcomes when we have $S = 3$ samples for data confined to PCR duplication level $r = 1$. In the table, we also list the probabilities assigned to the possible outcomes under two possible choices of model parameters. Notice that since $s$ is latent, we only see collapsed marginal outcomes in the data we observe; that is, we see 3 outcomes rather than 9. We would like to highlight here an important point that is very relevant to subsequent sections: if we perform a simple tally of read collisions in this data, we would fail to reveal the true fraction of phantom molecules (see Definition 2 below), since we cannot observe the true sample of origin $s$ along with the observable outcome $y$. Section 3.1.4 expands further upon this point.

Table 2: The set of all 9 possible outcomes generated by Model 2 conditional on $r = 1$ and $S = 3$. The probabilities of the outcomes are determined by the parameter $p$, the complement of the hopping rate, and the parameter vector $\pi_r$, representing the proportion of molecules across samples conditional on the read count sum $r$.
Two possible probability assignments are shown: prob (1) corresponds to a model with $\pi_r = (0.6, 0.3, 0.1)$ and $p = 0.9$ (see Assumption 2), whereas prob (2) corresponds to a model with $\pi_r = (0.33, 0.33, 0.33)$ and $p = 0.9$. The variable $f$ represents the number of phantom molecules. The proportion of phantoms, which in this case is 0.1, can be obtained by taking the expectation of $f$ over the induced probabilities and dividing by the total number of molecules.
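The phantom-proportion calculation described in the caption can be reproduced by enumerating the nine joint outcomes of Table 2 (parameter values taken from the prob (1) column):

```python
S = 3
p = 0.9
p_h = (1 - p) / (S - 1)
pi_r = [0.6, 0.3, 0.1]   # molecular proportions at r = 1 (prob (1) setting)

# Enumerate all 9 joint outcomes (true sample s, observed index d) at r = 1
# and accumulate the expected number of phantom molecules.
expected_f = 0.0
for s in range(S):
    for d in range(S):
        prob = pi_r[s] * (p if d == s else p_h)
        expected_f += prob * (0 if d == s else 1)   # a hopped read = 1 phantom

# Each outcome carries exactly one molecule, so the proportion of phantom
# molecules equals E(f) itself, and it works out to 1 - p = 0.1 regardless
# of pi_r, as the prob (2) column illustrates.
```

This also makes the caption's point concrete: at $r = 1$ the phantom proportion depends only on $p$, not on the molecular proportions.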

Corollaries and Definitions
In addition to inducing the specification of Model 2, Assumption 1 entails the following two corollaries.
Corollary 1 (Label Collision Across Samples). A given cell-umi-gene label combination has a zero probability of co-occurring with more than one sample barcode index. That is, reads annotated with the same label combination can only belong to one sample. This implication formalizes a key insight from Griffiths et al. (2018) that informed their molecular exclusion strategy to reduce the effects of index hopping in 10x Genomics experiments.
Corollary 2 (Number of Molecules). The number of total molecules across all S samples equals the number of unique cell-umi-gene label combinations L. In other words, the number of molecules equals the number of rows in a merged data table of sample read counts, which has been fully joined by the combination of cell, umi, and gene key (Table 3).
Corollary 2 implies that since a distinct molecule is defined by a read (or multiple reads) with a unique cell-umi-gene label, any label collision across samples would result in a total number of observed molecules greater than L. Therefore, to avoid any potential naming confusions, we make the following three clarifying definitions.
Definition 1 (Chimeric Observation). A chimera refers to a hybrid creature composed of the parts of more than one animal. We use the analogy to refer to those read count observations for which $y_{lj} \neq r$ for all $j = 1, \dots, S$, or in other words, to those cell-umi-gene labels where we observe reads from more than one sample. We can further classify chimeras by the number of collisions. A $k$-chimera has reads in $k$ sample categories (e.g. $y_l = (4, 1, 0, 2, 0)$ is a 3-chimera when $S = 5$). In contrast, we refer to an observation with no collisions (i.e. a 1-chimera) as a non-chimera.
If there is no sample index misassignment whatsoever, then we expect all L observations to be non-chimeric. When the sample index hopping rate is large, we expect to see a correspondingly large proportion of chimeras, observations where there are collisions of labels across the samples.
Definition 2 (Phantom Molecule). For a given observation with a cell-umi-gene label, we term molecules observed in samples other than the true source sample phantom molecules. We call them phantom since they are not real, existing in the data only as an artifact brought about by index hopping.
It is important to note that a sample index can hop even when we do not observe a label collision. Thus some, or even all, of the reads annotated with a given label may in fact be hopped reads originating from an unobserved true source sample. In such a case, we are unable to determine the true source sample of these molecules. Nonetheless, we can estimate their expected observed proportions.
Definition 3 (Fugue Observation). A dissociative fugue is a personality disorder characterized by unplanned travel, wandering, and even the establishment of a new identity. Similarly, for a given label, when all the reads hop over from an unobserved sample to establish a new identity in other samples, we term such observations fugues, since we observe zero reads associated with the true sample of origin and therefore cannot know where they came from.
Given that we have no reason to believe that reads in a particular sample are characterized by different chemical properties, we can simplify the problem by making the following assumption of uniformity.
Assumption 2 (One parameter to rule them all). The probability of a read keeping its sample index is the same across samples (i.e. $p$), and the probability of a read switching its sample index is the same regardless of either its source or target sample (i.e. $p_h = \frac{1-p}{S-1}$). Assumption 2 reduces the number of parameters in $P$ from a total of $Q = (S-1) \times S$ to just a single parameter.
More importantly, Assumption 2 induces a symmetry on the probabilities we assign to the possible counts that lie within the same $m$-face of the $(S-1)$-simplex, which in turn constitutes the $r$'th component of Pascal's simplex, denoted by $\wedge^S$.
Definition 4 (Sample Index Hopping Rate). Whereas the term $p_h = \frac{1-p}{S-1}$ represents the probability that a sample index hops into one particular target sample, the complement of $p$, namely $1 - p$, refers to the quantity of interest, the Sample Index Hopping Rate, which we abbreviate as SIHR.

Toy Data Example
To illustrate the concepts and definitions in the preceding sections, consider the following toy example with three samples ($S = 3$). The data would thus consist of three-dimensional discrete vector observations $y_l \in \mathbb{N}_0^3$ sampled from a multinomial distribution with total sum of counts $r$, where $\|y_l\|_1 = r$, and a probability parameter

$$ p \in \Delta^2 = \Big\{ (p_1, p_2, p_3) \in \mathbb{R}^3 \ \Big|\ \sum_{s=1}^{3} p_s = 1 \text{ and } p_s \geq 0 \text{ for all } s \Big\}. $$

For concreteness, assume that the source sample $s$ is known for each observation and consider what the data would look like when the sample index hopping rate (i.e. probability) is not zero (e.g. $p_1 = [.9, .05, .05]$), such that the probability concentration bleeds into outcomes with hopped reads (see Table 3). That is, we see the same label that we associate with a unique molecule co-occurring with more than one sample index. When there is no index hopping, for example when SIHR = 0 (e.g. $p_2 = [0, 1, 0]$), we observe reads in only one of the three samples (i.e. non-chimeric non-fugues). In Table 3, we show observations corresponding to only 4 out of the 10 possible outcomes at read count level $r = 3$. However, when $r$ is large, the set of possible outcomes increases exponentially. When $S = 3$, the probabilities of observing these outcomes are given by the corresponding terms of the trinomial expansion of $p$.

Table 3: Toy dataset of read counts and corresponding deduplicated counts (i.e. molecules). The data table shows 4 out of the 10 possible outcomes at read count sum level $r = 3$ for multiplexed data with $S = 3$ samples; see Figure 3. All 4 observations are associated with a unique cell-umi-gene label $l$ each and an unobservable true sample of origin $s = 1$. For each read count vector $y$, a vector of deduplicated read counts, or molecules, $x$ is given. The chimeric number $k$ denotes the number of molecules observed in each outcome and $f$ denotes the number of phantom molecules. Although the chimeric vs. non-chimeric state of an observation can be directly seen from the data, the fugue vs. non-fugue state of an observation cannot be directly inferred, since it depends on knowing the true origin of the reads (i.e. the latent variable $s$). When there is no index hopping, the only possible outcomes are those classified as non-chimeric non-fugues.

| l | r | k | f | s | y1 | y2 | y3 | x1 | x2 | x3 | category |
|---|---|---|---|---|----|----|----|----|----|----|----------|
| 1 | 3 | 1 | 0 | 1 | 3 | 0 | 0 | 1 | 0 | 0 | non-chimeric non-fugue |
| 2 | 3 | 1 | 1 | 1 | 0 | 3 | 0 | 0 | 1 | 0 | non-chimeric fugue |
| 3 | 3 | 2 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 0 | chimeric non-fugue |
| 4 | 3 | 2 | 2 | 1 | 0 | … | … | 0 | 1 | 1 | chimeric fugue |
To better visualize the distribution of possible outcomes, consider the case at read count level $r = 3$ when reads hop from one sample only, that is, when the true sample of origin is $s = 1$ (see Figure 1). When $r = 3$, the number of coefficients, and thus of corresponding outcomes (i.e. elementary events), is given by the $(r+1)$'th triangular number $t = \binom{r+S-1}{S-1}$, which here equals $\frac{(3+1)\times(3+2)}{2} = 10$. That is, we group the 10 outcomes, which make up the sample space $Y_r$, into a set of three categories: (1) three non-chimeric outcomes, each lying on one of the three 0-faces (vertices); (2) six 2-chimeric outcomes, two lying on each of the three 1-faces (edges); (3) one 3-chimeric outcome lying in the middle of the single 2-face (facet), as shown in Figure 1.

Figure 1: All 10 possible outcomes (at read count sum $r = 3$) for three samples ($S = 3$) arranged on the simplex. The sample of origin is $s = 1$, as shown by the arrows. Outcomes can be classified according to the number of observed molecules (i.e. $k$-chimeras) or the number of hopped reads. The four outcomes corresponding to all 3 reads hopping over are called fugues: the two fugue outcomes at the vertices are termed non-chimeric fugues, whereas the other two on the edge are termed chimeric fugues. For a given $m$, out of the $u$ outcomes lying on the $m$-faces, $t$ correspond to non-fugue outcomes, whereas $f = u - t$ are fugues (i.e. those observations where all the reads hop from their source sample).
That is, if in Figure 1 the true sample of origin is $s = 1$, then a 1-read hop corresponds to two possible outcomes: two 2-chimeric observations, each giving rise to one true and one phantom molecule. A 2-read hop corresponds to three possible outcomes: two 2-chimeric observations, each giving rise to one true and one phantom molecule, and a single 3-chimeric observation giving rise to one true and two phantom molecules. A 3-read hop corresponds to four possible outcomes: two 2-chimeric fugue observations, each giving rise to two phantom molecules, and two non-chimeric fugue observations, each giving rise to one phantom molecule. Note, however, that when we consider hopping from all S samples, the number of possible outcomes $e$ increases by a factor of S, so when $r = 3$ we have $e = 30$ possible outcomes.
The number of observations in data Y is typically high (in the hundreds of millions). Each observed vector of read counts can be categorized as a combination of chimeric/non-chimeric and fugue/non-fugue. For each $k$-chimeric observation, we potentially have from one up to $S - 1$ phantom molecules. For example, the observation $y_4$ has two phantom molecules and zero real molecules. If there is indeed no index hopping, then the number of phantom molecules would be zero, whereas the number of real molecules would equal L, the number of unique label combinations. Although the probability that all reads get misassigned to the same sample is negligibly low, the effect can be significant for non-chimeric labels with only one or two reads (a group that makes up a large fraction of observations), as Table 2 shows.
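The classification just described can be expressed as a small helper function (a hypothetical illustration, with samples indexed from 0):

```python
def classify(y, s):
    """Return (k, f, category) for a read count vector y whose true sample
    of origin is s: the chimeric number k, the number of phantom molecules
    f, and the category label from Definitions 1-3."""
    k = sum(1 for c in y if c > 0)       # molecules = samples with reads
    fugue = y[s] == 0                    # no reads left in the true sample
    f = k if fugue else k - 1            # phantoms: every molecule but the true one
    category = ("chimeric " if k > 1 else "non-chimeric ") + \
               ("fugue" if fugue else "non-fugue")
    return k, f, category

# The first three rows of Table 3 (r = 3, true sample s = 1, i.e. index 0):
row1 = classify((3, 0, 0), 0)   # (1, 0, 'non-chimeric non-fugue')
row2 = classify((0, 3, 0), 0)   # (1, 1, 'non-chimeric fugue')
row3 = classify((1, 2, 0), 0)   # (2, 1, 'chimeric non-fugue')
```

Note that `classify` needs the latent `s` as an input, which restates the key limitation above: in real data the fugue/non-fugue status cannot be read off the observed counts alone.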

Modeling the Distribution of k-chimeras
The distribution of $k$-chimeras is essentially the distribution of the total number of non-zero counts of a multinomial random variable. With regard to single cell data, determining whether a sample count is nonzero is equivalent to the deduplication process of obtaining molecule counts from read counts. Statistically, deduplication can be formulated as a thresholded latent variable model where each element of a potentially unobservable random vector $y_l$ is thresholded into an observable Bernoulli random variable $x_{ls}$ with an indicator function $\mathbb{I}$ that detects whether a molecule is observed. That is,

$$ x_{ls} = \mathbb{I}(y_{ls} > 0), \quad s = 1, \dots, S, $$

where $y_l \in \mathbb{N}_0^S$, $p \in [0, 1]^S$, $\|p\|_1 = 1$, and $S \in \mathbb{N}$. The marginal distribution of each element of the multinomial observation $y_l$ is binomial; conditional on the source sample $s_l = s$,

$$ y_{li} \sim \text{Binomial}(r, p_{si}). $$

It follows that the distribution of $x_{li}$ is Bernoulli with a probability parameter that can be obtained by observing that the expectation of an event defined by an indicator function is equal to the probability of that event. That is,

$$ \zeta_i := \Pr(x_{li} = 1 \mid s_l = s) = \mathbb{E}\big(\mathbb{I}(y_{li} > 0)\big) = 1 - (1 - p_{si})^r. $$

Now, given that the Bernoulli random variables (i.e. the elements of $x_l$) can be treated as independent (for $r > S$) but not identically distributed, their sum, which indicates the category of the observation (i.e. $k$-chimera), follows a Poisson Binomial distribution. That is,

$$ k_l = \sum_{i=1}^{S} x_{li} \sim \text{PoissonBinomial}(\zeta_1, \dots, \zeta_S), $$

with a PMF that can be computed via the standard recursive formula for Poisson Binomial probabilities. Note that the independence assumption does not hold when $r < S$, since $\Pr(k_l = 0)$ is then not zero, even though it should be, given that the $r$ reads must belong to at least one sample. Also note that the mean $\mu_r$ of the Poisson Binomial distribution is equal to the sum of the probabilities $\zeta_i$. In this case, the $\pi^{(r)}_i$'s do not affect $\mu_r$ since they cancel out. That said, they do affect the variance $\sigma_l^2 = \sum_{i=1}^{S} (1 - \zeta_i)\zeta_i$, which is maximized when the $\zeta_i$'s, and accordingly the $\pi^{(r)}_i$'s, are equal.
In this case $k_l$ reduces to a sum of dependent Bernoulli variables, which for $r > S$ can be approximated as a sum of independent Bernoulli variables. The distribution of phantoms can be obtained from $k_l$ by noting that, for non-fugue observations, the number of phantoms equals $k_l - 1$; since the true source sample remains observed with probability $1 - (1-p)^r$, the expected fraction of phantoms at PCR duplication level $r$ is therefore approximately

$$ \mathbb{E}\left(\frac{f_r}{k_r}\right) \approx \frac{\mu_r - \big(1 - (1-p)^r\big)}{\mu_r}. \tag{3} $$

Figure 4 plots the functional relationship between the expected fraction of phantoms and the variables $p$, $r$, and $S$.

Distribution of non-chimeras. For the case of $k = 1$, or non-chimeric observations, we can derive a closed form of the distribution by noting that a non-chimera is a count observation $y_l$ for which $y_{li} = r$ for some $i \in \{1, \dots, S\}$. We denote the event of observing a non-chimera by a Bernoulli random variable $w_l := \mathbb{I}(\exists i : y_{li} = r)$ with parameter $p_w$ given by

$$ p_w = \mathbb{E}\big(\mathbb{I}(\exists i : y_{li} = r)\big) = \sum_{i=1}^{S} \Pr(y_{li} = r) = p^r + (S-1)\left(\frac{1-p}{S-1}\right)^r, $$

where $(S-1)\big(\frac{1-p}{S-1}\big)^r = \frac{(1-p)^r}{(S-1)^{r-1}}$ is the probability of observing a non-chimeric fugue observation with $r$ reads and $p^r$ is the probability of observing a non-chimeric non-fugue observation with $r$ reads.
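Under Assumption 2 the quantities derived above have simple closed forms, which a small Monte Carlo simulation can sanity-check (the values of S, r, and p below are arbitrary illustrations):

```python
import random

random.seed(1)

S, r, p = 4, 6, 0.95
p_h = (1 - p) / (S - 1)

# Closed-form non-chimera probability: all r reads land in one sample.
p_w = p ** r + (S - 1) * p_h ** r

# Poisson Binomial mean number of observed molecules, conditional on the
# source sample (by symmetry, the same for every source sample).
mu_r = (1 - (1 - p) ** r) + (S - 1) * (1 - (1 - p_h) ** r)

def draw_k(source: int = 0) -> int:
    """Simulate one label: multinomially scatter r reads, count non-zero samples."""
    weights = [p if j == source else p_h for j in range(S)]
    counts = [0] * S
    for _ in range(r):
        counts[random.choices(range(S), weights=weights)[0]] += 1
    return sum(1 for c in counts if c > 0)

ks = [draw_k() for _ in range(20_000)]
mc_p_w = sum(1 for k in ks if k == 1) / len(ks)   # Monte Carlo non-chimera rate
mc_mu = sum(ks) / len(ks)                         # Monte Carlo mean of k
```

The simulated non-chimera frequency and mean chimeric number agree with `p_w` and `mu_r` up to Monte Carlo error, illustrating why a single scalar `p` suffices to drive both quantities.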

Estimation of the Sample Index Hopping Rate
In the preceding subsection, we were able, with the aid of two simplifying assumptions, to achieve our first goal of building a generative model that probabilistically describes the phenomenon of sample index hopping in multiplexed sequencing reads. Now we turn to our second goal of validating the model and estimating the index hopping rate from empirical experimental data. Given the large number of observations commonly encountered in practice, we proceed to simplify the problem analytically by reducing the mixture model to a more tractable single-parameter model for the distribution of non-chimeric observations. In what follows, we derive the distribution of the sum of non-chimeric observations and show, after making a third simplifying assumption, that the distribution's mean function corresponds to the solution of a differential equation governing a negative growth process. We then show how the index hopping rate can be estimated by formulating the problem as a generalized linear regression model for binomial counts with a log link function that corresponds to the solution of that differential equation.

Distribution of Sum of non-Chimeras
Given that $p_w$ is the same for all observations with the same PCR duplication level, we can sum the $m_r$ Bernoulli random variables at PCR duplication level $r$ to obtain a binomial distribution over the number of non-chimeras $z_r$ conditional on $r$. That is, for a given $r$ we have

$$ z_r \sim \text{Binomial}\big(m_r,\ p_w(r)\big), \qquad p_w(r) = p^r + \frac{(1-p)^r}{(S-1)^{r-1}}. \tag{Model 4} $$

The joint sampling distribution of the non-chimeras at all PCR duplication level values, concatenated as a vector $z$, can be decomposed as follows:

$$ P(z \mid \theta) = \prod_{r=1}^{R} P\big(z_r \mid m_r, p_w(r)\big). $$

Assumption 3 (The Negligible Contribution of the Probability of Observing Non-Chimeric Fugues). Given that the sample index hopping rate we observe in experimental data tends to be very small (SIHR < 0.05), the contribution of the second term (i.e. the probability of non-chimeric fugues) in the parameter of Model 4 is discernible only when $r \leq 2$.
Assumption 3 allows us to simplify Model 4 as follows.
$$ P(z \mid \theta) = \prod_{r=1}^{R} P\big(z_r \mid m_r, \mu(p, r)\big) = \prod_{r=1}^{R} \text{Binomial}\big(z_r \mid m_r,\ p^r\big). $$

Note that the mean function $\mu(r) = p^r$ can be expressed as such.
$$ \mu(r) = p^r = e^{r \log(p)} = e^{\beta r}, $$

which corresponds to the integral curve solution of the differential equation

$$ \frac{d\mu(r)}{dr} = \beta\,\mu(r). $$

This is a negative growth curve since $\beta = \log(p) \in (-\infty, 0)$. We can estimate $\beta$ by formulating the problem as a generalized linear regression for binomial counts with a systematic component $\eta = \beta r$ and a log link function $g(\mu) = \log(\mu)$, such that $\log(\mu) = \beta r$. Note that the standard link function for binomial counts is the logit $g(\mu) = \log\!\big(\frac{\mu}{1-\mu}\big)$; however, it corresponds to the solution of another differential equation, namely

$$ \frac{d\mu(r)}{dr} = \beta\,\mu(r)\big(1 - \mu(r)\big). $$

Although the log link function is not symmetric, $\eta$ is defined only on the negative real line. Furthermore, the log link renders the parameters interpretable, since $\beta$ is simply the log of $p$, the parameter of interest.
The relationship between the number of non-chimeras $z_r$ and the sample index hopping rate $(1 - p)$ at a given PCR duplication level $r$ with $m_r$ observations can thus be formulated as a generalized linear regression model with a log link function as follows:

$$ z_r \sim \text{Binomial}\big(m_r, \mu_r\big), \qquad \log(\mu_r) = \beta r. \tag{Model 5} $$
An estimate of the sample index hopping rate can be obtained from the regression coefficient as such:

$$ \widehat{\text{SIHR}} = 1 - e^{\hat\beta}. $$

We can use the estimate of the sample index hopping rate to obtain the expected number of hopped reads at each PCR duplication level, namely $(m_r \times r \times \widehat{\text{SIHR}})$. We can also plug it into Equation 3 to compute the expected Fraction of Phantom Molecules (FPM) for the entire dataset. That is,

$$ \text{FPM} = \frac{\mathbf{m}^\top \mathbf{f}}{\sum_{r=1}^{R} m_r\,\mu_r}, $$

where $\mathbf{m} = (m_1, \dots, m_R)$, $\mathbf{f} = (\mathbb{E}(f_1), \dots, \mathbb{E}(f_R))$, and $\mu_r$ denotes the expected number of observed molecules per label at level $r$. We note that we could do away with Assumption 3 and estimate $p$ in Model 4 using a likelihood optimization procedure. However, the optimization algorithm might not converge, or it can get stuck at a local optimum. Furthermore, compared to the GLM approach, it is not straightforward to obtain standard errors or any measure of uncertainty by optimizing the likelihood. We provide code in the paper's GitHub repository showing the likelihood optimization approach and its overall similar results to the GLM approach.
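A minimal sketch of the estimation procedure: simulate non-chimera tallies $z_r$ from Model 4 (with Assumption 3 applied, so $\mu(r) = p^r$) and recover $\beta$ by maximizing the one-parameter binomial likelihood directly with a golden-section search, standing in here for a full GLM routine. All tallies and the true rate below are made up:

```python
import math
import random

random.seed(3)

true_p = 0.995                      # true no-hop probability (SIHR = 0.005)
m = {1: 50_000, 2: 30_000, 3: 15_000, 4: 8_000, 5: 4_000}   # labels per level r

# Simulate the number of non-chimeras z_r ~ Binomial(m_r, p^r)
z = {r: sum(random.random() < true_p ** r for _ in range(m_r))
     for r, m_r in m.items()}

def nll(beta: float) -> float:
    """Negative binomial log-likelihood with mean mu(r) = exp(beta * r)."""
    total = 0.0
    for r, m_r in m.items():
        mu = math.exp(beta * r)
        total += z[r] * math.log(mu) + (m_r - z[r]) * math.log(1 - mu)
    return -total

# Golden-section search for the MLE of beta over (log 0.9, 0); the
# likelihood is concave in beta, so the search converges to the optimum.
lo, hi = math.log(0.9), -1e-9
g = (math.sqrt(5) - 1) / 2
for _ in range(100):
    a, b = hi - g * (hi - lo), lo + g * (hi - lo)
    if nll(a) < nll(b):
        hi = b
    else:
        lo = a

beta_hat = (lo + hi) / 2
sihr_hat = 1 - math.exp(beta_hat)   # estimated sample index hopping rate
```

With roughly a hundred thousand simulated labels, the recovered rate lands close to the true 0.005, which is the order of magnitude the paper reports for empirical data.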

Estimating the Sample Barcode Index Hopping Rate
A sample in a 10x Genomics experiment is actually labelled by not one but four different barcodes. By considering the sample labels instead of the individual sample barcodes, the statistical approach outlined in this paper models sample index hopping across samples rather than across individual barcodes. Nonetheless, we can still estimate the sample barcode index hopping rate by noting that when we treat all reads of a given sample as having the same sample index, we are effectively collapsing a 4 × S-dimensional multinomial vector of barcode index hopping probabilities into an S-dimensional vector of sample index hopping probabilities. That is, since the read counts are modeled with a multinomial, we can fuse the four barcode count events within a sample by summing their read counts and corresponding probabilities. Accordingly, the relationship between the two parameters of interest $p_b$ and $p$ can be expressed as such:

$$ p = p_b + 3 \times \frac{1 - p_b}{4S - 1}, $$
where $p_b$ is the probability that a given sample barcode index does not hop. Accordingly, we can obtain an estimate of the sample barcode index hopping rate by expressing $1 - p_b$ in terms of $p$ and $S$:

$$ 1 - p_b = \frac{4S - 1}{4(S - 1)} \times (1 - p). $$
Notice that the multiplicative factor attains its maximum (i.e. 1.75) at S = 2. As S increases, however, it approaches 1 and the two rates become approximately equal.
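The conversion above can be wrapped in a small helper (a hypothetical function; the formula assumes four barcodes per sample and uniform hopping across target barcodes, as derived above):

```python
def barcode_hopping_rate(sihr: float, S: int) -> float:
    """Convert a sample-level hopping rate (1 - p) into a barcode-level
    rate (1 - p_b), assuming 4 barcodes per sample and uniform hopping."""
    factor = (4 * S - 1) / (4 * (S - 1))
    return factor * sihr

# The multiplicative factor peaks at 1.75 for S = 2 and approaches 1
# as the number of multiplexed samples grows.
factor_2 = barcode_hopping_rate(1.0, 2)      # 1.75
factor_96 = barcode_hopping_rate(1.0, 96)    # just above 1
```

Passing `sihr=1.0` simply exposes the multiplicative factor itself, which makes the S = 2 extreme easy to verify.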

Inferring the True Sample of Origin
In the multinomial mixture model (Model 2), we denote the true sample of origin of a sample index associated with a read count by the latent variable $s_l$. One important advantage of formulating the process of sample index hopping as a probabilistic model is that we can now derive, via Bayes' theorem, the posterior probability distribution of the unobservable random variable $s_l$, and thus quantify our residual uncertainty about the true sample of origin given the observed read count, the parameter $p$, and the molecular proportions $\Pi := \{\pi_r\}_{r=1}^{R}$ that govern the mixture model. To derive the posterior distribution of $s_l$, we need plug-in estimates of $p$ and $\Pi$. In Subsection 3.2, we showed how to estimate $p$. In what follows, we derive a formula that allows us to obtain estimates of the molecular proportion parameter vectors $\Pi$ from the distribution of read counts conditional on the PCR duplication level.

Estimating the Molecular Proportions Complexity Profile
Let u_r = (u_{r1}, …, u_{rS}) = Σ_{l=1}^{m_r} s_l represent the unobserved molecule (i.e. UMI) counts at PCR duplication level r across the S samples. The distribution of u_r can be obtained by noting that the sum of m_r identically distributed categorical random variables with parameter vector π_r is a multinomial random variable with the same parameter vector. That is, u_r ~ Multinomial(m_r, π_r). Now, the expected molecule count across samples in the entire dataset is given by E(Σ_r u_r) = Σ_r m_r π_r.
However, we do not observe s l , and the expected total number of observed molecules in the data can be much greater due to the presence of phantom molecules brought about by sample index hopping.
Let v_r = Σ_{l=1}^{m_r} y_l denote the total read count vector at PCR duplication level r. In what follows, it helps to think of the m_r observations as members of S partitions, where the size of partition s is governed by π_rs, the proportion of molecules in Sample s at PCR duplication level r. Now since

E(v_r) = E(Σ_{l=1}^{m_r} y_l) = Σ_{l=1}^{m_r} E_s(E_{y|s}(y_l)),

the expected read count for sample s is

E(v_rs) = (m_r × r) × [π_rs(Sp − 1) + (1 − p)]/(S − 1),

and so we can obtain an estimate of π_rs from v̄_rs, the observed proportion of read counts for sample s at PCR duplication level r. That is,

π̂_rs = [v̄_rs(S − 1) − (1 − p̂)]/(Sp̂ − 1).

Notice that when there is no sample index hopping (i.e. p = 1), the proportion of molecules equals the proportion of reads, that is π̂_rs = v̄_rs. Also, note that for the estimate to be positive, the proportion of reads must satisfy v̄_rs > (1 − p)/(S − 1) for all s = 1, …, S. In empirical data, when this condition is violated, we can set π̂_rs = 10^−6 and renormalize π̂_r accordingly.
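The moment estimator above, including the floor-and-renormalize guard, can be sketched in a few lines (a hypothetical numpy helper; `vbar_r` denotes the observed read-count proportions at one PCR duplication level):

```python
import numpy as np

def estimate_pi_r(vbar_r, p, floor=1e-6):
    """Moment estimator of the molecular proportions pi_r at one PCR
    duplication level: pi_rs = (vbar_rs*(S-1) - (1-p)) / (S*p - 1),
    with non-positive estimates floored and the vector renormalized."""
    vbar_r = np.asarray(vbar_r, dtype=float)
    S = vbar_r.size
    pi = (vbar_r * (S - 1) - (1.0 - p)) / (S * p - 1.0)
    pi = np.where(pi <= 0, floor, pi)   # guard: vbar_rs <= (1-p)/(S-1)
    return pi / pi.sum()
```

When p = 1 the estimator reduces to π̂_r = v̄_r, matching the no-hopping special case noted above.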

The Posterior Distribution of the True Sample of Origin
The posterior distribution can be obtained as follows:

P(s_l | y_l; r, p̂, π̂_r) = P(y_l | s_l; r, p̂) P(s_l; π̂_r) / P(y_l) ∝ Multinomial(y_l | s_l; r, p̂) × Categorical(s_l; π̂_r),

where the plug-in estimate π̂_rs is the proportion of molecules in sample s at PCR duplication level r as computed by Equation 6, and p̂ is the complement of the sample index hopping rate as estimated by Model 5. We label the sample with the maximum posterior probability as the true sample of origin. That is,

s_l^(t) = argmax_s P(s_l = s | y_l; r, p̂, π̂_r).

The corresponding posterior probability of s_l^(t) is simply the maximum over the posterior probability vector, that is,

q_l | y_l = max_s P(s_l = s | y_l; r, p̂, π̂_r).

When π̂_r is uniform, the maximum q_l | y_l can have duplicated values. However, such an outcome is extremely unlikely given the high variability of the molecular proportions that characterize empirical data, and is therefore not considered further.
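The posterior is cheap to evaluate because, under candidate source s, each of the y_s retained reads has probability p̂ and each of the r − y_s hopped reads has probability (1 − p̂)/(S − 1), so the multinomial coefficient is identical for every candidate s and cancels. A minimal numpy sketch (hypothetical helper, not the package's implementation), computed stably in log space:

```python
import numpy as np

def posterior_sample_of_origin(y, p, pi_r):
    """Posterior P(s | y) over the true sample of origin for one outcome y
    (read counts across S samples, summing to r).  The multinomial
    coefficient is common to all candidate sources s and cancels."""
    y = np.asarray(y, dtype=float)
    S = y.size
    r = y.sum()
    # log-likelihood of y under each candidate source s
    log_lik = y * np.log(p) + (r - y) * np.log((1.0 - p) / (S - 1))
    log_post = np.log(np.asarray(pi_r, dtype=float)) + log_lik
    log_post -= log_post.max()          # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```

The argmax of the returned vector gives s^(t), and its maximum gives q.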

Purging Phantom Molecules
For each of the L observations, we label the molecule corresponding to s_l^(t) as a real molecule and all the others (with nonzero reads) as phantom molecules. Such a procedure achieves the minimum possible number of false negatives at the expense of false positives. However, it could well be the case that sacrificing a potential real molecule is a price worth paying to avoid retaining a phantom molecule in the dataset: in scientific applications, a weak signal is most often preferable to an artifactual signal. That said, achieving the minimum possible number of false positives might require discarding the entire dataset. An alternative approach attempts to find the optimal trade-off that minimizes both the false positive and false negative rates.
To optimally minimize the false positive rate, we need to find the optimal cutoff value q* below which observations whose s_l^(t) we labeled as real molecules are relabeled as phantom — in other words, we discard them along with the previously labeled phantom molecules. In order to determine q*, we need to work with the marginal posterior cumulative distribution function of q, which does not have a closed-form expression but which can be expressed as such.
Here, O consists of all the outcomes y that correspond to the coefficients in the r-th component of Pascal's S-simplex ∧^S_r. Whereas P(y = o^(r) | r), given by the multinomial distribution, can be computed numerically for the first dozen values of the PCR duplication level r, and the marginal distribution of r can be approximated by the observed proportions of PCR duplication levels, which we denoted by m̂ in Model 2, r can take large values in empirical data. We can therefore work with the empirical distribution instead and approximate the marginal distribution P(y) by the observed relative frequency distribution of outcomes (e.g. the proportion of y = (1, 2, 0, 0) observations in the data).

Determining the Optimal False Positive Rate
The approach outlined in the previous section can be thought of as a classification task in which we attempt to predict whether a given molecule is a phantom molecule, which we consequently discard, or a real molecule, which we retain. We would like to minimize both the error of deeming a molecule real when it is in fact a phantom (a false positive) and the error of deeming a molecule phantom when it is in fact real (a false negative). In a dataset consisting of L observations and M molecules, the maximum number of real molecules is L. In what follows, it helps to work with proportions relative to the number of observations L. Accordingly, we define u = (M − L)/L as the molecule inflation factor, a lower limit on the fraction of molecules (with respect to L) we can predict to be phantom. Although the total number of molecules in the entire data before index hopping is L, the upper limit on the number of molecules we are able to predict as real is slightly less than L due to the existence of fugue observations in the dataset, each of which contributes one extra true phantom molecule that would otherwise go unaccounted for. We term these molecules fugue phantoms. L would therefore equal the sum of real molecules and fugue phantoms. Accordingly, if g is the proportion of fugue observations in the dataset, which we can obtain from Equation 10 (see below), then there is a total of (u + g) × L true phantom molecules and a total of (1 − g) × L true real molecules in the data. For example, for a dataset with L = 100 observations, a total number of molecules M = 150, and an estimated proportion of fugues g = 0.05, the molecule inflation factor u equals 0.5, so that the total number of true phantom molecules is (0.5 + 0.05) × L = 55 and the total number of real molecules is (1 − 0.05) × L = 95.
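The accounting in this paragraph is simple enough to verify directly; a minimal sketch (with a hypothetical helper name) reproducing the worked example:

```python
def phantom_budget(L, M, g):
    """Split the M molecules in a dataset of L observations into true
    phantoms and true real molecules, given the fugue proportion g."""
    u = (M - L) / L                    # molecule inflation factor
    true_phantoms = (u + g) * L        # hopped-in plus fugue phantoms
    true_real = (1.0 - g) * L
    return u, true_phantoms, true_real
```

The two totals sum back to M, since the u × L excess molecules plus the g × L fugue phantoms exactly offset the g × L observations whose real molecule was lost.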
Given that the number of phantom molecules in a given dataset is given by L × (u + g), there are three courses of action we can choose to take.

No Purging:
We keep the data as it is. By doing so, we label all the phantom molecules as real, which corresponds to L × (u + g) false positives (i.e. FPR = 1): we are in effect incorrectly classifying all true phantom molecules as real.

No Discarding:
We purge the phantom molecules by reassigning the reads to the sample with the largest posterior probability, as described in Subsection 3.3. By doing so, we can drastically decrease the number of false positives while incurring a relatively small number of false negatives.
Discarding Below Cutoff:
We can decrease the false positives further at the cost of a slight marginal increase in false negatives by choosing a cutoff q* below which predicted real molecules are classified as phantom instead. That is, in our classification task, we label the molecule with the maximum posterior probability q above or equal to a given threshold q* as a real molecule, while the remaining molecules are all labelled as phantoms. Furthermore, all the molecules in observations whose corresponding q falls below the cutoff are also labeled as predicted phantoms, even though a proportion of them are real molecules that we cannot confidently classify (i.e. false negatives). Effectively, all molecules we label as phantom are eliminated from the dataset.
In what follows, it is easier to work with the complement of q, which we denote qr = 1 − q: the probability that a molecule originates from a sample other than the one with maximum posterior probability. For a given threshold value qr*, we let o* = F_qr(qr*) := Pr(qr ≤ qr*) denote the proportion of observations whose corresponding predicted real molecules we retain. Here, F_qr is the empirical CDF of qr. Consequently, the fraction of predicted real molecules is o* and the fraction of predicted phantom molecules is 1 + u − o*, since the total fraction of molecules must sum to u + 1. Out of the o* predicted real molecules, the proportion of false positives is given in Table 4.

Table 4: The confusion matrix for classifying molecules as real or phantom. Given a data table of read counts with L observations, the total number of molecules is (1 + u) × L, where u is the molecule inflation factor. The variable o* denotes the proportion of observations that fall above threshold q*, so that o* × L is the number of molecules we label as real. The constant g is the proportion of fugue observations estimated for the given dataset.

Given that the values of qr tend to concentrate close to zero, we provide the following transformed score, which is easier to work with when plotting:

qs = 160 + 10 log10(qr + 10^−16).

Here qs ranges from 0 to 160 as qr ranges from 0 to 1.
Note that the empirical cumulative distribution function of qr is discrete (i.e. its ECDF increases by jump discontinuities only) even though qr takes real values in [0, 1]. Accordingly, the threshold qr* should be set such that very few potential real molecules are discarded for any marginal decrease in false positives. The optimal value of the threshold qr* is the one that simultaneously minimizes both the FP and FN counts; note, however, that discarding lower-quality data leads to a small FP count but also a large FN count.
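The transformed score and the empirical CDF can be sketched as follows (hypothetical numpy helpers, illustrating the definitions above):

```python
import numpy as np

def qs_score(qr):
    """Transformed score: qs = 160 + 10*log10(qr + 1e-16), mapping
    qr in [0, 1] onto [0, 160] for easier plotting."""
    return 160.0 + 10.0 * np.log10(np.asarray(qr, dtype=float) + 1e-16)

def ecdf(qr_values):
    """Empirical CDF F(t) = Pr(qr <= t); a right-continuous step function."""
    x = np.sort(np.asarray(qr_values, dtype=float))
    return lambda t: np.searchsorted(x, t, side="right") / x.size
```

The `side="right"` argument ensures F(t) counts values equal to t, matching the Pr(qr ≤ t) definition.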
Youden's J statistic. We can use Youden's J statistic, J = 1 − (FPR + FNR), to find the cut-off value qr* that maximizes J, where FPR := FP/(u + g) and FNR := FN/(1 − g) = (1 − o* + FP − g)/(1 − g). The cut-off given by Youden's J is optimal in the sense that it minimizes the probability of random guessing when we give equal weight to FPR and FNR, or equivalently to sensitivity and specificity (Youden, 1950). Nonetheless, optimizing J might not be the most appropriate choice for the problem at hand, since the number of phantom molecules is an order of magnitude smaller than the number of real molecules. That is, FPR and FNR are proportions of unbalanced classes, whose denominators depend on u, which is a function of the hopping rate, and on g, which not only depends on the hopping rate but whose estimate can also be biased.
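Selecting the Youden-optimal cutoff reduces to an argmax over candidate cutoffs; a sketch (hypothetical helper, assuming FP and FN are already tabulated as proportions of L at each candidate cutoff):

```python
import numpy as np

def youden_cutoff(qr_grid, FP, FN, u, g):
    """Choose the candidate cutoff qr* maximizing Youden's
    J = 1 - (FPR + FNR), with FPR = FP/(u + g) and FNR = FN/(1 - g),
    where FP and FN are proportions relative to L."""
    FPR = np.asarray(FP, dtype=float) / (u + g)
    FNR = np.asarray(FN, dtype=float) / (1.0 - g)
    J = 1.0 - (FPR + FNR)
    return qr_grid[int(np.argmax(J))]
```

Note how the unbalanced denominators (u + g versus 1 − g) enter directly, which is exactly why a small absolute FP count can still dominate J.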
TORC approach. An alternative approach for minimizing both the FP and FN counts considers the ratio of the marginal increase in FNs over the marginal decrease in FPs as a trade-off to be specified by the researcher via the trade-off ratio cutoff (torc) parameter, which represents the number of real molecules one is willing to incorrectly discard in order to correctly purge one phantom molecule.

Computing the Proportion of Fugue Observations (g)
To obtain the proportion of all fugues g, we need to take into account the proportion of chimeric fugues g_c as well as the proportion of non-chimeric fugues, which we denote by g_0. We can compute g_0 by multiplying the observed relative frequency vector of the PCR duplication levels (i.e. m̂) by the vector of non-chimeric fugue probabilities, whose elements are given by p_f0(r) and which we derived in Section 3.1.5.
The probability of observing chimeric fugues can be obtained as such.
where the set T_r consists of outcomes corresponding to events where all the r reads hopped to two or more target samples: the outcomes (0, 2, 1) and (0, 1, 2) in Figure 3. For r = 1, the set T_r is empty; for r = 2, there are C(S − 1, 2) = (S − 1)(S − 2)/2 such outcomes. For r > 3, it becomes increasingly hard to enumerate and compute the probabilities of all the possible outcomes, especially when S is large as well. However, the value of ((1 − p)/(S − 1))^r becomes negligibly small for r > 3, given that p tends to be very close to 1. Furthermore, empirical data are characterized by a highly skewed distribution of PCR duplication levels, such that the proportion of observations at the first few r values makes up a substantial fraction of all observations. Therefore, even if p_fc(r) were not negligible for r > 3, it would ultimately be weighted down by the respective m_r. Under this assumption, the proportion of fugue observations can be accurately approximated as such.
For example, when p = 0.99, m_1 = 0.20, m_2 = 0.15, and m_3 = 0.1, we obtain g = 0.0020151; in other words, only 0.2% of observations are fugues. If all the data were restricted to PCR duplication level r = 1 (i.e. m_1 = 1), then g = (1 − p) would be the maximum possible proportion of fugue observations.
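Since an observation at PCR duplication level r is a fugue exactly when all r of its reads hop (the non-chimeric and chimeric fugue probabilities together sum to (1 − p)^r), the approximation reduces to a short weighted sum. A sketch (hypothetical helper) reproducing the worked example:

```python
def fugue_proportion(m, p):
    """Approximate g: an observation at PCR duplication level r is a fugue
    when all r of its reads hop, i.e. with probability (1 - p)**r; weight
    by the observed level proportions m = (m_1, m_2, ...), truncated at
    the first few levels since (1 - p)**r decays very fast."""
    return sum(m_r * (1.0 - p) ** r for r, m_r in enumerate(m, start=1))
```

With p = 0.99 the r = 1 term (0.20 × 0.01 = 0.002) dominates; higher levels contribute only 1.51 × 10⁻⁵ combined.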

Overview of Computational Workflow
Here we provide a rough summary of the data analysis steps that comprise the PhantomPurge workflow. The code implementing the workflow and a set of reproducible R Markdown notebooks detailing the data analysis steps are available on the paper's GitHub repository.
Step 1: creating a joined read counts table. In order to create the joined data table of gene-mapped read counts across samples, we load from the molecule_info.h5 file of each sample the data corresponding to the following four fields: cell-barcode, gene, umi, and reads. We then join the data from all samples into a single data table that is keyed by a unique cell barcode-gene-umi combination ID. In addition to the unique label ID, each row of the data table contains read counts across all the S samples.
Step 2: creating an outcome counts table. Given that the number of observations in the joined read counts data table tends to be in the hundreds of millions, we collapse observations with identical outcomes and tally their frequency of occurrence. Working with a data table of outcome tallies drastically reduces the number of computations required. A few grouping variables are also added to the joined data table of outcome tallies for use in subsequent analysis steps. The resulting data table contains the following additional variables: an outcome character variable (e.g. "(1,0,1,1)") to group the observations into unique outcomes, a count variable (n) denoting the number of times a particular outcome was observed in the data, a PCR duplication level variable (r), and a variable (k_chimera) that counts the number of molecules observed in each outcome.
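Step 2 amounts to a group-and-count over outcome vectors; a minimal pure-Python sketch (hypothetical helper, with field names matching the variables described above):

```python
from collections import Counter

def tally_outcomes(read_counts):
    """Collapse a joined read-count table (one row of per-sample read
    counts per CUG) into unique outcomes with their tallies, annotated
    with the PCR duplication level (r) and the number of molecules
    observed in the outcome (k_chimera)."""
    rows = []
    for outcome, n in Counter(map(tuple, read_counts)).items():
        rows.append({
            "outcome": outcome,
            "n": n,                                   # outcome tally
            "r": sum(outcome),                        # PCR duplication level
            "k_chimera": sum(c > 0 for c in outcome)  # molecules in outcome
        })
    return rows
```

On real data this grouping is what shrinks hundreds of millions of rows to a far smaller table of unique outcomes.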
Step 3: estimating the sample index hopping rate. Using the introduced grouping variables, we tallied the number of observations that are chimeric and those that are non-chimeric at each PCR duplication level r. The resulting data table is then used as input to the GLM (Model 5) in order to estimate the sample index hopping rate from the proportion of chimeric observations.
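The paper estimates p with a GLM (Model 5), which is not reproduced here. As a rough illustration of the same idea, the sketch below (a hypothetical helper, not the package's implementation) instead maximizes the binomial likelihood of the per-level non-chimeric counts directly, using the model-implied probability that all r reads of an observation end up in a single sample, p^r + (S − 1)((1 − p)/(S − 1))^r:

```python
import numpy as np

def estimate_p(r_levels, n_nonchimeric, n_total, S, grid_size=100001):
    """Grid-search ML estimate of p (the complement of the sample index
    hopping rate) from counts of non-chimeric observations per PCR
    duplication level.  Model-implied probability of a non-chimeric
    observation at level r:  p**r + (S-1)*((1-p)/(S-1))**r."""
    r = np.asarray(r_levels, dtype=float)[None, :]
    k = np.asarray(n_nonchimeric, dtype=float)[None, :]
    n = np.asarray(n_total, dtype=float)[None, :]
    p = np.linspace(0.9, 1.0 - 1e-6, grid_size)[:, None]
    q = p ** r + (S - 1) * ((1.0 - p) / (S - 1)) ** r
    q = np.clip(q, 1e-12, 1.0 - 1e-12)
    loglik = (k * np.log(q) + (n - k) * np.log(1.0 - q)).sum(axis=1)
    return float(p[np.argmax(loglik), 0])
```

Because q(p) is strictly increasing in p near 1, the likelihood has a unique interior maximum and a fine grid recovers it to high precision.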
Step 4: inferring the true sample of origin and reassigning reads. To compute the posterior probability of the true sample index (the latent variable s) for each unique observed outcome, we first summarized from the data the conditional and marginal distributions of reads. The summaries comprise a vector of PCR duplication level proportions m, which we used to compute the empirical marginal distribution of q, and a set of across-sample read count proportion vectors from which we estimated the proportion of molecules across samples, π̂_r, one at each PCR duplication level r, which in turn we plugged in to calculate the conditional posterior probability q | y. For each outcome, the posterior probabilities of the possible true samples of origin are computed, and the index of the sample with the maximum posterior probability, along with the posterior probability itself, is added to the original joined read count table. We then used the predicted true sample of origin and its associated posterior probability to reassign reads to their predicted sample of origin.
Step 5: determining the cutoff for purging predicted phantom molecules. In order to remove predicted phantom molecules from the data while minimizing the rates of false positives and false negatives, we computed the trade-off ratio (tor) statistic by dividing the marginal increase in FNs by the marginal decrease in FPs at each observed unique qr* value. Predicted real molecules associated with outcomes whose corresponding tor values are greater than the user-specified torc are retained (default value is 3). To purge the data, the read counts are first deduplicated to obtain a table of molecule (i.e. UMI) counts. After purging, the molecule counts are collapsed over gene labels to produce a gene-by-cell UMI-count expression matrix for each of the samples sequenced in the same lane.
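The cutoff rule in Step 5 can be read as a greedy walk over candidate cutoffs; a sketch of one plausible interpretation (hypothetical helper, assuming FN and FP counts are tabulated per candidate cutoff, loosest first):

```python
def choose_cutoff_index(FN, FP, torc=3.0):
    """Walk from the loosest to the strictest candidate cutoff, taking
    each tightening step only while it discards at most `torc` real
    molecules (marginal FN increase) per phantom molecule purged
    (marginal FP decrease); return the index of the selected cutoff."""
    i = 0
    while i + 1 < len(FN):
        marginal_fn = FN[i + 1] - FN[i]
        marginal_fp = FP[i] - FP[i + 1]
        if marginal_fp <= 0 or marginal_fn / marginal_fp > torc:
            break
        i += 1
    return i
```

With torc = 3, a step that would sacrifice, say, 40 real molecules per phantom purged is refused, while cheap early steps are taken.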
Step 6: identifying RNA-containing cells before and after purging. To distinguish cells from empty droplets, the cell-calling algorithm EmptyDrops (Lun et al., 2019) is run. The set of called cells in each sample is used to filter out background cell-barcodes from the deduplicated purged UMI count matrices. The filtered matrices are then saved to file to be used for downstream statistical analyses. By comparing the results of the cell-calling algorithm on purged and unpurged data, one can then examine the extent of contamination by phantom molecules in data that would otherwise not have been purged.

Method's Limitations
The work presented in this paper has the following limitations.
The non-negligible probability of a cell-umi-gene label collision across samples when the number of samples is large. As the number of multiplexed samples increases, Corollary 1 would no longer hold since the probability of observing, in more than one sample, a given cell-umi-gene label combination becomes non-negligible. That said, in single cell experiments, multiplexing more than 16 samples on a single lane is not commonly done given the smaller library size and lower genomic coverage that follows as a result. Furthermore, the adoption of a longer UMI index in the latest 10X Genomics Single Cell 3' v3 assay would further reduce the probability of potential collisions, thus rendering this concern less of a problem.
The sensitivity of the FPR-minimizing cutoff to the ECDF of q. As we have already mentioned, the marginal distribution of q given by Equation 7 has no closed form, nor is it feasible to compute the exact probabilities of all its possible outcomes even when r is not large. As r increases, we tend to observe in the data an increasingly smaller proportion of all possible outcomes, and consequently we would expect the resulting ECDF to deviate from the theoretical distribution. That said, such deviations would only affect the classification measures and the determination of the cutoff o*, whether optimal or not; they would not affect the read reassignment procedure and the initial purging that retains all predicted real molecules (i.e. when we set o* = 1), since the marginal distribution is not involved at this step.
Memory requirements. For data generated on the HiSeq 4000, the analysis workflow does not require computational resources beyond what is found on a regular desktop with 32 GB of RAM. However, for NovaSeq 6000 data with a large number of multiplexed samples (e.g. S > 8), more memory would be needed — up to 150 GB if not more. In particular, the first step of the workflow, joining the data from all samples into a data table keyed by cell barcode, UMI, and gene ID, is the most memory-intensive.
Incorporating the sample barcode indices in the model. In this work, we formulated a model for index hopping that treats the four distinct sample barcodes present in a given sample as identical. Even though we showed in Subsection 3.2.2 that the sample barcode index hopping rate can be derived from the model, we have not used information from the sample barcode indices to reassign reads or to purge phantom molecules. The model can potentially be expanded to incorporate the four sample barcodes, but such an extension would come at the cost of increased, if not prohibitive, computational and memory requirements: the workflow would need to start with the FASTQ files rather than the output of the Cell Ranger pipeline, and the read count data table would need to be joined and keyed across four times the number of samples, potentially requiring more memory than would be available on some clusters. We would like to point out that the main reason we chose to formulate a model for index hopping at the level of samples and not individual sample barcodes is that sample index hopping between reads belonging to the same sample has no effect on downstream analyses: no phantom molecules are generated when a read in a given sample swaps its sample barcode index with one of the other three barcode indices assigned to the same sample.

Data Preprocessing
We applied the proposed model to three publicly available 10X Genomics scRNA-seq datasets: (1) a control non-multiplexed dataset in which each sample was sequenced on a separate lane of a HiSeq 2500; (2) a multiplexed dataset sequenced on a HiSeq 4000; and (3) a multiplexed dataset sequenced on a NovaSeq 6000. The HiSeq 2500 and HiSeq 4000 datasets consist of 8 libraries of mouse epithelial cells, which have also been used previously by Griffiths et al. (2018) for the analysis of sample index hopping. The two datasets were downloaded from the authors' host server using the get_data.sh script available on our paper's GitHub repo (for a more detailed description of the data, please refer to the original publication, Bach et al. (2017)). The third dataset was obtained from the Tabula Muris project's repository. It consists of 16 libraries (i.e. the P7 libraries) of mouse tissue samples, which were pooled and multiplexed on two lanes of an S2 flowcell in a single NovaSeq 6000 sequencing run (Consortium, 2018). BAM files for the 16 samples were downloaded from the NIH's SRA data repository (SRA accession number SRP131661) and converted back to FASTQ files using the 10X Genomics bamtofastq utility. Cell Ranger 3.0.0 was run with the default options on each set of samples multiplexed on the same lane. Figure 5 depicts the molecular proportions complexity profile plots for all four datasets. The same three samples (B1, B2, and D2) in both the HiSeq 2500 and HiSeq 4000 datasets show molecular proportion curves indicative of low library complexity. Indeed, even though the number of reads is approximately equal across the 8 samples, the numbers of molecules differ drastically, by a factor of 100 or more, and are lowest in these three samples (see the analysis notebooks).
Furthermore, whereas the proportion curve of Sample D2 peaks at a moderate PCR duplication level and then tapers off, those of B1 and B2 pick up and maintain a steady proportion over a wide range of high duplication level values, up to r = 1706 and r = 3744 for the HiSeq 2500 and HiSeq 4000, respectively. These trends are indicative of problematic samples; incidentally, Samples B1 and B2 are the two samples that our analysis identifies as having the largest fractions of phantom molecules. As for the NovaSeq 6000 datasets, the curves for Samples P7_1 and P7_8 indicate low-complexity libraries. This is also confirmed by the analysis, since these two samples turned out to have the two lowest numbers of molecules but the two highest fractions of phantom molecules among the 16 multiplexed samples.

Estimates of the Sample Index Hopping Rate
The model fits for the four datasets are summarized in Table 5, with estimates of p along with margins of error corresponding to 99 percent confidence intervals. The table also lists the derived estimates of the sample index hopping rate (SIHR) and the sample barcode index hopping rate (SBIHR). As the figures show, the non-multiplexed (HiSeq 2500) dataset has a much lower SIHR than the three multiplexed datasets, whose estimates show high consistency. Notice how a slight increase in the sample index hopping rate translates into a sizable increase in the molecule inflation factor, and therefore in the proportion of phantom molecules in the dataset. As Figure 6 shows, the GLM fits the four datasets rather well, especially for low r values, for which there is almost negligible missingness. Also notice that the observed chimeric proportions show a downward deviation from the predicted mean trend at high PCR duplication levels, especially for the top 1 percent of observations, which are characterized by noisiness and sparsity. In fact, the downward trend tends to occur right after the cumulative proportion of molecules in the dataset begins to plateau.

Table 5: Estimates of the sample index hopping rate (SIHR) for one non-multiplexed (HiSeq 2500) and three multiplexed sample libraries from two experiments sequenced on two different platforms. In the NovaSeq dataset, the same sample libraries were multiplexed on two lanes. p̂ is the estimate of the model parameter (i.e. the complement of the hopping probability); m.e. is the margin of error corresponding to a 99% confidence interval; u is the molecule inflation factor, a summary measure that represents a lower limit on the number of phantom molecules in the data; g is the proportion of fugue observations; and PPM = (u + g)/(1 + u + g) is the proportion of phantom molecules in the dataset.
The downward trend we observe in the proportion of chimeras seems to indicate a slight decrease in the hopping rate that kicks in at high PCR duplication levels. This deviation from the model-predicted trend could potentially be explained by sample dropout at high r values. That is, when the molecules of only a subset of the samples are PCR-duplicated at high levels, the multinomial model we specified can no longer assume that we have S samples, since sample indices cannot hop to nonexistent sample molecules. Accordingly, we expect the curve to differ when we have S samples compared to when we have fewer. Another aspect of the observed chimeric proportions becomes apparent at very high r values: the observed trend becomes noisy and fluctuating. This can be explained by the high proportion of missingness at those levels. That is, there are finitely many observations at a given PCR duplication level r, and therefore we will observe only a few of the possible outcomes, especially when r > 25, where the number of potential outcomes is in the millions. When this happens, the variability of the mean estimate increases as the number of observations at a given r decreases.

Minimizing False Positive by Discarding Molecules
As Table 6 shows, leaving the data as it is (i.e. no purging) comes at a cost in the form of a large number of false positives (i.e. true phantom molecules that we incorrectly label as real molecules). In contrast, by reassigning reads to the sample with the maximum posterior probability and retaining only the associated molecules (i.e. no discarding), we are able to substantially reduce the number of false positives at the very small cost of removing a few true real molecules. We can further minimize the number of false positives by discarding predicted real molecules that have low posterior probability. Using the default torc = 3, we retain molecules with tor values that are equal to or greater than the maximum tor value not exceeding torc = 3. In the table, the rows corresponding to this approach show the corresponding outcome, the inferred sample of origin s, the posterior probability qr, the false positive (FP) and false negative (FN) counts, and the tor value of the observed outcome with the next higher tor value (the one with a lower false positive count).

The Extent of Contamination by Phantom Molecules
If we had chosen not to decrease the number of false positives by discarding molecules, or in other words if we had set o* = 1, then the number of molecules predicted as phantoms for the entire dataset would be given by (u + g) × L, out of which the number of true phantoms is TN × L and the number of false phantoms is FN × L. However, the proportion of phantoms in individual samples can depart drastically from these marginal measures for the dataset as a whole, since library complexity varies widely across samples, with some samples even consisting mostly of empty cells. Indeed, this is the case for the HiSeq 4000 dataset. As Table 7 shows, the proportion of phantom molecules for this dataset ranges from 0.012 to 0.864, whereas the same proportion ranges from 0.0132 to 0.176 in the two NovaSeq 6000 datasets (see Table 8). In contrast, in the HiSeq 2500 dataset, which had not been multiplexed and whose hopping rate we estimated to be extremely small, the predicted proportion of phantom molecules ranges from 0.0002 to 0.02 (Table 7). In the tables, the number of molecules is given in the second column. The proportion of total molecules predicted as real and therefore retained is given in the real column. The proportion of total molecules predicted as real but discarded because their q values fall below the corresponding tor cutoff is given in the false phantoms column. The proportion of molecules predicted as phantom is given in the phantoms column. The two proportions columns show the discrepancy between library size and library complexity across the samples. The last column measures the extent of PCR amplification bias via RMR, the ratio of the total number of mapped Reads to the total number of Molecules (i.e. library size divided by library complexity).
Note that RMR tends to be a good indicator of which samples will have a high proportion of phantoms and also which will have a high proportion of falsely predicted phantoms (i.e. real molecules that we incorrectly classify as phantoms); for example, Sample P7_8, the sample with the highest RMR in both lanes, also has one of the highest proportions of phantom molecules. Given that a substantial percentage of mapped molecules in the dataset originate from empty droplets, one can argue that what matters in the final analysis is whether there turns out to be a substantial degree of phantom contamination in the subset of molecules originating from RNA-containing droplets. In general, phantom molecules can bring about two potential effects: (1) the emergence of phantom cells and, more commonly, (2) causing cell-calling algorithms to misclassify empty droplets as cells, and vice versa.
To assess the extent and effects of contamination by phantom molecules in this subset, we used the EmptyDrops (Lun et al., 2019) algorithm to call cell-barcodes in order to determine whether a cell-barcode originated from a cell or an empty droplet. For each dataset, we ran EmptyDrops first on the unpurged data and then a second time on the purged data. To compare the results of the two runs, all barcodes classified as either cell or empty by the algorithm were divided into three groups each, as shown in Tables 9 and 10. In the tables, the Consensus columns show the number of cell-barcodes that maintain the same classification, as cell or empty, across the two runs. The Phantom columns show the number of cell-barcodes whose associated molecules were all predicted as phantom molecules (true and false phantoms). The Re-classified columns show the number of barcodes that were reclassified after running the cell-calling algorithm on purged data. For example, we see that for the A1 sample in the HiSeq 4000 dataset, 13 barcodes previously classified as background were re-classified as cells after purging, whereas 105 cell-barcodes that were classified as cells before purging were re-classified as empty after purging. For another sample, the extent of the reclassification is more pronounced: had we called cells without first purging the dataset of phantom molecules, we would have believed that we had called a total of 1023 (= 16 + 1007) cells, whereas re-running cell-calling on the purged data produced no more than 16 cells.
For the Novaseq 6000 datasets, the effects of contamination are less severe than those seen in the Hiseq 4000 dataset. Nonetheless, for many samples the number of barcodes that are re-classified as empty makes up a substantial fraction of total called cells (e.g. Sample P 7_01, with 223 such barcodes). Note that we opted to treat the two lanes separately for the cell-calling stage of the analysis in order to evaluate the consistency of the proposed approach on two datasets that differ only by flowcell sequencing lane. For downstream analysis, data from the two lanes should be combined before cell-calling is performed.
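The three-way grouping of barcodes described above can be sketched as follows. This is a minimal Python illustration, where the function name, the label strings, and the toy barcodes are hypothetical stand-ins; the actual EmptyDrops-based analysis was run in R.

```python
def compare_cell_calls(before, after, phantom_only):
    """Cross-tabulate barcode classifications from two cell-calling runs.

    before / after: dicts mapping barcode -> "cell" or "empty",
                    from the unpurged and purged runs respectively.
    phantom_only:   set of barcodes whose molecules were all predicted
                    phantom (these vanish entirely from the purged data).
    """
    consensus = reclassified = 0
    for bc, label in before.items():
        if bc in phantom_only:
            continue  # counted separately below
        if after.get(bc) == label:
            consensus += 1      # same call in both runs
        else:
            reclassified += 1   # call flipped after purging
    return {"consensus": consensus,
            "phantom": len(phantom_only),
            "reclassified": reclassified}
```

For example, a barcode called "cell" before purging but "empty" after purging lands in the Re-classified group, mirroring the columns of Tables 9 and 10.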

Data Preprocessing
Two 10X Genomics scRNA-seq sample libraries were sequenced in two conditions. In the first condition, the samples were multiplexed on the same lane. In the second condition, the two sample libraries were sequenced on two separate lanes of a HiSeq 4000 (this non-multiplexed condition provides a ground truth for the true sample of origin of each CUG). Cell Ranger 3.0.0 was run with the default options on each of the four samples (the two samples multiplexed on the same lane and the same two samples non-multiplexed). A joined read counts table was created and gene IDs were anonymized. The joined table had 16,547,728 unique CUGs, out of which 9,252,147 CUGs were present in both the multiplexed and non-multiplexed samples. These were retained, and a column containing ground-truth labels was added to the resulting inner-joined data table. Rows corresponding to colliding CUGs were filtered out and the table saved to file (see the reproducible R markdown notebook validation_hiseq4000_1.nb.html for details). Both read counts tables were deposited on Zenodo (https://doi.org/10.5281/zenodo.3267922) and linked on the GitHub project's website.
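The inner join and collision filtering described above can be sketched as follows. This is a minimal pandas illustration with made-up column names and toy rows; the actual preprocessing was performed in R (see validation_hiseq4000_1.nb.html).

```python
import pandas as pd

# Toy read counts table for the multiplexed condition and a toy ground-truth
# table from the non-multiplexed condition (column names are illustrative;
# a CUG is a cell-barcode/UMI/gene triple).
multiplexed = pd.DataFrame({
    "cell": ["c1", "c1", "c2", "c3"],
    "umi":  ["u1", "u2", "u1", "u1"],
    "gene": ["g1", "g1", "g2", "g3"],
    "reads_s1": [5, 0, 3, 1],
    "reads_s2": [0, 4, 1, 0],
})
truth = pd.DataFrame({
    "cell": ["c1", "c1", "c2", "c2", "c3"],
    "umi":  ["u1", "u2", "u1", "u1", "u2"],
    "gene": ["g1", "g1", "g2", "g2", "g3"],
    "sample": ["s1", "s2", "s1", "s2", "s1"],
})

# Retain only CUGs present in both conditions, attaching ground-truth labels.
joined = multiplexed.merge(truth, on=["cell", "umi", "gene"], how="inner")

# A colliding CUG is observed in more than one sample in the ground truth;
# such ambiguous rows are filtered out.
n_samples = joined.groupby(["cell", "umi", "gene"])["sample"].transform("nunique")
filtered = joined[n_samples == 1]
```

In this toy example the CUG (c2, u1, g2) appears in both ground-truth samples and is therefore dropped as a collision.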

Validation
Assumption I. Corollary 1 of Assumption I states that the probability of a CUG collision is zero. To examine the extent of the assumption's validity, we counted the number of colliding CUGs in the inner-joined data table. We found that there are only 52 (real, real) colliding CUGs out of a total of 9,252,147, or in other words, a collision rate of approximately 0.00001 for the case of two samples. Accordingly, for practical purposes, we can consider Assumption I to be valid, given that the collision rate is close to zero and is more than 3 orders of magnitude smaller than the CUG collision rates observed in data contaminated by sample index hopping.

Assumption II. Assumption II states that the hopping probability of a sequencing read's sample index is the same regardless of either the sample index's source or target. To validate the assumption, we counted the fraction of hopped reads originating from each sample over the entire range of PCR duplication levels (see Figure 8). As the figure shows, the estimated sample index hopping rate of 0.00320 ± 0.00001 is very close to the marginal ground-truth estimate of 0.00326 (i.e. the true mean). That said, there are two trends that seem to depart slightly from the model's assumption. First, there is a minor difference between the two samples' proportions of hopped reads, SIHR 1 = 0.00346 vs SIHR 2 = 0.00302 (the subscript denotes the source sample). Second, the ground-truth estimates for both samples start out at higher proportion values but stabilize starting at r = 10 (see the reproducible R markdown notebook validation_hiseq4000_2.nb.html for details). These trends could be sub-sampling artifacts of the filtering step, in which we retained only CUGs that are observed in both the multiplexed and non-multiplexed samples (each of which has twice the number of reads).
Alternatively, these trends could be persistent across experiments, in which case the model we have proposed captures the underlying mechanism of sample index hopping well, but not perfectly. A model governed by several sample-specific hopping rates that are also dependent on the number of duplicated reads would provide better accuracy, but at the cost of intractable computational and mathematical complexity. Even then, the improved accuracy in estimating the hopping rate would not greatly affect the purging procedure, given that the molecular complexity profile plays an important role in assigning reads to their sample of origin.
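Given read-level ground-truth labels such as those provided by the non-multiplexed condition, the overall and per-source hopped-read fractions discussed above can be computed as sketched below. The function names and toy indices are illustrative, not part of the published notebooks.

```python
from collections import defaultdict

def hopping_rate(true_idx, observed_idx):
    """Overall SIHR estimate: fraction of reads whose observed
    sample index differs from the ground-truth index."""
    hopped = sum(t != o for t, o in zip(true_idx, observed_idx))
    return hopped / len(true_idx)

def per_source_rates(true_idx, observed_idx):
    """Per-source SIHR estimates, used to check that the hopping
    probability does not depend on the source sample (Assumption II)."""
    total = defaultdict(int)
    hop = defaultdict(int)
    for t, o in zip(true_idx, observed_idx):
        total[t] += 1
        hop[t] += (t != o)
    return {s: hop[s] / total[s] for s in total}
```

Comparing the per-source rates against the overall rate is the essence of the check behind the SIHR 1 vs SIHR 2 comparison above.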

Comparing and Evaluating Classification Performance
We evaluate the tor cutoff approach by contrasting it with two alternative actions: no purging, in which we leave the data as they are, and no discarding, in which we purge predicted phantom molecules but refrain from discarding real molecules whose inferred samples-of-origin have low posterior probabilities. We also compare the classification performance of the proposed method with the maximum read fraction molecule exclusion approach (Griffiths et al., 2018), in which the sample-of-origin of a molecule is deemed to be the one with the majority of the reads (default ≥ 80%). As Table 11 and Figure 9 show, the torc approach achieves a false positive count of 10,116, which is very close to the lowest possible (i.e. 9,995, the number of fugue molecules in the dataset), while also maintaining a low number of false negatives (7,685). In contrast, although the maximum fraction approach achieves a lower number of false positives, the reduction of 116 false positives is offset by a substantial number of false negatives (16,094, an increase of 8,409 over the torc approach).

Figure 9: Classification performance of four possible actions: (1) no purging; (2) purging without discarding observations (i.e. predicted real molecules); (3) discarding at a tor cut-off equal to 3 (default); (4) discarding using the maximum fraction exclusion strategy (default ≥ 0.8). Points closer to the origin are more optimal. Right: zoomed-in plot showing the observed effects.

Figure 10: Left: cell-gene collisions and corresponding UMI counts in two multiplexed samples before purging. Each point is the gene expression profile associated with a distinct cell-gene label. The blue and green dots correspond to phantom molecules originating from Sample 1 and Sample 2, respectively. The red dots correspond to phantom molecules that originate from both samples. Right: after purging phantom molecules.
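As a point of reference, the maximum read fraction exclusion approach (Griffiths et al., 2018) contrasted above can be sketched as follows. This is a schematic re-implementation based only on the description in the text (assign a molecule to the sample holding at least the threshold fraction of its reads, otherwise discard it), not the authors' actual code.

```python
def max_fraction_assign(read_counts, threshold=0.8):
    """Maximum read fraction exclusion for one molecule.

    read_counts: per-sample read counts for a single CUG.
    Returns the index of the assigned sample, or None if no sample
    holds at least `threshold` of the reads (molecule is excluded).
    """
    total = sum(read_counts)
    best = max(range(len(read_counts)), key=read_counts.__getitem__)
    if read_counts[best] / total >= threshold:
        return best
    return None
```

A molecule with reads split evenly across two samples is excluded outright, which is one source of the higher false negative count reported for this approach.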

Extent of contamination
Although the proportions of hopped reads in Sample 1 and Sample 2 are 0.0035 and 0.0030, respectively, the corresponding proportions of phantom molecules (PPM) are an order of magnitude higher, at 0.085 and 0.037, respectively. Furthermore, the percentage of cells that contain at least one phantom molecule is approximately 20%. For illustration, see Figure 11, which plots, for each affected cell, the number of phantom molecules against the total number of molecules.

Figure 11: The ratio of phantom to total molecules in each affected cell in the multiplexed data (Sample 2). Blue points are cells whose molecules are all phantom.
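The per-cell quantity plotted in Figure 11 (phantom molecules relative to total molecules in each affected cell) can be computed as sketched below; the input format is a hypothetical simplification of the molecule-level annotations.

```python
from collections import defaultdict

def phantom_fraction_per_cell(molecules):
    """Fraction of phantom molecules per cell barcode.

    molecules: iterable of (cell_barcode, is_phantom) pairs,
               one entry per molecule.
    Returns a dict mapping cell barcode -> phantom fraction.
    """
    total = defaultdict(int)
    phantom = defaultdict(int)
    for barcode, is_phantom in molecules:
        total[barcode] += 1
        phantom[barcode] += is_phantom
    return {bc: phantom[bc] / total[bc] for bc in total}
```

Cells whose fraction equals 1.0 correspond to the blue points in Figure 11 (barcodes whose molecules are all phantom).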

Simulation
To compare the performance of the proposed approach with the maximum read fraction molecule exclusion approach (Griffiths et al., 2018) in a case where there are more than two samples (unlike the validation data, which contained only two), we simulated data for S = 8 samples using the molecular complexity profile computed for the mouse epithelial cells Hiseq 4000 data. For computational feasibility, outcomes were simulated for all r values up to a maximum of 15, for a range of four p values. The simulation results show (see Table 4.3 and Figure 12) that, in contrast to the maximum read fraction approach, the tor approach (with a default cutoff value not exceeding 3) gives a false negative proportion that is about 3 times lower while also maintaining a comparable false positive proportion, which progressively becomes relatively smaller as the hopping rate increases. More importantly, as the tor curves in Figure 12 show, the entire range of possible tor cutoff values provides a more optimal trade-off than the maximum read fraction approach (represented by three possible thresholds, mf = (0.6, 0.8, 0.9)) for balancing the false negative and false positive rates.

Figure 12: Performance on simulated data. The false positive and false negative proportions comparing the maximum read fraction and the tor approaches for four evenly spaced values of p (the complement of the hopping probability). Three choices of the maximum read fraction threshold were chosen (0.6, 0.8, 0.9). The default tor cutoff is marked with a plus superimposed on the curve of possible tor values within the 1-10 range. The dots corresponding to the observed tor values are in gradient color.
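A minimal sketch of how the observed read counts for a single molecule can be simulated under the hopping model is given below, assuming, consistent with the model's symmetry (p is the probability that a read retains its true index), that a hopped read lands uniformly on one of the other S - 1 samples. The function name and interface are illustrative, not the simulation code used for Figure 12.

```python
import random

def simulate_molecule(r, true_sample, S, p, rng):
    """Simulate the per-sample read counts of one molecule with r reads.

    Each read independently keeps its true sample index with probability p
    and hops to each of the other S - 1 samples with probability
    (1 - p) / (S - 1), per the uniform-hopping assumption.
    """
    counts = [0] * S
    others = [s for s in range(S) if s != true_sample]
    for _ in range(r):
        if rng.random() < p:
            counts[true_sample] += 1   # read retains its true index
        else:
            counts[rng.choice(others)] += 1  # read hops to another sample
    return counts
```

Repeating this over the empirical distribution of r (the molecular complexity profile) yields a simulated dataset on which purging strategies can be benchmarked.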