A Haystack Heuristic for Autoimmune Disease Biomarker Discovery Using Next-Gen Immune Repertoire Sequencing Data

Large-scale DNA sequencing of immunological repertoires offers an opportunity for the discovery of novel biomarkers for autoimmune disease. Available bioinformatics techniques however, are not adequately suited for elucidating possible biomarker candidates from within large immunosequencing datasets due to unsatisfactory scalability and sensitivity. Here, we present the Haystack Heuristic, an algorithm customized to computationally extract disease-associated motifs from next-generation-sequenced repertoires by contrasting disease and healthy subjects. This technique employs a local-search graph-theory approach to discover novel motifs in patient data. We apply the Haystack Heuristic to nine million B-cell receptor sequences obtained from nearly 100 individuals in order to elucidate a new motif that is significantly associated with multiple sclerosis. Our results demonstrate the effectiveness of the Haystack Heuristic in computing possible biomarker candidates from high throughput sequencing data and could be generalized to other datasets.

Suppose we encounter Dom Dx, which matches any patient with probability px. The match to a of A total patients occurs with probability P (
We represent Pr(Mx,a) as the probability that a given motif Mx is a DOM with a total patients matches. Our calculations show that Pr(Mx,a) = | | (1--px) |A|+|B|--a px a . All the variables in that formula may be quantifiably obtained, except for px. Variable px remains an unknown quantity that varies for each motif Mx. Though we are unable to determine an a priori value for px, we are able to calculate the px quantity that will maximize the value of Pr(Mx,a). These calculations are discussed in the subsequent paragraph.
In order to maximize Pr(Mx,a), we must maximize (1--px) |A|+|B|--a px a relative to px. This is equivalent to maximizing the function f(p) = (1--p) x p y relative to p. With some elementary calculus, it is trivial to show that ( )

= yp y--1 (1--p) x -x(1--p) x--1 p y . When
( ) = 0, equation may be solved for p to show that p = y/(x+y). Setting p to y/(x+y) will maximize the value of f(p). Consequently, setting px to equal a/(|A| + |B|) will maximize the value of

Supplementary Materials 3 S3.1 Unequal Frequency Analysis When Category B Matches Hold a Higher Likelihood
Suppose that a given motif M matches to category A patients with probability p, and also matches to category B patients with probability p2. If the ratio of p2 to p is n, then p2 = np. Thus, if P(p,a,n,|A|,|B|) is the probability that M is a DOM with a matches given the variables p,n,|A|, and|B|, then P(a,n,p,|A|,|B|) = | | (1--np) |B| (1--p) |A|--a p a . We have already considered the instance where n = 1. Let us instead consider the case where n=g and g > 1.
Under such circumstances, gp > p, 1--gp < 1--p, (1--gp) |B| < (1--p) |B| and P(a,g,p,|A|,|B|) ≤ P(a,1,p,|A|,|B|) . As a result, the expected occurrence of a motif where n=g is less than the expected occurrence of a motif where n=1. Thus, if the Mev of a DOM is significantly less than one, then we can immediately conclude that n is not equal to g and that p2 is less than p.

S3.2 Unequal Frequency Analysis When Category A Matches Hold a Higher Likelihood
Suppose that that the heuristic searches through C total motifs and discovers w total DOMs. The Mev for each of the w DOMs is significantly below zero. From this, we conclude that the n probability ratio must be less than one within the DOM motifs. Our next goal is to determine the maximum allowable value for ratio n. To do so, we assume that C2 of the visited C motifs contain a probability ratio of n < 1. We may dive these C2 motifs into k different groups where every group j contains Xj motifs with probability ratio nj. Thus, is equal to the expected number of DOM motifs within group j that match at least a patients. Thus, if all our w DOM motifs match to m or more patients, then we can model the equation for expected DOM occurrence as ! j *V(p,m,nj ,|A|,|B|) = w.
Let us assume that our k groups are ordered such that nj+1 > nj. As a result, n1 represents the lowest--valued probability ratio among the k groups. It is trivial to show that as nj decreases, the value of V(p,m,nj ,|A|,|B|) must increase. Thus, V(p,m,n1 ,|A|,|B|) ≥ V(p,m,nj ,|A|,|B|). From this relationship, we may iteratively derive the following sequence of inequalities:
In this manner, we are able estimate the upper bound of the probability ratio for any discovered DOM motif. Given a DOM with parameters (m,w, C, |A|,|B|), we solve for the root of M(n1) using common numerical methods. The resulting solution for M(n1) = 0 will give us the maximum possible ratio of DOM match probabilities between category B and category A patients.

S3.3 Running an Unequal Frequency Analysis on the MS Sequence Dataset
When implementing the Haystack Heuristic, we assigned 51 MS patients to Category A, and 46 Healthy Controls to Category B. Afterwards, the Heuristic traversed 2,743,571 total motifs and outputted 171 DOMs. Thus, M(n1) = ln(Vm(n1)) + ln(2,743,571) -ln(171) for the results in our analysis. We plotted M(n1) for all values of n1 ranging from .01 to 1 (Supplementary Figure  1). As that plot demonstrates, M(n1) = 0 when n1 is approximately .31. Therefore, the DOM occurrence frequency within Healthy Controls is at most 31 percent of the DOM occurrence frequency within the MS patients.

Supplementary Materials 4
The Prime Motif was present in 104 sequences, which were distributed across 35 patients. As discussed in Section 3.2, the main differentiating factor in the Prime Motif involved the two selective components TNE and DTA. The selective components matched to 204 sequences, which were distributed across 46 individuals. All antibody matches had originated from the IGHV3 germline. The domination of the IGHV3 germline suggested that the Prime Motif might simply be an artifact of IGHV3 expansion among certain motif-matched individuals. In order to disregard this possibility, it was necessarily to analyze IGHV3 expansion across all repertoires. To do so, we first computed the number of non-redundant sequences across each of our 97 repertoires. Non--redundancy was realized by counting each unique CDR3 exactly once, for every processed repertoire. Afterwards, for every repertoire count, we computed the percentage of non--redundant sequences that were assigned to a specific IGHV3 gene. This percentage formed the IGHV3 usage, which we used to rank the repertoires by their levels of IGHV3 expansion. The ordered IGHV3 usage results are displayed in Supplementary Table 5.
The order of the repertoires in Supplementary  Table  5 were clearly not dependent on the presence of the Prime Motif. The first five repertoires within the table all originated from Healthy Controls. Two of these five Healthy Controls contained matches to TNE and DTA. Starting with the sixth position, we began see the Prime Motif matching repertoires, which were evenly distributed across the table. Eight of the final ten repertoires in the table matched to components TNE and DTE. Seven of these repertoires also contained matches to the Prime Motif. Thus, neither the Prime Motif nor the selective components were influenced by IGHV3 usage.

Sequence Counts Per Patient:
The first column contains the patient id, defined as either belonging to the MS or the HC category. The second column contains the total number of database--stored sequences for each patient; some of these sequences may be identical. The third column contains the number of unique, non--redundant CDR3 sequences associated with each patient. The fourth column contains the type of MS where applicable; patients with clinically isolated syndrome (CIS) were included if MRI and cerebrospinal fluid analysis were consistent with a MS diagnosis. The fifth and sixth columns shows each patient's age and sex. The seventh columns shows the patients' therapies (IFNb, interferon beta 1; GA, glatiramer acetate; NTZ, natalizumab; RTX, rituximab); "--" indicates that the patient was not taking an approved disease--modifying therapy for MS.