Identifying the essential genes of Mycobacterium avium subsp. hominissuis with Tn-Seq using a rank-based filter procedure

Mycobacterium avium subsp. hominissuis (MAH) is increasingly recognized as a significant cause of morbidity, particularly in elderly patients or those with immune deficiency or underlying lung impairment. Disease due to MAH is particularly difficult to treat, often requiring years of antibiotic therapy. Identification of genes essential for MAH growth may lead to novel strategies for improving curative therapy. Here we have generated saturating genome-wide transposon mutant pools in a strain of MAH (MAC109) and developed a novel computational technique for classifying annotated genomic features based on the in vitro effect of transposon mutagenesis. Our findings may help guide future genetic and biochemical studies of MAH pathogenesis and aid in the identification of new drugs to improve the treatment of these serious infections.


Descriptions of supplementary files:
• Supplementary Figure S1: Simulated data showing correctness of rank-based filter.
• Supplementary Figure S2 -Venn diagram of essential gene predictions for MAC109 by our method vs. TRANSIT.
• Supplementary Text S1: Protocol for preparing sequencing libraries   Figure S1: Simulated data showing correctness of rank-based filter. As described, simulated read counts were generated to test our rank-based filter procedure. A simulation of either 5 sequenced samples ((A) and (B)) or 50 samples ((C) and (D)) was generated. (A) and (C) show histograms of the read counts across all sites before applying the filter procedure in green and the read counts after applying the filter procedure in blue. In red we plot the pmf of our sampling distribution for the null distributed sites.
Performance was assessed by q-q plots in (B) and (D). In green are the empirical quantiles before applying the rank-based filter procedure and in blue are the quantiles after filtering. The red line represents perfect theoretical performance.   Tn-seq Library Preparation-Q5 Sequencing Oligos 10uM sol mar mix consists of an equal parts mix of the following oligos diluted to 10uM in Tris-Cl: 10uM sol adapt consits of one of the following oligos diluted to 10uM in Tris-Cl. Note that if multiplexing, a different oligo for each sample on the run must be used. Each oligo is constructed as one of four "backbones" with a different barcode (index) for distinguishing multiplexed samples. A mutliplexed run should consist of an equal molar mix of each backbone (to increase base diversity) Where "XXX XXX XX" are barcodes used to distinguish mutliplexed samples. Barcodes we've used include the following: "ACA CGA TC", "AGC ATA CA", "TGC TAC GC", "AGT CTA CA", "CTC ATG CA", "AGT TCG GA", "CAT GAT CG", "CGT CAT CA", "CGC GCG GT", "GAC CTG CA", "TGA GAC TT", "AAG TAG AG", "GAG ATC TT", "GCC GAT GT", "TAC GTA CC", "CAG TTC AT", "TCC CTA TA", "GTC CGA TC", "GGT TCA AC", "CAC GTA CT", "TGT CAA GT", "TGT TCC GA", "TTC CGG AG", "CGA TCA AG", "CGA GGA GA", "TGG GGG AC", "TGC CTC GG", "TTA CAA CG", "CGA AAC CC", "ATC ACT CT", "TTC AGC AT", "ACT TGG TG". Note: The Illumina software to demultiplex will organize reads by the reverse complement of the sequence included in the above oligos. 2. Amplify DNA in thermocycler using the following protocol: 1 cycle 30s @ 98 • C 10 cycles 10s @ 98 • C 30s @ 67 • C 30s @ 72 • C 1 cycle 2min @ 72 • C 15 minutes 3. Purify with 18uL SPRIselect beads. Elute into 20uL Tris-Cl. 4. Quantify DNA with Qubit or qPCR (Nanodrop is NOT reliable). DNA concentration should be between 5 -20 nM (and must be at least 2nM for sequencing). If concentration is not high enough then increase input to PCR#2. Note: On Agilent Bioanalyzer traces these libraries gave a small peak at about twice the size of the rest of the library (1000bp vs 500bp). This is believed to be due to some non-specific amplicon (likely linear amplification from the adapter sequence). This secondary peak should not affect bulk quantitation via Qubit, Bioanalyzer, or qPCR. Hypothetical ways to eliminate these peaks include increasing cycles of PCR#2. Note: Qubit is faster and easier for quantitation.

Variable Definitions and Setup
The read counts X i,j for each position (i) and each replicate (j) are assumed to be independent for all i, j. For each j, a subset of X i,j are distributed like NE mutants and therefore are identically distributedbut we don't know which subset. Our goal is to find an approximate subset that will have a distribution approximating the null distribution.

Rank-based filter procedure
First we compute the rank of the read counts at each site within each replicate, averaging ties. Call the ranks r i,j . Then, for each replicate, compute the mean rank of the other replicates (ie leave one out).
Then identify a subset of sites such that the mean ranks are within the expected 40% to 85% range: Finally, assemble the read counts of the null-distributed sites into the setX j , which is, by definition, a sample from an approximately null-distributed set of mutants.
Thus we have applied a rank-based filter to leave a set of samples that are mutually independent and (approximately) null-distributed. For simplicity, we index each element of the setX j such that each read count is represented with the variable Y k,j for k = 1, 2, 3, ..., K. K is the number of insertion sites after applying the rank-based filter. Therefore {Y k,j : k ≤ K} =X j . Y k,j can now be used for fitting the zero-inflated negative binomial distribution (for ES identification) or for computing the empirical cumulative distribution (for GD/GA identification).

Computation of ES p-value at each insertion site
For simplicity, we drop the j index as we perform identical calulations for each replicate using the corre-spondingX j . The zero-inflated negative-binomial distribution is: where 1[...] is the indicator function, Γ() is the gamma function. We will use maximum likelihood estimation to fit the parameters. The log-likelihood is: where K is number of samples and z is the number of samples that are zero (ie z = #{y k = 0}). The gradient is: We solve for estimates of the parameters (Θ,r,p) with the L-BFGS-B algorithm (Scipy v1.2.1). To define a "borderline ES" mutant we scale our parameters such that the mean is 5% of the null model but the dispersion ( 1 r ) is identical. Definep =p (1−0.05)p+0.05 . Thus, the cumulative distribution for a "borderline ES" mutant is: Define a second function: F L (y) = y−1 ỹ=0 N (ỹ;Θ,r,p) We calculate a continuous p-value by sampling from a uniform distribution between F L and F : where q ES i,j is the p-value for site i, replicate j. We have included the second index (j) to emphasize that we will have a p-value for each replicate.

Computation of GD/GA p-value at each insertion site
We define the empirical distribution function for replicate j as: where 1[...] is the indicator function Also define: where G L j (0) = 0. Note that because Y j,k takes discrete values, G L j (y) and G j (y) will generally differ. To calculate a p-value for site i (replicate j) we sample a uniform distribution bounded by these two values: Text S2