Comparison of bias and resolvability in single-cell and single-transcript methods

Single-cell and single-transcript measurement methods have elevated our ability to understand and engineer biological systems. However, defining and comparing performance between methods remains a challenge, in part due to the confounding effects of experimental variability. Here, we propose a generalizable framework for performing multiple methods in parallel using split samples, so that experimental variability is shared between methods. We demonstrate the utility of this framework by performing 12 different methods in parallel to measure the same underlying reference system for cellular response. We compare method performance using quantitative evaluations of bias and resolvability. We attribute differences in method performance to steps along the measurement process such as sample preparation, signal detection, and choice of measurand. Finally, we demonstrate how this framework can be used to benchmark different methods for single-transcript detection. The framework we present here provides a practical way to compare performance of any methods.

Escherichia coli (E. coli) was transformed with DNA plasmids from which RNA and fluorescent protein (yellow circles) were expressed. RNA was labeled with fluorophores using in situ hybridization (represented by red stars). A plasmid lacking the expression cassette was used as a negative control (top, pAN1201). A plasmid expressing eYFP from the J23101 promoter was used as a positive control (pAN1717). A plasmid containing the Ptac promoter was used for measuring induction of eYFP by IPTG (pAN1818).
Supplementary Figure 2. FISH or HCR were used to label RNA. (Left) Traditional fluorescence in situ hybridization (FISH) uses multiple singly-labeled oligonucleotide probes hybridized along the length of a target RNA. In this study, we used 25 probes, each 20 nt in length and labeled with a single TAMRA fluorophore, targeted to the eyfp transcript. (Right) Hybridization chain reaction (HCR) (as described in Reference 6) uses a 2-step approach for labeling RNA. In Step 1, a split-probe is used to target the RNA and serve as an initiator site. In Step 2, TAMRA-labeled hairpins bind the initiator probe and amplify as part of a chain reaction over time. In this study, we used 13 split-probes to target the eyfp transcript. Each pair of split-probes targeted a 52-nt region of RNA.
Top row, left to right: Phase contrast was used for imaging cell bodies, DAPI was used to detect DNA, TAMRA-conjugated probes were used to detect FISH- or HCR-labeled RNA, and eYFP was used to detect protein expression. Bottom row: Phase contrast is shown overlaid with DAPI, TAMRA, or eYFP. Images shown are for pAN1818 grown in the presence of 100 µmol/L IPTG. Scale bar in the bottom right of each image is 2 µm.
Supplementary Figure 6. A total of 12 single-cell measurement methods were used to measure distributions of cellular response across a range of stimulus, in triplicate. For all distributions, colors are used to distinguish biological replicate 1 (orange), biological replicate 2 (green), and biological replicate 3 (purple). Methods varied according to sample preparation (before fixation, treated with Kn or Cm; after fixation, RNA labeled with FISH or HCR), signal detection (microscopy or flow cytometry), and measurand (RNA or protein; for RNA, whole-cell TAMRA fluorescence was analyzed in addition to estimated RNA counts). Dashed lines indicate distributions for which > 5 % of the data lies below 0, after correcting for background signal in flow cytometry.
Supplementary Figure 7. Average AUC can be used to rank overall resolvability between methods. For each method, an average AUC was calculated from all seven values of the AUC profile. This was performed for each replicate. Methods are listed from left to right in order of lowest to highest average AUC.
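As a rough illustration of how an AUC profile and its average could be computed, the sketch below estimates the AUC between adjacent stimulus levels as the probability that a randomly drawn cell from the higher-stimulus distribution exceeds one from the lower-stimulus distribution, with ties counted as one half. The function names and toy values are hypothetical, and this pairwise estimator is an assumption rather than the analysis code used in this study. With 8 levels of stimulus, the profile would contain seven values, as in the caption above.

```python
from statistics import mean

def auc(low, high):
    """Probability that a value from `high` exceeds a value from `low` (ties count 0.5)."""
    wins = 0.0
    for h in high:
        for l in low:
            if h > l:
                wins += 1.0
            elif h == l:
                wins += 0.5
    return wins / (len(low) * len(high))

def auc_profile(distributions):
    """AUC for each pair of adjacent stimulus levels (n levels give n - 1 values)."""
    return [auc(a, b) for a, b in zip(distributions, distributions[1:])]

# Toy single-cell values for 3 stimulus levels (hypothetical, for illustration only).
levels = [[1, 2, 3], [2, 3, 4], [10, 11, 12]]
profile = auc_profile(levels)   # one AUC per adjacent pair of levels
average_auc = mean(profile)     # single number used to rank a method
```

An AUC of 0.5 would indicate that two adjacent levels cannot be distinguished, and an AUC of 1.0 would indicate complete separation.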

Supplementary Figure 8. Cellular response was quantitatively parameterized using Hill functions fit to raw medians of distributions. For all 12 measurement methods, the raw medians (circles) were fit to a Hill function (lines) using weighted nonlinear least squares. The size of each circle is proportional to its weight in the fit, determined by its inverse variance. Error bars indicate 95 % confidence intervals of each median as determined by bootstrapping using 1,000 iterations. Color indicates biological replicate 1 (orange), replicate 2 (green), and replicate 3 (purple).
Supplementary Figure 9. Residual error from Hill fits to raw medians. Residual error from Hill fits to raw medians is plotted for each method. The size of each circle is proportional to its weight in the fit, determined by its inverse variance. Error bars indicate 95 % confidence intervals of each median as determined by bootstrapping using 1,000 iterations. Color indicates biological replicate 1 (orange), replicate 2 (green), and replicate 3 (purple).

Supplementary Figure 10. Cellular response was quantitatively parameterized using Hill functions fit to RPU-normalized medians. For all 12 measurement methods, the RPU-normalized medians (circles) were fit to a Hill function (lines) using weighted nonlinear least squares. The size of each circle is proportional to its weight in the fit, determined by its inverse variance. Error bars indicate 95 % confidence intervals of each median as determined by bootstrapping using 1,000 iterations. Color indicates biological replicate 1 (orange), replicate 2 (green), and replicate 3 (purple).
Supplementary Figure 11. Residual error from Hill fits to RPU-normalized medians. Residual error from Hill fits to RPU-normalized medians is plotted for each method. The size of each circle is proportional to its weight in the fit, determined by its inverse variance. Error bars indicate 95 % confidence intervals of each median as determined by bootstrapping using 1,000 iterations. Color indicates biological replicate 1 (orange), biological replicate 2 (green), and biological replicate 3 (purple).
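The quantities named in the captions above (bootstrapped 95 % confidence intervals of a median, inverse-variance weights, and residuals from a Hill function) can be sketched as follows. This is a minimal illustration under stated assumptions: the four-parameter Hill form shown here is a common convention, not necessarily the exact parameterization used in this study, and all function and parameter names are hypothetical.

```python
import random
from statistics import median, variance

def bootstrap_median(values, iterations=1000, seed=0):
    """Return (median, CI low, CI high, inverse-variance weight) from a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    boot = sorted(median(rng.choices(values, k=n)) for _ in range(iterations))
    low = boot[int(0.025 * iterations)]
    high = boot[int(0.975 * iterations) - 1]
    var = variance(boot)
    weight = 1.0 / var if var > 0 else float("inf")
    return median(values), low, high, weight

def hill(x, offset, amplitude, k_half, n):
    """Assumed Hill form: offset + amplitude * x^n / (k_half^n + x^n)."""
    return offset + amplitude * x**n / (k_half**n + x**n)

def weighted_residuals(params, xs, ys, weights):
    """Residuals (data minus model) scaled by the square root of each weight."""
    return [w**0.5 * (y - hill(x, *params)) for x, y, w in zip(xs, ys, weights)]

# Toy single-cell values for one sample (hypothetical).
cells = list(range(1, 102))
med, ci_low, ci_high, weight = bootstrap_median(cells)
```

A weighted nonlinear least-squares fit would then minimize the sum of squares of `weighted_residuals` over the four Hill parameters, for example with a general-purpose optimizer.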
Supplementary Figure 12. Hill parameters for amplitude of raw and RPU-normalized response functions. (a) Parameters estimated from Hill fits to raw medians. Each of the 5 subpanels is for methods that share the same scale of arbitrary units. (b) Parameters estimated from Hill fits to RPU-normalized medians. Error bars represent 95 % confidence intervals from fits, including bootstrapped uncertainty using 1,000 iterations. P-values indicate statistical significance of Friedman test for the null hypothesis that there is no replicate-to-replicate effect. Colors represent biological replicate 1 (orange), biological replicate 2 (green), and biological replicate 3 (purple).
Supplementary Figure 13. Hill parameters for half-maximal induction for raw and RPU-normalized response functions. (a) Parameters estimated from Hill fits to raw medians. (b) Parameters estimated from Hill fits to RPU-normalized medians. Error bars represent 95 % confidence intervals from fits, including bootstrapped uncertainty using 1,000 iterations. P-values indicate statistical significance of Friedman test for the null hypothesis that there is no replicate-to-replicate effect. Colors represent biological replicate 1 (orange), biological replicate 2 (green), and biological replicate 3 (purple).
Supplementary Figure 14. Hill parameters for effective cooperativity from raw and RPU-normalized response functions. (a) Parameters estimated from Hill fits to raw medians. (b) Parameters estimated from Hill fits to RPU-normalized medians. Error bars represent 95 % confidence intervals from fits, including bootstrapped uncertainty using 1,000 iterations. P-values indicate statistical significance of Friedman test for the null hypothesis that there is no replicate-to-replicate effect. Colors represent biological replicate 1 (orange), biological replicate 2 (green), and biological replicate 3 (purple).
Supplementary Figure 15. Hill parameters for offset from raw and RPU-normalized response functions. (a) Parameters estimated from Hill fits to raw medians. Each of the 5 subpanels is for methods that share the same scale of arbitrary units. P-value in bottom panel was calculated using all 5 panels, as described below. (b) Parameters estimated from Hill fits to RPU-normalized medians. Error bars represent 95 % confidence intervals from fits, including bootstrapped uncertainty using 1,000 iterations. P-values indicate statistical significance of Friedman test for the null hypothesis that there is no replicate-to-replicate effect, using rankings of replicates within each of the 12 methods. Colors represent biological replicate 1 (orange), biological replicate 2 (green), and biological replicate 3 (purple).
Supplementary Figure 16. Friedman test for relative bias between methods. Within each replicate, the estimates of each of the four Hill parameters from the twelve methods were ranked in order from lowest to highest. A random ranking of methods was used as the null hypothesis. For all replicates, p-values shown within each plot indicate the statistical significance of divergence from this null hypothesis. Lower p-values indicate stronger evidence of relative bias between methods.
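The ranking procedure described in this caption corresponds to the Friedman rank statistic, which can be sketched as below. Here each block is one replicate and each column is one method; the function name and toy values are hypothetical, ties are assumed to be absent, and only the statistic is computed. In practice, a library routine such as SciPy's `scipy.stats.friedmanchisquare` could be used to obtain both the statistic and a p-value.

```python
def friedman_statistic(blocks):
    """Friedman chi-square statistic.

    blocks: list of n blocks (e.g. replicates), each a list of k values
    (e.g. one Hill parameter as estimated by each of k methods).
    Assumes there are no tied values within a block.
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0] * k
    for block in blocks:
        # Rank the k methods within this block from lowest (1) to highest (k).
        order = sorted(range(k), key=lambda j: block[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Toy example: 3 replicates x 4 methods with identical ordering in every replicate.
toy = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.2, 2.9, 4.3],
    [0.9, 1.8, 3.1, 3.9],
]
q = friedman_statistic(toy)
```

When every replicate ranks the methods identically, the statistic reaches its maximum of n(k - 1); when rankings are random, it is expected to be close to k - 1.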
Supplementary Figure 17. Effect of antibiotic treatment on flow cytometry measurement of fluorescent protein prior to in situ hybridization. Diagonal line indicates perfect agreement between methods. Measurement performance is attributed to sample preparation by comparing measurements that differ in antibiotic treatment prior to in situ hybridization. Aside from this difference in sample preparation, these methods share consistent processes for signal detection and measurand. Resolvability is quantitatively attributed to sample preparation by comparing AUC profiles. Gray numbers in each corner indicate how many AUC values were larger for one method compared to the equivalent AUC for the other method. Numbers within scatter plot indicate which pair of adjacent stimulus concentrations are used to calculate AUC. Color is used to represent biological replicate 1 (orange), biological replicate 2 (green), and biological replicate 3 (purple). The two-sided p-value for a sign test is shown within the plot.
Supplementary Figure 18. Comparison of flow cytometry detection of fluorescent protein before versus after in situ hybridization. Measurement performance is attributed to sample preparation by comparing measurements that differ with regard to whether they were measured before or after in situ hybridization. Aside from this difference in sample preparation, these methods share consistent measurement steps for signal detection (flow cytometry) and measurand (protein). Resolvability is quantitatively attributed to sample preparation by comparing AUC profiles. Diagonal lines indicate perfect agreement between methods. Gray numbers in each corner indicate how many AUC values were larger for one method compared to the equivalent AUC for the other method. Colored numbers within each scatter plot indicate which pair of adjacent stimulus concentrations are used to calculate AUC. Color is used to represent biological replicate 1 (orange), biological replicate 2 (green), and biological replicate 3 (purple). The two-sided p-values for a sign test are shown within each plot.
Supplementary Figure 19. Effect of calibrating whole-cell RNA fluorescence intensity to estimate RNA counts. Measurement performance is attributed to measurand by comparing measurements of whole-cell RNA fluorescence versus estimated RNA counts per cell following calibration of microscopy data. Aside from this difference in measurand, these pairwise comparisons share consistent steps for sample preparation and signal detection (microscopy). Resolvability is quantitatively attributed to measurand by comparing AUC profiles between whole-cell RNA fluorescence and estimated RNA counts per cell for FISH (left) and HCR (right). In each plot, diagonal lines indicate perfect agreement between methods. Gray numbers in each corner indicate how many AUC values were larger for one method compared to the equivalent AUC for the other method. Colored numbers within each scatter plot indicate which pair of adjacent stimulus concentrations are used to calculate AUC. Color is used to represent biological replicate 1 (orange), biological replicate 2 (green), and biological replicate 3 (purple). The two-sided p-values for a sign test are shown within each plot.
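The two-sided sign test referred to in the captions above can be computed directly from the two corner counts (how many AUC values were larger for each method). The sketch below is a minimal exact version that assumes ties have already been dropped; the function name is hypothetical.

```python
from math import comb

def sign_test_two_sided(n_greater, n_less):
    """Exact two-sided sign test p-value under the null of equal probability.

    n_greater, n_less: counts of paired comparisons won by each method
    (ties are assumed to have been excluded beforehand).
    """
    n = n_greater + n_less
    k = min(n_greater, n_less)
    # Probability of observing k or fewer wins for the less frequent sign.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)

p_clear = sign_test_two_sided(5, 0)   # all five comparisons favor one method
p_even = sign_test_two_sided(3, 3)    # comparisons split evenly
```

With seven AUC values per replicate and three replicates, up to 21 paired comparisons would enter each test.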
Supplementary Figure 20. Bias between FISH and HCR is consistent with a relative difference in hybridization efficiency. Estimated hybridization efficiency ratios based on the ratio of HCR to FISH for (Fano − 1) (left), and based on the ratio of HCR to FISH for mean RNA count per cell (right). Horizontal dashed black lines depict the respective median hybridization ratios for point estimates based on each type of sample statistic. Horizontal solid black line depicts the overall median hybridization ratio of point estimates from both sample statistic types, 0.359, which was used as our estimated hybridization efficiency ratio of HCR to FISH.
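One way to see why ratios of both the mean and (Fano − 1) can serve as estimates of a relative hybridization efficiency is the following sketch. It assumes, as a simplification not stated in the caption, that each of the N transcripts in a cell is detected independently with probability p, so that the observed count M is a binomially thinned version of N. The symbols N, M, p, and F are introduced here for illustration only.

```latex
\begin{aligned}
M \mid N &\sim \mathrm{Binomial}(N, p), \\
\mathbb{E}[M] &= p\,\mathbb{E}[N], \\
\mathrm{Var}[M] &= p^{2}\,\mathrm{Var}[N] + p(1-p)\,\mathbb{E}[N], \\
F_M - 1 &= \frac{\mathrm{Var}[M]}{\mathbb{E}[M]} - 1
         = p\left(\frac{\mathrm{Var}[N]}{\mathbb{E}[N]} - 1\right)
         = p\,(F_N - 1).
\end{aligned}
```

Under this assumption, both the mean count and (Fano − 1) scale linearly with p, so for two labeling methods applied to the same underlying sample, the ratio of either statistic would estimate the ratio of their detection probabilities.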

Supplementary Tables
Supplementary Table 1. Growth protocol
Growth protocol is presented as recommended by the Minimum Information Standard for Engineering Organism Experiments (MIEO) 3.

MIEO Category      Factor                Level
Media components   Potassium phosphate   3 g/L
                   Disodium phosphate    6.78 g/L
                   Sodium chloride       0.5 g/L
                   Ammonium chloride     1.0 g/L
                   D-glucose             4.0 g/L
                   Casamino acids        2.0 g/L
                   Calcium chloride      0.

To implement BRASS for comparing any measurement process, consider the following step-by-step protocol.
1. Define performance test(s) to assess measurement performance. Ideally, performance tests are quantitative to enable statistical comparison between different measurements, but qualitative comparisons can be used as well. You can choose more than one.
a. Example performance tests (T)
In this study, we were interested in comparing:
T1 - Resolvability, assessed using the area under the ROC curve (AUC)
T2 - Relative bias between single-cell measurements, assessed by fitting data to models of cellular response and comparing the parameters estimated using different methods
1) Dose-response estimated from the Hill equation
2) Transcription kinetics estimated from burst size and frequency
2. Identify measurement steps of interest that can be evaluated using the performance tests defined in step 1. Identify steps by listing out components of each measurement pipeline, for example using an Ishikawa "Fishbone" diagram.
a. Examples for Sample Preparation (P)
In this study, we were interested in comparing two different sample preparation strategies for labeling RNA (FISH and HCR). We were also interested in comparing different antibiotic treatments prior to flow cytometry detection of protein (Kn versus Cm).
3. Prioritize measurements according to how many can be practically executed in parallel from a single starting sample.
a. Example for selection of measurements
Based on exploratory measurements, we found that we needed ~0.5 mL of bacterial culture for at least two different preparations of flow cytometry measurements prior to fixation. We also needed 6 mL of bacterial culture for FISH, and another 6 mL of bacterial culture for HCR, to provide enough material for microscopy and flow cytometry following RNA labeling. Additional measurands did not require any additional starting material, since they were detected from within cells or generated during analysis. So, 12.5 mL of starting culture would be needed in total for all measurements. We chose to grow 20 mL of culture, which can easily be performed in a 50 mL Falcon tube and which provided a sufficient quantity of starting sample for all subsequent measurements.
4. Prioritize samples according to what is needed to assess performance, and how many can be practically executed in a single experiment.
a. Considerations of performance tests for sample selection
i. Resolvability - a minimum of 2 different levels of response is required to assess a measurement's ability to resolve change in stimulus.
ii. Calibrating whole-cell fluorescence to estimate single-transcript count - a minimum of 4 concentrations at low induction is needed for estimating the initial slope of the calibration curve 4.
iii. Parametric evaluations using Hill functions - initial, exploratory experiments suggested that 8 levels of stimulus spanning the dynamic range of response would be sufficient to fit a Hill function.
iv. Normalization of cellular response to Relative Promoter Units using a living reference material - in addition to 8 levels of stimulus, a negative control (using a plasmid lacking the expression cassette) and a positive control (constitutive expression from J23101) are required for background subtraction and normalization.
v. The above constraints would be satisfied using a total of 10 samples: 1 negative control, 1 positive control, and 8 different levels of induction.
b. Considerations of experimental constraints for sample selection
Based on previous experiments, we found that manually preparing and imaging ~20 samples at once was a practical limit for in situ hybridization. So, we chose 10 different samples, including 1 negative control, 1 positive control, and 8 concentrations of induction. This way, after the samples were split for labeling by FISH or HCR, there would be a total of 20 samples to prepare and image (10 for FISH and 10 for HCR). The samples would also provide the requisite conditions to assess bias in dose-response.
5. Design and execute the experiment including all samples and measurements identified in steps 1-4. Ensure that sufficient reagents are available for the requisite number of replicates, in order to eliminate batch-dependent variability between replicates. In this study, we chose to execute 3 biological replicates.
6. Analyze the data. In this step, various analyses can be performed to generate multiple "measurands". For example, calibrating RNA fluorescence per cell to estimate RNA abundance per cell was performed during analysis, and each of these measurands lends itself to a variety of performance tests.
7. To attribute measurement performance to measurement processes, compare performance tests between measurements in a pairwise fashion that keeps all processes the same except for the process of interest. Fractional-factorial design of experiments can be used to more efficiently explore how measurement performance can be attributed to different steps, and how to account for confounding effects between measurement pipelines that differ by multiple steps 3,5.
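The pairwise comparison in step 7 can be sketched as a search for pairs of methods that differ in exactly one measurement step. In the example below, each method is described by its sample preparation, signal detection, and measurand; the method labels are an illustrative subset chosen for this sketch rather than the full list of 12 methods, and the function name is hypothetical.

```python
from itertools import combinations

STEPS = ("sample preparation", "signal detection", "measurand")

def single_step_pairs(methods):
    """Return (method_a, method_b, differing_step) for pairs that differ in exactly one step."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(methods.items(), 2):
        differing = [step for step, x, y in zip(STEPS, a, b) if x != y]
        if len(differing) == 1:
            pairs.append((name_a, name_b, differing[0]))
    return pairs

# Illustrative subset of methods (labels are hypothetical).
methods = {
    "Kn flow protein": ("Kn", "flow cytometry", "protein"),
    "Cm flow protein": ("Cm", "flow cytometry", "protein"),
    "FISH microscopy RNA fluorescence": ("FISH", "microscopy", "RNA fluorescence"),
    "FISH microscopy RNA count": ("FISH", "microscopy", "RNA count"),
    "HCR microscopy RNA count": ("HCR", "microscopy", "RNA count"),
}
pairs = single_step_pairs(methods)
for name_a, name_b, step in pairs:
    print(f"{name_a} vs {name_b}: attributable to {step}")
```

Pairs that differ in more than one step are excluded here because a difference in performance could not be attributed to a single step without further design, such as the fractional-factorial approach mentioned in step 7.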

Supplementary Note 3: Attribution of performance to sample preparation (antibiotic treatment)
Before measuring fluorescent protein by cytometry, antibiotics are added to freshly harvested cells to halt translation, so that fluorescence measurements reflect levels of gene expression at the time of harvest. Different antibiotics can be used for this purpose; however, differences in measurement that might arise from antibiotic choice are typically not studied.
To assess the effect of antibiotic choice, we compared two antibiotic treatments of live cells prior to flow cytometry measurement of fluorescent protein. We found that kanamycin (Kn, P3) and chloramphenicol (Cm, P4) generally exhibit good agreement in performance, with slight differences in resolvability and relative bias (Supplementary Figs. 12-15 and 17). Systematic differences in all four Hill parameters indicated relative bias between the two preparations, although the differences were typically small. Across the entire range of induction, Cm-treated samples consistently had slightly better resolvability than Kn-treated samples, although the difference was very small. These differences in resolvability and relative bias are presumably due to antibiotic type, although they could also be due to the timing of the measurement, because Cm-treated samples were measured after Kn-treated samples in all three replicates.