Genome-wide profiling of humoral immunity and pathogen genes under selection identifies immune evasion tactics of Chlamydia trachomatis during ocular infection

The frequency and duration of Chlamydia trachomatis (Ct) ocular infections decrease with age, suggesting development of partial immunity. However, there is a lack of clear correlates of immunity to Ct infection in humans. We screened sera from a cohort of Gambian children followed for six months against a Ct-proteome microarray. At the genome sequence level, we detected signatures of selection in a population of ocular Ct isolates from Guinea-Bissau. Together these approaches allowed us to highlight the focus of humoral responses and hypothesise new modes of pathogen immune evasion. Children who were susceptible to frequent and/or prolonged Ct infection had a less focused antibody response, including preferential recognition of forty-two antigens. There was evidence of positive and purifying selection across the genome, but little balancing selection. In contrast, most antigens that were associated with susceptibility were under neutral selection. These data suggest an evasion strategy in which Ct presents a large panel of irrelevant antigens to the immune system to block or misdirect protective responses. Development of a focused immune response, possibly induced through vaccination, may be an effective strategy to promote protection against Ct infection.

Maximum likelihood reconstruction of the whole-genome phylogeny of 150 ocular Ct sequences from Asia (purple), Bijagós Islands (black), North Africa (green), Solomon Islands (yellow), Tanzania (blue), The Gambia (red) and USA (orange). Ct sequences were mapped to Ct A/HAR-13 using SAMtools 80. Phylogenies were computed with RAxML from a variable sites alignment using a GTR+gamma model and are midpoint rooted. The scale bar indicates evolutionary distance.

Supplementary Tables
Supplementary Predicted localisations were defined using LocTree, Cello and psortB. Peak expression was defined from Belland et al. 16. Experimental localisation and RT-qPCR/microscopic expression data were determined using a PubMed literature search with three terms: "CT_xxx", "CTA_xxxx" and "Chlamydia trachomatis".
Supplementary Outer membrane 9 Predicted localisations were defined using LocTree, Cello and psortB. Peak expression was defined from Belland et al. 16. Experimental localisation and RT-qPCR/microscopic expression data were determined using a PubMed literature search with three terms: "CT_xxx", "CTA_xxxx" and "Chlamydia trachomatis". Immunogenicity and function were manually assigned from PubMed literature searches. Supplementary Table 3

Clinical cohort study and participants
In

Chlamydia trachomatis antigen microarrays
Ct protein microarray chips were prepared as described previously 32

Proteome microarray normalisation, filtering and clustering
The raw signal intensity data from the microarray were transformed by inverse hyperbolic sine transformation and normalised by mean-centring. These techniques were determined to be the most suitable normalisation steps using relevant rank deviation (RRD) 33. After normalisation, the global median of the data was calculated and individual antigens whose median was lower than the global median were excluded.
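The transformation and filtering steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function names are hypothetical, the data are assumed to be a list of per-sample rows with one column per antigen, and mean-centring is assumed to be applied within each sample.

```python
import math
from statistics import mean, median

def asinh_mean_centre(raw):
    """Apply asinh to every intensity, then subtract each sample's mean.

    raw: list of rows (one per sample), each a list of intensities
    (one per antigen). Centring per sample is an assumption.
    """
    centred = []
    for row in raw:
        transformed = [math.asinh(x) for x in row]
        row_mean = mean(transformed)
        centred.append([x - row_mean for x in transformed])
    return centred

def drop_low_antigens(norm, antigen_names):
    """Exclude antigens whose median is lower than the global median."""
    global_median = median([x for row in norm for x in row])
    keep = [j for j in range(len(antigen_names))
            if median([row[j] for row in norm]) >= global_median]
    kept_names = [antigen_names[j] for j in keep]
    kept_data = [[row[j] for j in keep] for row in norm]
    return kept_names, kept_data
```

The inverse hyperbolic sine behaves like a logarithm for large intensities but is defined at zero and for negative background-subtracted values, which is one common reason to prefer it for array data.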
Several different methods, both extrinsic and intrinsic, were tested to identify positive-negative breakpoints in the distribution of the data. The extrinsic methods tested were the mean of the no-DNA controls, the global mean, and the global mean plus 2 standard deviations.
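As a rough illustration of the three extrinsic cut-offs, the sketch below computes each one from normalised values. The function names and input layout are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def extrinsic_cutoffs(all_values, no_dna_controls):
    """Return the three candidate positive-negative breakpoints."""
    global_mean = mean(all_values)
    return {
        "no_dna_mean": mean(no_dna_controls),
        "global_mean": global_mean,
        # Sample standard deviation (n - 1 denominator) is assumed here.
        "global_mean_plus_2sd": global_mean + 2 * stdev(all_values),
    }

def call_positive(values, cutoff):
    """Values strictly above the cut-off are called positive."""
    return [v > cutoff for v in values]
```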
The intrinsic methods tested were k-means clustering, k-medoids clustering, fuzzy c-means clustering, hierarchical clustering and mixture modelling. The intrinsic methods were tested allowing for 2 to 10 clusters. The average silhouette width of each antigen was used to determine the appropriateness of the cluster configuration. Silhouette ranges from -1 to +1 and is defined by equation 1.

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (1)$$

where i is a data point, a(i) is the average dissimilarity of i with all other data points in its cluster, and b(i) is the lowest average dissimilarity of i to any other cluster of which i is not a member. If i is similar to other data in its cluster, a(i) will be low. If i is also dissimilar to data in the nearest cluster, b(i) will be high. In this case silhouette will tend towards +1. If i is not similar to its cluster, a(i) will be high. If i is similar to data in the nearest cluster, b(i) will be low. In this case silhouette will tend towards -1.
The mean of the silhouette values for each antigen gave the average silhouette width, which was a measure of the appropriateness of the cluster configuration. To determine positive responses, two clusters were identified and, for each antigen, the method with the highest average silhouette width was selected. Data points clustered with the maximum OD/signal intensity point of each antigen were considered positive and those in the opposing cluster negative.
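The silhouette calculation and the positive/negative call can be sketched for a single antigen as below. This is an illustrative sketch only: the function names are hypothetical, dissimilarity is assumed to be the absolute difference between one-dimensional signal values, and a point alone in its cluster is given a silhouette of 0 by convention.

```python
from statistics import mean

def silhouette_values(values, labels):
    """Silhouette (b - a) / max(a, b) for each one-dimensional data point."""
    result = []
    for i, (v, lab) in enumerate(zip(values, labels)):
        own = [abs(v - w) for j, (w, l) in enumerate(zip(values, labels))
               if l == lab and j != i]
        others = set(labels) - {lab}
        if not own or not others:
            result.append(0.0)  # singleton cluster or only one cluster
            continue
        a = mean(own)
        b = min(mean([abs(v - w) for w, l in zip(values, labels) if l == o])
                for o in others)
        denom = max(a, b)
        result.append((b - a) / denom if denom > 0 else 0.0)
    return result

def average_silhouette_width(values, labels):
    """Mean silhouette over all data points for one antigen."""
    return mean(silhouette_values(values, labels))

def positive_calls(values, labels):
    """Points sharing a cluster with the maximum signal are positive."""
    top_label = labels[values.index(max(values))]
    return [lab == top_label for lab in labels]
```

Given the two-cluster labels produced by each candidate method, the labelling with the largest `average_silhouette_width` would be retained for that antigen and passed to `positive_calls`.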

Diversity metrics
Ecological measures of diversity rely on species breadth/richness, the total number of species in a sample, and species diversity, which additionally incorporates the relative abundance of each species. In this analysis antigens were considered as species, abundance as the response to each antigen and the samples were either the complete data set or split into the (2) (3) where S is the number of antigens and p_i is the proportion of antibodies specific to each antigen.
p_i is estimated as the amount of antibody specific to each antigen divided by the total amount of antibody present in each individual.
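To make the quantities S and p_i concrete, the sketch below computes breadth (richness) and one widely used diversity index for a single individual. The choice of the Shannon index is an assumption made for illustration, since the exact equations are not reproduced here; the function names are hypothetical and responses are assumed to be non-negative amounts of antibody.

```python
import math

def proportions(responses):
    """p_i: response to each recognised antigen divided by the total response."""
    total = sum(responses)
    return [r / total for r in responses if r > 0]

def richness(responses):
    """Breadth: number of antigens with a non-zero response."""
    return sum(1 for r in responses if r > 0)

def shannon_diversity(responses):
    """Shannon index, -sum(p_i * ln p_i); assumed here for illustration."""
    return -sum(p * math.log(p) for p in proportions(responses))
```

A response spread evenly over four antigens gives a Shannon index of ln 4, whereas a response concentrated on a single antigen gives 0, which matches the intuition of a more or less focused antibody response.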

Chlamydia trachomatis population genetics metrics
Survey, clinical examination and sample collection methods have been described previously 37 Table 3).

Alignment, assembly and filtering by individual genes
Raw fastq files were aligned and assembled using BWA and SAMtools 41 with A/Har-13 as the reference genome. Variants were called and filtered using BCFtools 42 and VCFtools 43, with a minimum base quality score of 20 (99% accuracy) and a minimum read depth of 10.
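The link between a quality score of 20 and 99% accuracy follows from the Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. The sketch below works through that arithmetic and applies the two thresholds to a single site; the function names are hypothetical and it does not reproduce the BCFtools or VCFtools commands used.

```python
def phred_to_accuracy(q):
    """Convert a Phred quality score to base-call accuracy (1 - error)."""
    return 1 - 10 ** (-q / 10)

def passes_filters(base_quality, read_depth, min_quality=20, min_depth=10):
    """True if a site meets both the quality and depth thresholds."""
    return base_quality >= min_quality and read_depth >= min_depth
```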
Assembled sequences were combined and used as a database in the command-line version of the Basic Local Alignment Search Tool (BLAST+) 44. Individual gene sequences from A/Har-13 were used as queries to extract copies successfully sequenced in the isolates.
Sequences with more than half missing calls were excluded. The MUSCLE algorithm was used for gene alignments 45. Alignments were inspected manually using SeaView 46 and visualisations were output using Geneious 47.
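The exclusion of sequences with more than half missing calls could be implemented along the following lines. This is a sketch under stated assumptions: missing calls are taken to be the characters N and -, and the function names are hypothetical.

```python
MISSING = set("Nn-")  # assumed encoding of missing calls

def missing_fraction(sequence):
    """Fraction of positions in a sequence that are missing calls."""
    if not sequence:
        return 1.0
    return sum(1 for base in sequence if base in MISSING) / len(sequence)

def filter_sequences(sequences, max_missing=0.5):
    """Keep sequences whose missing fraction does not exceed max_missing.

    sequences: dict mapping isolate name to gene sequence.
    """
    return {name: seq for name, seq in sequences.items()
            if missing_fraction(seq) <= max_missing}
```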

Allele frequency-based signatures of selection
Positive selection increases the frequency of a few advantageous mutations, meaning most mutations are kept at a low frequency. In this situation the average difference between pairs of sequences is low but the number of segregating sites is relatively high, thus D is negative.
Purifying selection reduces the frequency of deleterious mutations, meaning mutations occur but do not become common. In this situation the average difference between pairs of sequences is low and the number of segregating sites is high, thus D is negative. Balancing selection maintains alleles at intermediate frequencies. In this situation the average difference between pairs of sequences is high relative to the number of segregating sites, thus D is positive.
Purifying selection keeps alleles from becoming common; there are therefore few high-frequency alleles and H is positive. Positive selection causes alleles to rise towards fixation and thus increase in frequency; in these situations H is negative.
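The sign behaviour of D described above can be checked with a small calculation. The sketch below assumes that D denotes Tajima's D, computed from the mean number of pairwise differences and the number of segregating sites in a gap-free alignment; the function name is hypothetical and this is not the software used in the study.

```python
from itertools import combinations
from math import sqrt

def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences (needs at least 4)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("at least four sequences are required")
    # S: number of alignment columns with more than one allele.
    seg_sites = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if seg_sites == 0:
        return 0.0  # no variation; D is undefined, 0.0 returned by convention
    # pi: mean number of differences over all pairs of sequences.
    n_pairs = n * (n - 1) / 2
    pi = sum(sum(x != y for x, y in zip(s1, s2))
             for s1, s2 in combinations(seqs, 2)) / n_pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)
```

An alignment in which every variant is carried by a single sequence gives a negative D, while one in which every variant splits the sample evenly gives a positive D.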

Molecular and functional characterisation of target genes and proteins
Developmental cycle expression stage for each transcript was based on data and groupings from Belland et al. 16; this grouping was manually assigned to data from Nicholson et al. 53.
Localisation of expressed proteins was predicted using Cello 54