Ecologically coherent population structure of uncultivated bacterioplankton

Bacterioplankton are main drivers of biogeochemical cycles and important components of aquatic food webs. While sequencing-based studies have revealed how bacterioplankton communities are structured in time and space, relatively little is known about intraspecies diversity patterns and their ecological relevance. Here, we use the newly developed software POGENOM (POpulation GENomics from Metagenomes) to investigate genomic diversity and differentiation in metagenome-assembled genomes from the Baltic Sea, and investigate their genomic variation using metagenome data spanning a 1700 km transect and covering seasonal variation at one station. The majority of the investigated species, representing several major bacterioplankton clades, displayed population structures correlating significantly with environmental factors such as salinity and temperature. Population differentiation was more pronounced over spatial than temporal scales. We discovered genes that have undergone adaptation to different salinity regimes, potentially responsible for the populations’ existence along with the salinity range. This in turn implies the broad existence of ecotypes that may remain undetected by rRNA gene sequencing. Our findings emphasize the importance of physiological barriers, and highlight the role of adaptive divergence as a structuring mechanism of bacterioplankton species.


Calculation of population genetic parameters in POGENOM
1.1. Nucleotide diversity, Nucleotide diversity ( ), sometimes called heterozygosity, is defined as the average number of nucleotide differences per site between any two sequence reads chosen randomly from the sample population (0 ≤ < 1). POGENOM calculates of a single locus according to Schloissnig et al. 23  where x i,Bj is the count of nucleotide B j at position i in the genome for the sample, and c i is the total coverage (sequence depth) at position i for the sample. To calculate a genome-wide , is averaged over all loci by summing all i and dividing by the genome size. Loci not included in the VCF file are assumed to lack diversity (to have i = 0; but see the normalised below). To calculate a gene-wise , is instead averaged over all loci within the gene, including the start codon but excluding the stop codon. The calculation also works for alleles >1 bp (in case a variant caller was used that output haplotypes), then instead basing the calculations on counts of haplotypes (oligomers) rather than nucleotides for loci where alleles >1 bp are reported in the VCF file. POGENOM can also split counts of haplotypes into counts of individual nucleotides, if this is preferred (this is the default behaviour). When the splitting is applied, POGENOM will remove loci of individual positions resulting from the splitting that do not contain any variants (this may be the case for internal haplotype positions), from the pool of variant loci. For gene-wise , will for a gene and a sample per default be set to N/A if one or several loci included in the VCF file for the gene have missing data for the sample, to avoid biases between samples for the gene due to missing data.
A locus not reported in the VCF file can be missing because no genetic variation was observed, but also because the locus did not have sufficient sequence depth coverage in the pool of samples when running the variant calling. The latter can lead to values being biased downwards (since these loci are assumed to have i = 0) and can skew the comparison of between genomes (but should however not affect comparisons between samples for the same genome, if multi-sample variant calling was conducted). The for a sample may also be deflated if some loci included in the VCF have lower coverage for the sample than the threshold specified by the '--min_count' parameter, since these loci will not be included in the calculation. Finally, loci included in the VCF file but not fulfilling the '--min_count' criterion in at least the number of samples specified by the '--min_found' parameter, will also be excluded from the calculation, further deflating the . In order to adjust for these potential sources of errors, a normalised genome-wide is also calculated for each sample by dividing the genome-wide with an estimated completeness factor for the sample. The completeness factor is based on the assumption that loci with sufficient coverage (fulfilling the --min_count cutoff) in one sample are independent from those with sufficient coverage in another sample. Thus, the loci covered in sample 1 can be treated as a random subset of the genome and used to assess the completeness of sample 2, by calculating the fraction of loci covered by sample 1 that are also covered by sample 2. The completeness for a sample is assessed this way using all other samples, and the completeness factor is the average of these assessments. The normalised genome-wide is not used for F ST calculations since missing variant loci should affect intra-and intersample equally and have little influence on the F ST (see calculations below).

Fixation index, F ST
To calculate the fixation index (F ST ) for each pair of samples, the intersample has to be calculated. For permuted gene-wise F ST , the variant loci are randomly redistributed among the genes in a way such that each gene will obtain a new set of variant loci, with their associated allele frequencies, but will have the same number of variant loci as in the original case. The randomisation is done this way for every pair of samples (loci will be redistributed the same way for both samples in the pair). As for the F ST calculations above, only loci for which both samples in the pair have data will be included.

Amino-acid level and F ST
Gene-wise amino acid is calculated based on the variant loci within genes, including the start codon but excluding the stop codon. Amino acid for a single locus is calculated by modifying the gene sequence according to each detected allele (one at a time) for the locus in the sample and translating the modified gene into a peptide (based on the genetic code file). The counts of each unique peptide will then be used for the calculations of intra-and intersample (rather than the counts of individual nucleotides [or haplotypes] as above). This approach allows adequate amino acid diversity calculations also when having alleles >1 bp (haplotypes). The gene-wise amino acid level F ST is calculated analogously to the gene-wise nucleotide level F ST from the amino acid intra-and intersample values. As for gene-wise nucleotide diversity, a gene will for a sample get = NA if one or several loci in the gene that are included in the VCF file have missing data for the sample.

pN/pS
pN/pS measures the ratio of the nonsynonymous to the synonymous polymorphism rates, where pN equals the fraction of possible nonsynonymous mutations that are observed as polymorphisms and pS equals the fraction of synonymous mutations that are observed as polymorphisms. To calculate the pN/pS for a gene and sample, POGENOM first derives a consensus nucleotide sequence for the gene in the sample by modifying the reference nucleotide in variant loci based on the most frequent allele, while keeping the reference sequence in invariant positions. For each nucleotide position, every possible (single nucleotide) mutation relative to the consensus sequence is then recorded, and whether this mutation is nonsynonymous or synonymous and present as a polymorphism or not.
pN/pS is then calculated as: where n i is the number of observed nonsynonymous mutations (alleles), N i is the total number of possible nonsynonymous mutations, s i is the number of observed synonymous mutations and S i is the total number of possible nonsynonymous mutations for locus i. If no synonymous mutations are observed for the gene, pN/pS is set to NA. In addition to calculating pN/pS on a per-sample basis, POGENOM calculates it on all samples collectively by combing the allele frequencies of all samples.

Supplementary figures
Supplementary figure 1. Nucleotide diversity ( ) over time of five BACLs with at least eight samples in the LMO data set (upper figure). Lower panel compares difference in seasonal time (in days) between samples against difference in . We define difference in season time as the shortest distance between two dates of the year, i.e. January 2 and December 30 have a difference in seasonal time of 3 (independent of which years they were from).