Spontaneous mutation rate estimates for the principal malaria vectors Anopheles coluzzii and Anopheles stephensi

Using high-depth whole genome sequencing of F0 mating pairs and multiple individual F1 offspring, we estimated the nuclear mutation rate per generation in the malaria vectors Anopheles coluzzii and Anopheles stephensi by detecting de novo genetic mutations. A purpose-built computer program was employed to filter actual mutations from a deep background of superficially similar artifacts resulting from read misalignment. Performance of filtering parameters was determined using software-simulated mutations, and the resulting estimate of false negative rate was used to correct final mutation rate estimates. Spontaneous mutation rates by base substitution were estimated at 1.00 × 10−9 (95% confidence interval, 2.06 × 10−10—2.91 × 10−9) and 1.36 × 10−9 (95% confidence interval, 4.42 × 10−10—3.18 × 10−9) per site per generation in A. coluzzii and A. stephensi respectively. Although similar studies have been performed on other insect species including dipterans, this is the first study to empirically measure mutation rates in the important genus Anopheles, and thus provides an estimate of µ that will be of utility for comparative evolutionary genomics, as well as for population genetic analysis of malaria vector mosquito species.


Results
Successful mating and offspring generation was obtained with a female:male ratio of 1:4 for Anopheles coluzzii, and 1:2 for A. stephensi. Fatherhood was determined using whole genome sequencing information obtained for each potential male parent. Parental and focal offspring genomes were sequenced at a high coverage (≥ 20), whereas "bait offspring" were sequenced at lower coverage (≤ 20) (Supplementary Tables 1 and 2). One A. coluzzii offspring sample (Ac-F1-F13) was removed from analysis due to low yield and poor mappability (< 2%). We obtained 10 focal and 19 bait offspring for A. coluzzii and 13 focal and 17 bait offspring for A. stephensi (Table 1). Male parent identification. After joint variant calling for each species, true fathers were identified by relatedness 43 and Mendelian inheritance-based analysis, both using only autosomal, biallelic SNPs from all offspring. For A. coluzzii, male Ac-F0-M01 presented a significantly higher relatedness with offspring specimens than any other male (p-value < 0.0001; Fig. 1a). The same male sample showed the highest proportion of observed heterozygous genotypes among all offspring based on expected heterozygosity, i.e., when both parents are homozygous for different alleles (  (Table 3). These were the set of biallelic variant sites used for identification of candidate de novo mutations using a program in Perl that filters each criteria step. The first criterion applied was calling both parents as homozygous for the reference allele with a minimum read depth of 10, which yielded a total of 298,527 and 45,338 variant sites in A. coluzzii and A. stephensi, respectively (Table 3). Next, variant sites with alternative alleles that were either in the bait offspring or not called in every focal offspring were filtered out,  Synthetic mutations analysis. We generated synthetic mutations to estimate the false negative rate (FNR) of our mutation detection protocol for each species (see "Methods"). In total, 1000 synthetic mutations were inserted in each of the focal offspring (A. coluzzii-10 individuals; A. stephensi-13 individuals) across randomly selected autosomal sites where both parents achieved a read depth of 10 or higher. For A. coluzzii, of the 10,000 synthetic mutations introduced, 9914 had a read depth of 10 or higher for focal offspring, and therefore these were considered for FNR calculation. Ultimately, our de novo mutation detection program detected 7,658 synthetic mutations that passed all filters. This implies a FNR of 22.8% for the A. coluzzii analysis (2,256 synthetic mutations were not detected out of 9,914 inserted in callable sites; Supplementary Table 6). In A. stephensi, 12,724 out of 13,000 synthetic mutations were callable, of which 9563 were detected. Therefore, this species presented a FNR of 24.8% (Supplementary Table 6).   Table 6).

Discussion
All eight mutations detected occurred in non-coding regions with equal numbers of transitions and transversions. Five out of the eight de novo mutations detected occurred in male offspring. Male mutation bias in eukaryotes is well known 14,44,45 . Sanger sequencing validation identified only a single false positive mutation in A. coluzzii among all the de novo mutations detected by our bioinformatics pipeline. The minimal false-positive rates supported previous studies which showed that higher sequencing depth in multiple samples reduced the false positive rate. These previous studies reveal that 30× and higher read depth data in multiple samples gives 99% genotype accuracy, but lowering read depth increases false-positive genotypes due to mis-mapping and paralogous reads 18,46,47 . Like reports in bees and insects [18][19][20][21] , estimated mutation rates in A. coluzzii and A. stephensi were similar to each other. Overall, data for mutation rates in bees, flies, and mosquitos confirm that mutation rates are very close among species within the class Insecta, whereas values among mammal species differ by nearly an order of magnitude (Table 5). Interestingly, genome sizes among these eukaryotic species are not similar (Table 5), supporting the suggestion that genome size does not significantly influence mutation events in eukaryotes 7,19,48 . Conversely, there is a negative correlation (R 2 = 0.5953) between mutation rate and generation time (Fig. 2). Species with long generation times (i.e., few generations per year) tend to have high mutation rates and vice versa, a trend that is apparent across distant taxa from insects to mammals. This might be a mechanism to balance mutation load over evolutionary timescales 49 . The mutation rate of a species is a vital parameter applied in evolutionary and population genetics, but estimating mutation rates is a formidable challenge 48 .
Point mutations occur in both non-coding and coding regions and are typically characterized as neutral 50 , beneficial 51 , or deleterious 52 based on their effect on fitness. As the names imply, neutral mutations show negligible effects on fitness, beneficial mutations increase the fitness of carriers and their frequencies tend to increase because of positive selection, and deleterious mutations decrease the fitness of carriers and tend to be removed from a population by purifying selection. Mutations in non-coding sequence are not necessarily neutral because they may alter protein binding sites that have direct consequences on gene regulation 53,54 . All mutations in this study were identified in non-coding regions.
The mutation rate per generation in a eukaryotic species is extremely low, meaning that only a few mutations evolve within an individual's entire genome. Detecting newly risen mutations efficiently is a tedious quest, especially in large genomes. We have found the approach presented herein to be very effective and it is our hope that it will prove useful if applied to other mosquito and insect species.

Methods
The workflow for this study, depicted schematically in Fig. 3, consisted of sample collection, high-throughput genome sequencing, variant calling, confirmation of paternity, detection of putative de novo mutations in offspring, verification of mutations, and estimation of the false negative rate by use of simulated mutations. Pupae were separated by sex and distributed in replicates with varying ratios of multiple males to single females to achieve mating. Adults were allowed to mate for four days before blood-feeding on heparin-treated bovine blood using a Hemotek artificial blood feeder (Hemotek Ltd, UK). Egg cups were placed in each mating cage after blood-feeding and hatched larvae were counted and transferred to a larger container for subsequent rearing. Replicates chosen for further analysis (one for each species) were those with the fewest F0 males but produc-   Whole genome sequencing. For each species, the genomes of the female, potential male parents, and a subset of the offspring (designated 'focal') were sequenced at a target depth of 30×, while the remaining offspring (designated 'bait') were sequenced at a lower target depth of 15×. This reflects a strategy to minimize false positive mutations by dividing the offspring into two arbitrary classes: focal and bait. Only sites in focal offspring were checked for candidate mutations, and the sites checked included only those without variants in the bait offspring. This resulted in an efficient filtering method to detect sites in the genome that are prone to erroneous read alignment. Since mis-mapping errors are expected to be frequent at the sites where they occur, they should be detected even at the lower depth of coverage employed for bait samples, thus economizing on sequencing resources. Prior to DNA extraction, the lower abdominal segment of females (containing the spermathecae) were removed to prevent sperm DNA contamination in downstream sequencing analysis. DNA extraction was carried out using a Qiagen Biosprint 96 extraction robot (Qiagen, Hilden, Germany) following a protocol established in our lab for increased DNA yield 55  Pre-processing of sequence data, mapping, variant calling. Low-quality and adapter sequences were removed from reads using Trimmomatic v0.39 57 with settings illuminaclip:2:30:10, leading:3, trailing:3, slidingwindow:4:15, minlen:36. Reads were marked for PCR duplicates using Sambamba v0.6.7 58 . The trimmed paired reads of both species were mapped onto their respective reference genomes using BWA MEM v0.7.17 59 with default settings. Anopheles gambiae reference genome 'PEST AgamP4 60,61 was the mapping target for A. coluzzii samples. The PEST genome originates from an A. gambiae-A. coluzzii hybrid laboratory strain and is commonly regarded as suitable for genomic analysis of both species 62,63 . The sequence reads of A. stephensi were mapped onto the recently assembled genome of A. stephensi 'Indian' strain 64 . Mapping quality control statistics were generated with Qualimap v2.2.1 65 . Joint variant calling was performed using Freebayes v1.1.1 66 and biallelic SNPs were extracted from the VCF file using BCFtools v1.9 67 .

Male parent identification.
Various inheritance-based approaches are available for kinship analysis based on identification of common alleles shared by descent [68][69][70] . Here, we adopted similar approaches, using autosomal biallelic variant sites to identify the male parent from among potential fathers in each grouping by applying two strategies: (i) relatedness levels between parents and offspring were estimated by the KING inference method 43 as implemented in -relatedness2 from VCFtools 71 ; and (ii) concordance with expectations of Mendelian inheritance were assessed at variant sites where the genotypes of the parents are homozygous for different alleles, and therefore the offspring are expected to present only heterozygous genotypes. For the second method, separate analyses were performed for each potential male parent in combination with the female parent using a custom Perl script. The pair with the highest number of sites that were heterozygous in all offspring was identified as the true mating pair. After identification of the true fathers, the irrelevant F0 male samples were removed from the dataset.
Calling candidate mutations. Starting with the biallelic SNPs for each species in VCF format we applied a defined program in Perl to filter each variant site according to a set of criteria which define a candidate mutation as follows: (i) Both parents are homozygous for the reference allele with read depth (DP) ≥ 10, and no alternate allele reads present (AD ALT = 0). (ii) All bait offspring with called genotypes are homozygous for the reference allele with no alternate allele reads present (AD ALT = 0). (iii) All focal offspring have a called genotype and either 1 or 2 focal offspring are called as heterozygotes (or are hemizygous for the alternate allele in the case of male offspring at sites on the X chromosome) and DP ≥ 10, while the remaining focal offspring are homozygous for the reference allele with no alternate allele reads present (AD ALT = 0). (iv) For the heterozygous focal offspring in (iii): at sites on the autosomes, and for female offspring at sites on the X chromosome the number of alternate allele reads present (AD ALT ) must be related to the total read depth (DP): 0.27 × DP ≤ AD ALT ≤ 0.73 DP.
Manual curation of candidate mutations. All candidate mutations detected by our program were examined using Integrated Genome Viewer (IGV) 72 which facilitates visualization of aligned reads from the original BAM files. In order to rule out obvious mis-mapping artifacts candidate mutations were considered within the context of all overlapping reads and adjacent reads and their variant sites.
Sanger sequencing. After

Synthetic mutations.
To estimate the false negative rate (type 2 error), we added 1000 faux (= synthetic) mutations to each focal offspring sequence data and measured the ability of our analytical pipeline to detect them 19 . Synthetic mutations were randomly inserted in autosomal sites using BAMSurgeon software 75 . The introduction of a mutation was performed by changing a fraction of reads overlapping the site and replacing the existing base with an alternate base randomly selected for the site. The number of reads altered for each synthetic mutation was determined based on an empirical distribution of alternate allele read depths at all heterozygous autosomal sites in the full dataset for all offspring for each species. These empirical distributions were calculated for each overall read depth between 10 and 100. When a synthetic mutation was created at a site with a given overall read depth, the quantity of alleles altered was drawn from the appropriate, corresponding distribution for that depth. To calculate empirical distributions, determination of heterozygosity relied on the principles of Mendelian inheritance: when each parent was homozygous for a different allele, all offspring were assumed to be heterozygous at that site. For example, for A. coluzzii there were 82,616 such heterozygous sites where overall read depth was 12 (Supplementary Table 5). The empirical distributions for alternate allele depth at these sites, as well as the corresponding binomial distribution calculated from the data mean and variance, are shown in Supplementary Fig. 3. For a synthetic mutation site with depth 12, a random integer N would be chosen from the empirical frequency distribution and N reads would be changed to the randomly chosen alternate base in the focal individual's BAM file. After introduction of synthetic mutations, all BAM files were converted into FASTQ format and then processed with the mapping, variant calling, and filtering methods described above. For each species, dividing the number of synthetic mutations detected by the number of synthetic mutations inserted into callable sites provided an estimate of the false negative rate (FNR) which was used as a correction factor in the final mutation rate calculation.
Estimation of mutation rate. For the calculation of the mutation rate (µ), we desire the number of mutations per genome per generation. The enumerated total of mutations we detected in each species serves as the numerator. For the denominator we require an estimate of the number of callable sites, that is sites where we can confidently determine the presence or absence of a mutation, should one be present. For each species the corrected mutation rate was calculated as follow:

Data availability
Whole genome sequence data included in this study are deposited in NCBI GenBank under BioProject PRJNA732889. Anopheles stephensi from accession number SAMN19349627-SAMN19349659 and A. coluzzii SAMN19355128-SAMN19355162.

Code availability
Custom codes used for the analysis are available on GitHub page: https:// github. com/ vecto rgene tics-lab/ mutat ion-rate.