Whole genome sequencing of nearly isogenic WMI and WLI inbred rats identifies genes potentially involved in depression and stress reactivity

The WMI and WLI inbred rats were generated from the stress-prone, and not yet fully inbred, Wistar Kyoto (WKY) strain. These were selected using bi-directional selection for immobility in the forced swim test and were then sib-mated for over 38 generations. Despite the low level of genetic diversity among WKY progenitors, the WMI substrain is significantly more vulnerable to stress relative to the counter-selected WLI strain. Here we quantify numbers and classes of genomic sequence variants distinguishing these substrains with the long term goal of uncovering functional and behavioral polymorphism that modulate sensitivity to stress and depression-like phenotypes. DNA from WLI and WMI was sequenced using Illumina xTen, IonTorrent, and 10X Chromium linked-read platforms to obtain a combined coverage of ~ 100X for each strain. We identified 4,296 high quality homozygous SNPs and indels between the WMI and WLI. We detected high impact variants in genes previously implicated in depression (e.g. Gnat2), depression-like behavior (e.g. Prlr, Nlrp1a), other psychiatric disease (e.g. Pou6f2, Kdm5a, Reep3, Wdfy3), and responses to psychological stressors (e.g. Pigr). High coverage sequencing data confirm that the two substrains are nearly coisogenic. Nonetheless, the small number of sequence variants contributes to numerous well characterized differences including depression-like behavior, stress reactivity, and addiction related phenotypes. These selected substrains are an ideal resource for forward and reverse genetic studies using a reduced complexity cross.


Results
To discover variants associated with the depression phenotype in WLI and WMI rats, whole genome sequencing data was obtained using three different platforms: Ion Torrent Proton, 10X Chromium and Illumina xTen from male WLI and WMI rats. Each technique covered an average depth of 41, 27 and 43 for both strains, respectively ( Supplementary Fig. S1). X-chromosomal coverage was expected to be half of autosomal coverage but was found to be much higher on IonProton and Illumina X-ten sequencing results ( Supplementary Fig. S1).
Sequencing data were mapped to the rat reference genome rn6 using bwa 42 (Illumina and IonProton data) or LongRanger (10X Chromium data). The resulting bam files were used as the input to DeepVariant 43 to report genomic variants (i.e. SNPs and small indels) for each sample. GLNexus 44 was then used to conduct a joint analysis of variants across all six samples. Over 12 million unique variants were identified before filtering. The analysis workflow was designed to take full advantage of the data provided from three sequencing technologies. We were interested in variants that have a Phred quality score above 30, have a clear call for either reference, homozygous or heterozygous, have no matching calls between WLI and WMI and must not have both a high quality reference and alternative allele called on different sequencing platforms within the same strain (Fig. 1).
Phred quality scores and coverage varied greatly per genomic region ( Fig. 2 track 1 and 4). A large portion of the variants had a Phred quality score below 10 and were excluded from subsequent analysis. In total, 99465, 25937, and 6454 homozygous variants had a Phred quality score greater than 10, 20, and 30 in at least one sample,  Fig. S2). The majority of high confidence heterozygous calls came from a single technique, Ion proton ( Supplementary  Fig. S2). Closer inspection revealed that the majority (> 90%) of these calls was detected on homopolymeric nucleotide sequences. In addition, approximately 95% of these were deletions rather than SNPs, further confirming that these calls are due to errors in base calling homopolymeric sequences. To filter out this common sequencing error, all deletions on homopolymeric regions which were not supported by at least one other sequencing technique were removed ( Supplementary Fig. S3). 12 www.nature.com/scientificreports/ As a final result, 2232 and 2064 homozygous high confidence variants were discovered on WLI and WMI respectively ( Fig. 2 track 2). The majority were insertions (45.3%) followed by SNPs (36.9%), and finally deletions (17.8%) ( Table 1). Of these SNPs, approximately 57% were transitions, meaning a purine nucleotide was mutated to another purine or a pyrimidine nucleotide to another pyrimidine. The other 43% were transversion SNPs, in which a purine was replaced by a pyrimidine or vice versa (Table 1). A total of 655 and 894 heterozygous variants were identified for WLI and WMI ( Fig. 2 track 3). It should be noted that the heterozygous variants contained higher coverage than average as compared to homozygous variants ( Supplementary Fig. S4). This implies a large portion of these could be homozygous SNPs aligned to collapsed regions on the reference genome.
In addition, 79 and 119 homozygous variants were identified for WLI and WMI, respectively, with a Phred quality score of at least 10 in all three sequencing technologies. Though verified across technologies, quality scores cannot be simply summed. For certitude these were not included in the final selection.
We used SnpEff 45 to identify the impact, location (Table 2), and the nearest gene in proximity of these variants. About half of the variants (52%) are located within intergenic regions, whilst some (a total of 62) variants fall within exons, 2432 are within introns, and 450 are located within 5 KB upstream of a gene (Supplementary  Table S1).
In total, 1491 unique genes were in closest proximity to the final selection of homozygous variants across both strains. Of these, 744 genes were found in WMI and 866 in WLI (119 were found in both strains). These SNPs and indels are distributed across the entire genome and no genomic region shows enrichment of variants. However, three separate regions (1 kb) on WMI contained up to 5 SNPs on chromosomes 1,4 and 14. One of these clusters was positioned within the 3' UTR of Zfp418. The other SNPs were located within intergenic regions.
We identified 75 and 70 SNPs on the X-chromosomes of WLI and WMI, respectively. SNPeff identified 8 as downstream variants, 46 as intergenic, 14 within introns, 5 upstream of genes and 2 in the 3' UTR for WLI; 3 downstream variants, 51 within intergenic regions, 10 within introns, 4 upstream of genes, 1 in the 3'UTR, and 1 on a splice site/intron for WMI. Of these, two intron variants in WLI fell within HTR2C, a gene associated with depression 54,55 schizophrenia 56,57 , and stress response 58,59 . One intron variant on WLI fell within Il1rapl1, a gene associated with autism 60,61 and schizophrenia 62 . Additionally, an intron variant in WMI fell within CDKL5, a gene associated with neurodevelopmental disorders such as seizures and autistic-like symptoms 63,64 . No SNPs of high quality were identified on the Y-chromosome for either strain.
Previous research identified 101 genes that were significantly differentially expressed between WMI and WLI brain tissue 30,65 . We found 232 SNPs or indels located within or near these differentially expressed genes. Out of these variants, 128 fell in intergenic regions, 95 within intron variants, 4 upstream gene variants, 3 downstream gene variants and 2 within the 3'prime UTR of genes (Supplementary Table S2).
We also leveraged Gene Ontology-term (GO-term) and KEGG-term enrichment analysis using G-profiler 66 to explore the biological functions of genes in close proximity to sequence variants. We found an over-representation of several neurogenesis, behavioral and locomotion related pathways. Over-represented terms for WLI included locomotion, behavior, nervous system development, neuron projection and neurogenesis (Supplementary www.nature.com/scientificreports/ Table S3). Over-represented terms for WMI included Par-3-KIF3A-PKC-zeta complex, actin-mediated cell contraction, neuronal related and cellular stress related pathways (Supplementary Table S3).
Lastly, we validated our genome sequencing results by selecting 224 SNPs, half unique to each strain, and using multiplex PCR to amplify each target region (150 bp flanking each variant) in genomic DNA collected from 8 rats, including four WMI and four WLI with equal number of males and females. We then constructed sequencing libraries using these PCR products and sequenced them on an Illumina instrument. We were able to obtain PCR products from 89 and 87 primers sets targeting WLI and WMI specific variants, respectively. Among them, 75 WLI and 76 WMI targets met the following two criteria: 1. homozygous alternative in at least three rats of the target strain; 2. homozygous alternative in none of the rats of the opposite strain. Therefore, the positive rate of our stringent empirical validation using 8 rats was 85.8%.

Discussion
The goal of this research is to give us genetic markers for WLI and WMI in context of other strains in reduced complexity crosses and to give us candidate variants for immediate scrutiny of linkage to depression. We used three leading next-generation sequencing technologies to obtain a combined coverage of approximately 100X for each genome of two closely related inbred rat strains, the WMI and WLI. We identified 4296 homozygous variants with high fidelity that are located in close proximity to 1491 unique genes that differ between these two strains. The SNPs and indels identified in this dataset offer new opportunities for the identification of genes related to the phenotypic differences between the WLI and WMI strains.
The WMI strain was originally characterized as a genetic model of depression [29][30][31] . However, since their development based on behavior in the forced swim test, it has been suggested that this task measures coping style 68 , thus WLIs and WMIs may be a genetic model of stress coping style differences. Still, their blood RNAseq results contributed to the identification of a blood-based transcriptomic panel for human depression 38,39,69 . Furthermore, the strains differ in behavioral and hormonal responsiveness to acute and chronic stress 33,36 , and in drug-taking behaviors 36,70 . Thus, regardless whether the WMI is a genetic model of depression-like behavior or passive coping, the fact that many WMI phenotypes show behavioral parallels with human stress-related disorders, including major depression, inform its significance.
Each of the three sequencing methods we used has its own merits and flaws. For example, compared to the widely used Illumina platform, the Ion Torrent platform provides high quality data at a lower cost. However, it suffers at homopolymer regions. The 10 × Chromium linked reads technology attaches barcodes to high molecular www.nature.com/scientificreports/ weight DNA before library preparation and can detect large structural variants. But obtaining good quality HMW DNA is technically challenging and is associated with increased cost. Further, when utilizing sequencing data from a single technique, technical biases are likely to make their way into the final result. By removing variants called differently by sequencing platforms, the technical bias is mitigated across the final selection of variants. We used DeepVariant to identify SNP and small indels across all sequencing techniques 43 . Deepvariant has been shown to outperform GATK in different tests, especially in calling indels 71 . It also handles data from diverse sequencing platforms without additional calibration. We used LongRanger to map the 10X linked reads sequencing data to the reference genome because it incorporates the molecular barcodes into the mapping algorithm. Following DeepVariant analysis, we used GLNexus to conduct a joint analysis to obtain a list of raw genomic variants. Joint analysis empowers variant discovery by leveraging population-wide information from a cohort of multiple samples, allowing us to detect variants with great sensitivity and genotype samples as accurately as possible 72 .
With the combination of different sequencing methods, a higher certainty of variant calling between WLI and WMI has been made possible, though throughout this experiment strict filtering was performed. We first removed any variants detected with any certainty in both WLI and WMI, because we are interested in the differences between these two strains, rather than their common differences to the reference genome. One caveat of this approach is that variants incorrectly called by DeepVariant (e.g. due to low coverage in a single method) can lead to the exclusion of potentially interesting targets. Similarly, a strict quality score cutoff of 30 was used whilst un-opposed by any other sample with a Phred quality score of 10 or higher. Based on the abundance of data we have collected, including thousands of identified variants, we decided that 1 in 1000 variants was a sensible cutoff to avoid hundreds of false positives in the final set of variants.
The current reference genome (rn6) consists of 75697 contigs and 1395 scaffolds with N50 lengths of 100.5 KB and 14.99 Mb respectively. These sequences combine into a golden path of approximately 2.8 billion bases. Due to the fragmented nature of the reference genome, the identification of structural variants has proven to be difficult. One example of this is that it is often not possible to establish whether sequence variation is strain specific or related to a problem with the reference genome. In addition to the 4,296 high quality homozygous variants discovered in this research, an additional 15268 variants were discovered in either WLI or WMI with no significant coverage or significant phred score on the opposing strain (called as ./. ). Without a high quality read on both strains we cannot verify newly discovered variants. These low quality calls could be caused by heterozygosity, low coverage or overlapping variants.
We were able to confirm 151 out of 176 (85.8%) of the variants in 8 additional samples using an independent method. We found 232 SNPs in proximity of 101 genes which were significantly differentially expressed in previous research 30,65 . A large portion of these fell within intergenic regions that are unlikely to affect gene expression (Supplementary Table S2). Others, like those upstream of the TSS, downstream of genes r within introns are more likely to regulate expression levels. Together, these results provided high confidence on the accuracy of our analysis results.
Previous research has shown several sex specific differences between the WLI and WMI strains. Although depression-like behavior is different between these strains in both adults and adolescents in males, strain difference is only found in adult females and not in adolescent females. In addition, in measures of anxiety behavior, strain difference is found in adult males but not in adult females 31 . In this investigation several SNPs were identified on chromosome X. A few of these variants are located in the intron of genes previously associated with psychiatric conditions. However, this does not guarantee these variants are directly responsible for the phenotypic differences between the WMI and WLI.
Cross comparison of the genomic positions of variants discovered on WLI and WMI with variants discovered in a panel of 44 inbred rat strains (unpublished data). Out of the 2232 variants on WLI and the 2064 variants on WMI, 1215 and 856 were unique to each strain, respectively. This indicates these variants are likely denovo to WMI and WLI. Future research can verify the origin of the variants identified in this research.
In this investigation 655 and 894 heterozygous variants were discovered on WLI and WMI respectively. Despite both strains being fully inbred, there is a chance that de-novo mutations could propagate as heterozygous variants within each substrain. A look at the coverage of these positions reveals an average two-fold coverage, implying these variants are homozygous on collapsed regions on the reference genome, or duplicated and mutated within either strain. With an updated reference genome these regions could be resolved and can contribute in a meaningful way to the identification of variants that contribute to phenotypic differences between WLI and WMI substrains.
The current reference genome (Rnor_6) is likely to contain many errors 73 , this means some caution is required when identifying variants for WLI or WMI strains based on comparison to the current rat reference genome. There is a chance that variants found in both strains could potentially be due to base level errors in the reference, i.e., there is no variant present at all. Similarly, when a variant is only reported in strain A, there exists a small chance that the variant actually is located on strain B (i.e. the base level error in the reference happens to be the same as the sequence in strain B). Thus, a small percentage of the reported mutations in WLI strain could potentially be present in WMI. This might contribute, to some degree, to the enrichment of neuronal GO-term annotation for genes located within the vicinity of WLI sequence variants.
GO-term annotation enrichment for genes in the nearest proximity of variants detected in WLI included locomotor behavior and neuron projection. This provides some evidence that these variants could be capable of producing an impact on behavior, however this will require further investigation. As locomotory behavior is a complex trait, a combination of variants can be causal. For WMI the terms: neuron to neuron synapse (GO:0098984), nervous system development (GO:0007399), generation of neurons (GO:0048699), and finally, the Par-3-KIF3A-PKC-zeta complex (CORUM:899) was significantly over-represented. The Par-3-KIF3A-PKC-zeta www.nature.com/scientificreports/ complex is interesting as both parts are in proximity of variants detected on WMI and it is involved in the establishment of neuronal polarity 74 . The ancestral WKY strain was noted for its highly variable behavior between individuals 21,22 . With the discovery of variants associated with psychiatric phenotypes in both strains it should be kept in mind that variants could have been both selected for and against. In addition, as discussed above, there is a small chance some variants are located on the opposite strain due to potential errors in the reference genome. For this reason, we have only included variants which are different between WLI and WMI and not those that are different relative to the reference genome.
Lastly, our current analysis is focused on SNPs and indels. Additional data are required to accurately identify other genomic differences, such as large insertions, deletions, tandem repeats, etc. Further, finding genetic differences between the WMI and WLI strains is only the first step in identifying causal variants for many of the phenotypes that are different between them. These causal variants can be identified by genetic mapping using F2 offspring from WMI and WLI parents. This strategy of generating a reduced complexity cross to map causal genetic variants has many successful precedents [10][11][12] and could lead to identification of novel genes and variants causing phenotypic variation between these two strains (e.g., depression-like behavior, drug abuse, memory, aging, stress responsiveness).

Methods
Animals. Liver tissue from 4 adult WLI (2 males and 2 females) and 4 adult WMI (2 males and 2 females) rats were collected. Rats used for genome sequencing were randomly selected from the colony. Blinding was not relevant in this study because rats were not subjected to any experimental conditions. Equal amounts of tissue from males and females were pooled for each strain (total weight = 20 mg). DNA were extracted using the Qiagen DNeasy blood and tissue kit (Cat# 69506).
Whole genome sequencing. For sequencing using the HiSeq X Ten instrument, DNA whole genome shotgun sequencing libraries were generated using 200 ng of genomic DNA as input for the TruSeq Nano DNA Library Prep Kit (Illumina). Indexed libraries were sequenced as pools of eight samples on a full slide (8 lanes) on an Illumina HiSeq X Ten sequencer using HiSeq X Ten v2.5 reagents. For sequencing using the Ion Torrent instrument, 1 μg of genomic DNA was sheared to an average size of 200 bp using a Covaris S2 Sonicator. Then 500 ng of the sheared DNA was used to prepare libraries for sequencing using the AB Library Builder™ Fragment library Kit on a Library Builder system. Libraries were used without amplification and size selected on a 2% Pippin Prep gel. After quantification using qPCR, the libraries (190 pg) were then used to prepare beads for sequencing using an Ion Torrent One Touch instrument. DNA on these beads then sequenced on an Ion Torrent Proton sequencer using Hi-Q chemistry and a P1 chip. For 10X Chromium sequencing, the Qiagen MagAttract HMW DNA kit was used for DNA isolation. Sequencing library was then constructed from 1 ng of high molecular weight (~ 50 kb) genomic DNA using the Chromium Genome Library kit and sequenced on Illumina Hi-Seq (150 bp PE).
Mapping. Illumina and Ion proton data were mapped to the rat reference genome (rn6) using bwa (reference). 10 × Chromium data were mapped to rn6 using LongRanger (ver 2.2.2). DeepVariant (ver 1.0.0) was used to call SNPs and small indels from the bam files and GLnexus was used for joint calling of variants.
Analysis. Variant identification was performed separately for each strain and sequencing method. A total of 6 samples spread over 2 strains and 3 sequencing technologies were analyzed. Variants with less than 10 reads across all samples or more than 300 on a single sample for a variant were removed. Variants with the same highest quality call for WLI and WMI were removed. Variants with an identical call for all three sequencing technologies within either WLI or WMI were stored for further analysis. Variants with 5 out of 6 uncertain calls (./.) were removed. Variants with the same highest quality call for WMI and WLI were removed. Variants with 5 out of 6 identical calls of which the last had a quality score less than 10 were removed. If the majority (> 90%) of reads were of the same variant call across all reads and both strains shared at least 25% of all reads, the variant was removed.
Variants were selected based on the highest quality call per method and removed if disputed by variants called on another sequencing method with call quality of at least 30 within the same strain. Only variants were included in which the call for WLI differed from WMI and one of two strains was called as 0/0 (reference allele). Finally, all deletions on a position consisting of two identical nucleotides (homopolymeric) which were not supported by multiple sequencing techniques were removed (Fig. 1). The final selection was exported to VCF per strain and type of call (homozygous or heterozygous). Figure 2 was generated with the aid of the Circos R package 76 .
SnpEff (v4_3t_core) was used for nearest gene identification, impact estimation and annotation of the VCF for selected variants 45 . Impact and nearest genes were estimated separately per strain, as well as heterozygous and homozygous variants. Variants marked as high or moderate impact were separated and placed in Table 3. The annotated VCF is available for reference. g:Profiler version e101_eg48_p14_baf17f0 was used for GOterm enrichment analysis, standard settings were used, no background dataset was utilized 66 . RatsPub 67 ) was used to explore a small set of genes nearest to variants enriched with the GO-term: neuron to neuron synapse (GO:0098984).
Validation of variants and small indels by targeted re-sequencing. Ear  www.nature.com/scientificreports/ equal distribution across the genome. Individual primer pairs were designed using Batch Primer 3 (http:// probes. pw. usda. gov/ batch prime r3/) at default settings for generic primers with total amplicon size set as an optimum of 100 bp with the amplified region containing the target SNP (or region of interest). The primer sequences and genomic DNA were submitted to Floodlight Genomics (FG, Knoxville, TN) for processing using a Hi-Plex targeted sequencing approach 75 . The Hi-Plex approach pools primers to PCR amplify targets and adds a barcode sequence during the amplification process. The resulting target library is then sequenced on an Illumina instrument. Data were then aligned to the fasta file containing the targeting target variants using bwa. Genotypes for each sample were called using Deepvariant.
Ethics approval and consent to participate. All procedures were approved by the Animal Care and Use Committee of The University of Tennessee Health Science Center and were conducted in accordance with the NIH Guidelines concerning the Care and Use of Laboratory Animals. All methods are reported in accordance with ARRIVE guidelines.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. Analysis scripts and annotated vcf files are available from github (https:// github. com/ trist andej ong/ WLI_ WMI_ analy sis). www.nature.com/scientificreports/