Sixteen diverse laboratory mouse reference genomes define strain specific haplotypes and novel functional loci

We report full-length draft De novo genome assemblies for 16 widely used inbred mouse strains and reveal extensive strain-specific haplotype variation. We identify and characterise 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome resulting in the completion of 10 new gene structures and 62 new coding loci were added to the reference genome annotation. Notably these genomes revealed a large previously unannotated gene (Efcab3-like) encoding 5,874 amino acids. Efcab3-like mutant mice display anomalies in multiple brain regions suggesting a role in the regulation of brain development.

sequencing data generated using whole genome Dovetail Genomics, PacBio RSII, short read Illumina and long-insert (3kb, 6kb and 10kb) paired Illumina (protocols and methods described in previous section). The targeted reassembly pipeline involved two phases. The initial phase involved extracting sequence read data aligned to these regions from the whole genome assembly for each strain. These data were then used to assemble and improve sequence. The improved sequence was then incorporated back into the strain assembly. The pipeline is described and summarised in Supplementary Figure 16.
Dovetail de novo contigs (DDC) For each strain, contigs, generated using the Dovetail Genomics de novo assembly pipeline, encoding these loci and approximately 100kb up and downstream are extracted. These flanking contigs can be used to anchor reassembled sequence to regions of the genome, before incorporating the improved sequence back into the updated assembly. Extracted Dovetail contigs were masked for repeats and low complexity sequence using Repeatmasker v4.0.5 31 .
Merged PacBio Collection (MPC) BLASTall v.2.2.25 32 was then used to identify strain-specific whole genome PacBio RSII reads mapped to the extracted (and repeat masked) Dovetail contigs. Pacbio reads matched to an extracted contig were retained. In addition, all strain-specific PacBio reads were mapped to the mouse reference genome (GRCm38) using BWA-MEM v0.7.12 29 , and reads uniquely mapped to these loci were also retained. All retained PacBio reads were then mapped back to the extracted Dovetail de novo contigs, and, where possible used to fill gaps between contigs. Additionally, these reads were retained for manual sequence improvement.
Illumina de novo contigs (IDC) Strain specific Illumina short reads, and their corresponding read-pair mates (mapped or unmapped), mapped to the reference genome (GRCm38) at these loci were extracted. All extracted reads were mapped to the Dovetail assembly contigs using Geneious R8 allowing a maximum of 2% base pair mismatch and up to 3% indels. Extracted reads that failed to map to the Dovetail contigs, and their corresponding mate-pairs (mapped or unmapped), were assembled into short contigs using Geneious R8 33 .

Long insertion Illumina collection (LIIC)
Strain-specific long insert (3kb, 6kb and 10kb) Illumina reads mapped to the reference at these loci were extracted, and mapped to the repeat-masked reference (GRCm38) genome. Read pairs where both mates mapped to a repetitive or low complexity region were excluded, and all other mapped read-pairs were retained.
Sequence improvement around these four loci Data extracted above was then used to manually reassemble and improve the region sequence quality surrounding the described loci. The reassembled sequence was then incorporated back into the whole genome assemblies using the steps described below.
Step 1. Beginning with the long-insertion illumina contig reassembly, the Dovetail, Illumina de novo and PacBio contigs are manually connected and adjusted into region scaffolds. Size of gaps were estimated using the long-insertion Illumina read-pairs.
Step 2. Illumina reads and Pacbio data were mapped to the adjusted region scaffolds with Geneious R8 33 . Mismatches and conflicts were manually checked, and errors introduced by PacBio long reads were corrected using the Illumina short read data.
Step 3. Reassembled region scaffolds were digested with BspQI restriction endonuclease in silico using Knickers 1.5.5 34 . Confirmed BioNano optical map digestion sites were then used to connect individual small scaffolds into a single larger scaffold. Gap-lengths were estimated using the optical map data.
Long range PCR (gap filling) To further fill gaps in the PWK/PhJ strain assembly across the Nlrp1 locus, long range PCR was performed. Unique primer pairs up to 20kb apart were used to generate PCR products, spanning assembly gaps, that were sequenced using PacBio RSII. These sequences were then incorporated into the Nlrp1 region-scaffold for PWK/PhJ generated in step 3 above. Primers and corresponding products are summarized in Supplementary Table 26.

Olfr members in CAST/EiJ
To assess olfactory gene family differences between the GRCm38 reference genome and the wildderived CAST/EiJ, the CDS of all olfactory genes annotated in GENCODE vM8 and NCBI's refSeq were extracted and used to search against the CAST/EiJ genome assembly predicted genes using BLASTall 32 . Significant matches (E < 10 -40 ) of CAST/EiJ predicted genes to the reference genome were marked as one of: 1. Match: same chromosome and approximate location as GRCm38 reference annotation, same order of adjacent genes and <1% gaps in the CDS.

2.
Match with gap: same as match, but >1% gaps in the CDS. Transcriptome sequencing data of CAST/EiJ olfactory sensory neurons 37 was used to manually fill gaps for these genes.
3. Cross: same as match, but the order of adjacent genes is altered.

4.
New: no significant match to any GRCm38 annotated genes, or where many CAST/EiJ predicted genes have significant hits to a single GRCm38 gene, all passing non-primary hits are considered novel expansions of the olfr family gene tree, whereas the top matching hit is considered orthologous to the GRCm38 gene and thus not novel. All predicted novel olfr family members were manually confirmed, and compared to previously published CAST/EiJ olfactory gene predictions 37 . In addition, to further identify novel olfr family members, transcriptome sequencing data of CAST/EiJ olfactory sensory neurons was mapped to the CAST/EiJ gene CDS collection. Unmapped reads were de novo assembled into contigs, which were then marked as 'new'.
5. Disrupted: partial CDS loss due to genome recombination or the insertion of transposable elements. 6. Fail: GRCm38 genes which have no significant matches identified in CAST/EiJ. 7. Gene loss: For genes marked as fail, CAST/EiJ illumina short reads mapped to the GRCm38 reference at the gene's location were examined for potential gene loss. Confirmed deletions of olfr family members in the CAST/EiJ genome relative to GRCm38 are marked as 'gene loss'

Reference genome sequence updates informed by strains
In order to generate sequence to fill gaps in the reference genome assembly (GRCm38) that rendered genes incomplete, we identified contigs in the C57BL/6NJ assembly that corresponded to the missing sequence by investigating the genome alignments in gEVAL 11 . C57BL/6J reads (PRJNA51977) were aligned to the C57BL/6NJ assembly using bwa mem with default settings. Reads mapping to the previously identified contigs were extracted using bespoke scripts (available upon request) and assembled using geneious 9.1.3 33 . The resulting sequence was compared to (a) the C57BL/6NJ assembly to test for sanity, (b) the reference assembly to confirm overlaps with the existing gap borders, (c) the respective cDNA to identify missing exons and ensure that the generated sequence completes the fragmented gene and (d) to publicly available C57BL/6J assemblies accepted by the Genome Reference Consortium as gap fillers using NCBI Blast Genomes. For regions where (d) didn't lead to suitable alignments, the newly assembled contigs will be submitted and integrated into the reference assembly as described before 39 . If suitable publicly available contigs could be identified, they were validated as described for (a), (b) and (c).

GENCODE reference annotation updates informed by strains
Manual annotation for the whole mouse genome has not yet been completed, so the Augustus predictions were prioritised to those on chromosomes 1-12, for which manual annotation has been completed. These predictions were further refined to those for which 75% of the introns were novel compared to GENCODE. This gave a subset of 785 predictions which were subject to manual annotation. These predictions were imported so they were viewable as a track within the Otter/Zmap annotation tools on the C57BL/6J reference genome assembly. Predictions were manually assessed and where necessary manual annotation was performed according to the HAVANA guidelines as detailed by the GENCODE project 40 . The annotation decisions that were taken are summarised in Supplementary Data 8.

Novel gene structures
The gene discovered on chromosome 11, was named as "novel protein similar to EF-hand calcium binding domain 13 EFCAB13 (Homo sapiens)". The majority of the exons were annotated based on RNAseq (Figure 3a) plus limited EST support. Introns 20 and 105 were only supported from the Augustus prediction (see previous section). These exons spliced correctly and were in frame and as such could be annotated according to the HAVANA guidelines: ftp://ftp.sanger.ac.uk/pub/project/havana/Guidelines/Guidelines_March_2016.pdf with the transcript annotated tagged as having an inferred exon combination. Six additional partial transcripts were also annotated based on EST evidence from C57BL/6J.

Standardized Phenotyping Pipeline for Efcab3-like KO mice
Mice underwent standardized phenotyping from 4 weeks of age, including weight curves, behavioral and morphological assessment at 9 weeks of age, intraperitoneal glucose tolerance test (ipGTT) following a 4h fast at 13 weeks, body composition assessment by dual emission x-ray absorptiometry (DEXA) at 14 weeks of age using a modified version of the MGP pipeline, using a standard chow instead of a high fat diet. At 16 weeks, random-fed mice were anesthetized using 100 mg/kg Ketamine and 10 mg/kg Xylazine and blood was collected retro-orbitally. Death was confirmed by cervical dislocation and heart removal. The standard operating procedures can be found at IMPReSS (www.mousephenotype.org/impress). Factors thought to affect the variables were standardized as far as possible. Where standardization was not possible, steps were taken to reduce potential bias. For example, the impact of different people completing the experiment was minimized ("minimized operator") as defined in the Mouse Experimental Design Ontology (MEDO) 44 as "The process by which steps are taken to minimize the potential differences in the effector by training and monitoring of operator." http://bioportal.bioontology.org/ontologies/MEDO/?p=summary  Neurological and Morphological Phenotyping Assessment At 9 weeks of age mice are assessed for gross behavioural abnormalities using a modified SHIRPA screen and their whole body morphology is examined using a standardised checklist.
 Glucose Tolerance Test At 13 week of age, mice were singly housed and fasted for 4h after which approximately 0.5mm of the tail tip was removed with a scalpel blade and a fasting blood sample was taken. Mice were then given an intraperitoneal injection with 2g/kg glucose and further blood samples were taken at 15, 30, 60 and 120 min post-glucose injection. Blood glucose concentration were measured using Accu-chek Aviva glucose meters (Roche, Basel, Switzerland).  DEXA At 14 weeks of age, mice were anaesthetised with isoflurane (IsoFlo, Zoetis, UK). Body composition [fat mass (g), fat percentage estimate (%), lean mass (g), bone mineral density (g/cm 2 ) & bone mineral content (g)] were measured using an Ultrafocus 100 (Faxitron, Tucson, AZ, USA). Nose to tail base length measurements were performed using a ruler with 1mm graduations after DEXA measurements. An internal calibration process was performed on the Ultrafocus 100 before each imaging session.
 Blood Sample & Tissue Collection At 16 weeks of age, blood was collected without pre-fasting by retro-orbital sinus puncture under terminal anaesthesia (100mg/kg Ketamine and 10mg/kg Xylazine, i.p.) between 08:00 and 10:30 hours. 100 µl of blood was collected into EDTA-coated paediatric tubes (Kabe Labortechnik GmbH, Numbrecht, Germany) for standard haematological analysis and flow cytometric analysis of peripheral blood leukocytes. The rest (~800 µl) was collected into heparinised paediatric tubes (Kabe Labortechnik) for plasma clinical chemistry analysis and insulin assay (MRC-CORD Mouse Biochemistry Laboratory). Selected tissues were then collected for either the Sanger Biobank or further analysis performed by external collaborators specializing in a particular organ.
 Laboratory Analyses Within 1 hour of collection, heparinised whole blood samples were centrifuged at 5000 rcf for 10 minutes at 8°C, the plasma removed and stored at 4°C until analysis. For clinical chemistry, plasma was assessed at 37oC for standard parameters on an Olympus AU400 analyser using kits and controls supplied by Beckman Coulter. Samples were also tested for haemolysis (Bernardi et al. 1996). Four parameters were measured on the Olympus AU400 analyser by kits not supplied by Beckman Coulter: non-esterified free fatty acids (NEFAC -Wako Chemical Inc., Richmond, USA), glycerol (Randox Laboratories Ltd., UK) (Champy et al. 2004), fructosamine (Roche Diagnostic, UK) and thyroxine (Thermo Fisher, MA, USA). An automated electrical impedance cell counter was used to perform white blood cell counts (WBC (x10³/µl)), red blood cell counts (RBC (x10⁶/µl)), red blood cell distribution width (RDW(%)), haematocrit (HCT (%)), mean cell volume (MCV (fl)), mean corpuscular haemoglobin (MCH (pg)), mean corpuscular haemoglobin concentration (MCHC (g/dl)), platelet counts (PLT (x10³/µl)) and mean platelet volume (MPV (fl)) measurements, whilst haemoglobin (HGB (g/dl)) was measured by spectrophotometry (Scil Vet abc+, Viernheim, Germany). Internal quality controls were run on a daily basis and external quality controls, assessed on a two weekly interval, were done by the RIQAS scheme (Randox, UK). Whole blood is stained with two titrated cocktails of antibodies. Panel 1 containing CD45, TCRab, TCRgd, CD161, CD4, CD8, CD25, CD44, CD62L and KLRG1. Panel 2 containing CD45, CD19, IgD, Ly6B, Ly6G, Ly6C, CD11b and I-A/I-E. Samples are fixed and red blood cells lysed prior to acquisition on a BD LSR II flow cytometer after running automated compensation using compbeads and BD FACSDiva software. Data is analysed using FlowJo after singlet doublet discrimination, a time gate is used to exclude HTS issues and leukocytes identified with a SSC and CD45 gate.
 Statistical Analysis A high throughput pipeline with multiple analyses and types of phenotyping data means that a single power calculation is inappropriate. Instead, the pipeline has been developed through empirically by selection of a workflow which balances rate of phenotype discovery with cost effectiveness. For phenotyping analysis, the individual mouse was considered the experimental unit within the studies. Statistical tests performed to compare genotype means, with sex as a covariate. For continuous data, an iterative top down mixed modelling strategy was performed as described in 45 ,using PhenStat version 2.8.0, an R package freely available from Bioconductor 46 . The package's mixed model framework was used, and all analysis was performed by setting the argument equation to "withoutWeight" (Eq. 1) and dataPointsThreshold was set to 2. Insulin data was log-transformed prior to analysis. The model optimisation implemented will adjust for unequal variances. The genotype contribution test p value was adjusted for multiple testing to control the false discovery rate to 5%.
Y ~ Genotype + Sex + Genotype*Sex + (1|Batch) [Eq. 1] For the categorical data, a Fisher Exact Test was fitted comparing the proportions seen between wildtype and knockout mice for each sex independently using PhenStat FE Framework with the default setting. The minimum p value returned from the two tests for a variable was adjusted for multiple testing to control the false discovery rate to 5%.    Figure 4: (a) The number of pseudogenes predicted by our annotation pipeline for each assembly is broken down by type. Processed pseudogenes are a result of a retrotransposition event, and consequently the introns have been spliced out resulting in a monoexon-like structure. Unprocessed pseudogenes have been formed through a duplication process and follow closely the intron-exon structure of their functional homolog. A smaller number of pseudogenes are related to immune-system related genes while another small fraction of pseudogenes could not be assigned a definite biotype. These cases are grouped under the "Other" category. The number of pseudogenes predicted per genome is roughly ~13,000, with slightly smaller numbers for more distant strains (e.g. Mus spretus, M. m. castaneus). (b) The corresponding peptides are subjected to a six-frame BLAST against the strain genome to identify homologous loci that have not been previously annotated. Each potential match is checked for a number of disablements such as insertions, deletions, stop codons and frame-shifts. The annotated pseudogene set is intersected with the previously identified pseudogenes by lifting over the manual annotation to each strain genome. The result is a 3-tier confidence hierarchy, where level 1 refers to pseudogenes annotated by both automatic means and lift over of manual annotation, level 2 focuses on pseudogenes identified only by lifting over the manually annotated GENCODE pseudogenes, and level 3 is composed of only automatically annotated pseudogenes.    Probe colour for Fiber-FISH photos: blue -5'end, white -3'end, green -H60 alleles and adjacent region, red -Raet1 alleles and adjacent region. Fiber-FISH hybridisation caused by a genomic duplication without H60 coding regions are marked with an asterisk ('*'). All fiber-FISH have been confirmed by at least five times identical patterns. (c) Phylogenetic tree of H60 and Raet1 alleles (amino acid, neighbour joining method) Note gene H60a is not encoded in C57BL/6 genome. 22

Nlrp1a Nlrp1b2
Nlrp1f-ps Nlrp1e   chimpanzee and gorilla the ancestral Efcab3-like is broken into two pieces due to a 15Mbp inversion on chr17. The ancestral promoter can be seen on the EFCAB13 side in the H3K4Me3 track, and expression across many tissue types in the transcription track (bottom). In contrast, the EFCAB3 side has no promoter signal and minimal measurable expression (top).
GRCh38 ( Total  IG_C_gene  IG_J_gene  IG_D_gene  IG_D_pseudogene  IG_V_gene  IG_V_pseudogene   TEC  TR_J_gene  TR_V_gene  TR_V_pseudogene   antisense  lincRNA  miRNA  misc_RNA  processed_pseudo Discordant  with the  GenBank  Entries  ---T11902C  -T5335C :5182.1A  ---G8889A  -® -C57BL/6NJ was used as the reference for comparisons as this strain carries a common mtDNA haplotype. † -C57BL/6J was not sequenced by this effort. This strain was included in comparisons due to use as a universal control strain. * -Marks a SNP that is discordant with the mtDNA sequence that was previously published° -A colon (:) is used to mark a missing base in the reference strain where an insertion was identified.   Width of the total brain cm 4_TB_height_CS1 Height of the total brain at CS1 (coronal critical section 1) cm 4_TB_height_CS2 Height of the total brain at CS2 (coronal critical section 2) cm 2 4_TCTX_area

Supplementary
Total cortical area cm 2 4_M2_length Length of the secondary motor cortex cm 4_M1_length Length of the primary motor cortex cm 3 4_Pons_height Height of the pons cm 4 4_TC_area Total cerebellar area cm 2 4_IGL_area Area of the internal granular layer of the cerebellum cm 2 4_Folia Number of folia digit 4_Med_area Area of the medial cerebellar nucleus cm 2