The mammalian Major Histocompatibility Complex (MHC) region contains several gene families characterized by highly polymorphic loci with extensive nucleotide diversity, copy number variation of paralogous genes, and long repetitive sequences. This structural complexity has made it difficult to construct a reliable reference sequence of the horse MHC region. In this study, we used long-read single molecule, real-time (SMRT) sequencing technology from Pacific Biosciences (PacBio) to sequence eight Bacterial Artificial Chromosome (BAC) clones spanning the horse MHC class II region. The final assembly resulted in a 1,165,328 bp continuous gap free sequence with 35 manually curated genomic loci of which 23 were considered to be functional and 12 to be pseudogenes. In comparison to the MHC class II region in other mammals, the corresponding region in horse shows extraordinary copy number variation and different relative location and directionality of the Eqca-DRB, -DQA, -DQB and –DOB loci. This is the first long-read sequence assembly of the horse MHC class II region with rigorous manual gene annotation, and it will serve as an important resource for association studies of immune-mediated equine diseases and for evolutionary analysis of genetic diversity in this region.
More than twenty years ago, John Trowsdale described the Major Histocompatibility Complex (MHC) as “the center of the immune universe” to emphasize its importance for initiating specific immune responses1. Indeed, the statement has been validated by the extremely large number of immune-mediated diseases associated to the human MHC (for review see ref. 2). The MHC is gene dense and characterized by considerable genomic complexity. Genes in the complex are traditionally divided into the MHC class I, -II and –III regions. The classical MHC class II molecules (DR, DQ and DP) are heterodimeric cell surface glycoproteins consisting of α- and β-chains that bind and present exogenous peptides to circulating CD4+ T-helper cells, thereby initiating immune responses against pathogens3.
The human MHC, denoted Human Leucocyte Antigen (HLA), spans approximately 3.6 Mbp on the short arm of chromosome six (6p21)4. Association of HLA genetic variation with increased susceptibility or resistance to inflammatory and autoimmune diseases, such as type 1 diabetes5,6, narcolepsy7, coeliac disease6 and rheumatoid arthritis8 is well documented. Furthermore, in the co-evolution of host-pathogen interactions, MHC variation has also been shown to influence resistance or susceptibility to infectious diseases. A classical example of this is the association of HLA-B53 with resistance to severe malaria in West Africa9,10.
In the horse, MHC association to immune-mediated diseases (e.g. Insect Bite Hypersensitivity11,12,13, Equine Sarcoids14 and Equine Recurrent Uveitis15), has also been demonstrated. The horse MHC region, designated ELA, is located on chromosome 20q14-q2216 and, as expected, contains highly polymorphic genes encoding classical MHC class II molecules17,18,19,20,21. The horse genome reference sequence (EquCab2) from the Thoroughbred mare Twilight was released in 200722. To complement the assembly of EquCab2, Children’s Hospital Oakland Research Institute created a BAC clone library CHORI-24123 from the Thoroughbred male Bravo. Using hybridization screening of CHORI-241 library, BAC clones constituting a minimum tiling-path of the horse MHC was identified and an ordered BAC contig map was constructed24,25.
Generating a reliable high-quality de novo sequence assembly using traditional Sanger-sequencing or short read next-generation sequencing technologies over a complex region, such as the horse MHC class II, is inherently difficult due to the many multigene families with copy number variation of paralogous loci, long repetitive elements and extensive allelic polymorphism.
Recent studies have shown that the long-read single molecule, real-time (SMRT) sequencing technology from Pacific Bioscience (PacBio) can be successfully applied to resolve complex genomic regions and improve genome assemblies26,27,28,29. The purpose of this study was to provide an enhanced and carefully annotated reference sequence of the MHC class II region.
Long-read sequencing and assembly of the horse MHC class II region
To resolve the repetitive complexity and genomic structure of the horse MHC class II region (~1.2 Mbp), we sequenced eight large-insert clones of the CHORI-241 Horse BAC library with PacBio long-read single molecule, real-time (SMRT) sequencing technology. These eight clones, derived from the male horse Bravo, had previously been shown to constitute a minimum tiling-path over the horse MHC class II region24,25. The strategy of sequencing single BAC clones resulted in mean read coverage exceeding 300-fold for each BAC clone, while the pooled sequencing strategy resulted in a mean coverage ranging from 108- to 287-fold (Table 1). The depth of each BAC assembly is shown in detail in Supplementary Fig. S1.
To produce a de novo assembly of the horse MHC class II region, we identified overlapping regions (junctions) of the eight sequenced BAC clones. The six detected junctions ranged from 13,475 to 141,592 bp (Table 2, junctions 1 to 6). Based on the previously published minimum tiling-path of the horse MHC class II region, we expected the final clone assembly to be a single contig. However, contrary to expectations, the BAC clones CH241-135M23 and CH241-455C07 were found to be non-overlapping. Thus, the assembly of the eight BAC clones resulted in two contigs, designated Contig 1 and Contig 2 (Fig. 1). We assembled Contig 1 from five (CH241-407K07, -288J19, -441N13, -440J07 and -135M23) and Contig 2 from three (CH241-455C07, -359L18 and -147K21) BAC clones.
We observed more than a 50-fold higher proportion of mismatched bases in junctions 5 (clone CH241-147K21 and CH241-455C07) and 6 (CH241-147K21 and CH241-359L18) as compared to junctions 1 to 4 (Table 2). To investigate if the higher proportion of mismatched bases could be explained by heterozygosity in the genomic sequence of the horse Bravo, five regions with dense SNP variation (in total 46 SNPs) and four small INDELs (1–2 bp) in junctions 5 and 6 were Sanger-sequenced. The results showed that 45 of the 46 SNPs and three of the four small INDELs were heterozygous in genomic DNA of the horse Bravo (detailed results in Supplementary Table S1). Agarose gel electrophoresis of a PCR product also confirmed that Bravo is heterozygous for the 247 bp INDEL. The sequence of this INDEL was identified as an Equine Repetitive Element 1 (ERE1). Insertion polymorphism of this transposable element has been previously reported30 and is considered to represent a recent integration in the horse genome.
The increased proportion of mismatched bases in junctions 5 and 6, and Sanger-sequencing results suggested that the BAC clones CH241-455C07 and CH241-359L18 originated from the same chromosome 20 of the horse Bravo (indicated as Chromosome A in Fig. 1) and that the BAC clone CH241-147K21 was derived from the other homologous chromosome 20 (indicated as Chromosome B in Fig. 1). Assuming that all observed SNPs in junctions 5 and 6 are reflecting “true” nucleotide difference, the genetic distance of the two chromosomes was estimated to be 0.0023 (p-distance). Applying a mutation rate of 7.242 × 10−9 per site per generation31, we estimated the divergence time between the two chromosomes to be approximately one million years, which is comparable to the divergence time of alleles from the same allelic lineage in human32. The entire list of heterozygous positions is provided in the Supplementary Dataset 1. Junctions 1 to 4 showed a low proportion of mismatched bases (≤0.004%), suggesting that all BAC clones in Contig 1 originated from the same chromosome (i.e. Chromosome A in Fig. 1).
The gap separating Contig 1 and 2 was closed using PacBio long-read SMRT sequencing of two overlapping long-range PCR amplicons (detailed information in Supplementary Table S2 and Fig. S2). The amplicon sequencing resulted in five error corrected full-length reads (19,701 bp) for Amplicon 1 and 47 error corrected full-length reads (14,649 bp) for Amplicon 2. The final assembly of Contig 1 and 2 together with Amplicon 1 and 2 yielded three extra junctions (Table 2, Junctions 7 to 9).
A single 1 bp INDEL difference and no base pair substitutions were observed in the junction 8 and no mismatched base pairs were observed in junctions 7 and 9. These junctions connected the Contig 1 with Contig 2 into a continuous gap free 1,165,328 bp genomic sequence, spanning the entire classical MHC class II region and part of a ‘gene desert’ located upstream of the class II region (Fig. 1). With the exception of a region between the BAC clones CH241-455C07 and -359L18 (denoted Chromosome B in Fig. 1) that partially spans the TAP2 locus, the resulting genomic sequence appears to be derived from a single chromosome.
Gene Annotation of the horse MHC class II assembly
In total, we identified 35 genes in the final assembly of the horse MHC class II region (Fig. 1). The manually curated gene models are described in a gff2 file format (Supplementary Dataset 2). Of the 35 identified loci, 20 had undisrupted open reading frames supported by mRNA data (Eqca-DRA1, -DRB1, -DQA1, -DQB1, -DQA2, -DQB2, -DQA3, -DQB3, -DRB2, -DRB3, -DOB1, TAP2, PSMB8, TAP1, PSMB9, Eqca-DMA, -DMB, BRD2, Eqca-DOA and COL11A2) while three loci had undisrupted open reading frames without supporting mRNA evidence (LOC504295, LOC525599, BTNL2). We classified ten loci as unprocessed pseudogenes as the predicted exon structure displayed disruptive mutations in the open reading frame (SIRPA, Eqca-DQB4, -DQA4, -DRB4, -DOB2, -DRB5, -DOB3, -DRB6, -DPB1, -DPB2) and one locus was identified as a processed ribosomal protein pseudogene RPS27. All pseudogenes except SIRPA and RPS27 were members of multigene families (Fig. 2). In addition, we identified a single pseudogene with 80% sequence similarity to the human non-classical HLA-G class I locus. Gene content of each BAC clone is described in Supplementary Table S3.
Classical MHC class II genes
Six Eqca-DRB loci of which three are functional, were identified in the equine class II region. A single Eqca-DRA1 locus was located on the forward strand in a tail-to-tail orientation with Eqca-DRB1, separated by approximately 17.5 kbp. The other Eqca-DRB genes were located on the forward strand. We observed that intron 1 of Eqca-DRB3 (derived from the BAC-clone CH241-455C07) had a potentially disrupting SNP in the splice acceptor site (AG > TG). All other sequenced DRB genes and the equivalent positions in EquCab2 had the functional splice acceptor site (AG).
We identified four pairs of Eqca-DQA and -DQB genes. Each pair was organised in tail-to-tail orientation (Fig. 1). Alignment of the predicted protein sequences of all functional Eqca-DQB genes to HLA-DQB1 and –DQB2 showed that exon 6 of the Eqca-DQB3 encoded a protein three amino acids longer than that encoded by the homologous HLA genes (Supplementary Fig. S3). Further, a nucleotide sequence comparison of the Eqca-DQB1 and -DQB3 locus revealed a ~400 bp deletion spanning the protein coding sequence of the conserved exon 6 and part of intron 5 of the Eqca-DQB3 gene. However, the spliced alignment of a single mRNA read33 at this locus suggested that a downstream 3′UTR sequence might be used as an alternative exon 6 (see Supplementary Fig. S4).
The coding sequences of the functional DR and DQ genes were compared to publicly available MHC alleles at MHC-IPD34 and the detailed results and alignments can be found in the Supplementary Dataset 3 and Table S4.
Non-classical MHC class II genes
The annotation resulted in the identification of four functional non-classical MHC class II genes (Eqca-DOA, -DOB, -DMA and -DMB) encoding the DO and DM molecules. When comparing the CDS of our gene model for Eqca-DOA to the RefSeq35 annotation of HLA-DOA, we observed a length difference of 36 bp (see Supplementary Dataset 3). The alignment of the CDS of Eqca-DOA to HLA-DOA and the corresponding region in EquCab2 revealed a SNP (GT > GC) at the splice site at the 5′-end of intron 5 (Supplementary Fig. S5). By aligning mRNA evidence at Eqca-DOA locus, we located an alternative canonical GT splice site, 36 bp downstream the human splice site, which predicts a 36 bp larger exon 4 in Eqca-DOA.
Density and distribution of horse SNP locations
To facilitate the experimental design and result interpretation from genome wide association studies (GWAS) we investigated the density of the probe sequences of Axiom Equine Genotyping Array and a horse SNP collection from the Broad Institute spanning the Bravo MHC class II region (Fig. 1). Overall, the results showed that the horse MHC class II region was densely covered with SNPs (Fig. 1). However, several classical class II genes, such as Eqca-DQA1, -DQB1, -DQA2 and -DQB2 were covered poorly.
Sequence comparison to EquCab2 assembly over the MHC class II region
We aligned each PacBio sequenced BAC clone insert to the MHC class II region of EquCab2. A summary of sequence differences observed from the alignments is described in Table 3. The estimated genetic distance (p-distance) and the two sequence identity scores, per-base and per-event identity, showed that clone CH241-455C07 was the least similar and clone CH241-288J19 was the most similar to EquCab2. It is important to note that we were not able to compare sequences that coincide with the gap regions of EquCab2. These regions were excluded from the analyses.
To further investigate structural differences between our final assembly (Bravo MHC Class II) and EquCab2, we analysed dot-plots and created a MAUVE alignment. The dot-plot analysis of the entire horse MHC class II region did not reveal any major structural discrepancies (Fig. 3A). However, the MAUVE alignment pinpointed a region with potential structural discrepancies between these two sequences. The overview of SNP and INDEL density of the entire region revealed a considerable increase in density starting with the region of the third intron of the Eqca-DRB2 gene (Fig. 3A). Detailed analysis of the region harbouring the potential structural discrepancies (Fig. 3B) indicated two ~900 bp regions, designated as I and II with different relative location. The genomic sequence of these regions corresponded to a non-LTR retrotransposon L1MA9 that belong to L1 LINE family.
MHC class II structure in mammals
There is a limited number of placental mammals with reliable MHC class II assembly and annotation that can be used to investigate structural differences. Considering this, we chose to compare the MHC class II region of seven mammals from the clade Boreoeutheria that contains the two taxa Euarchontoglires (represented by mouse and human) and Laurasiatheria (represented by horse, pig, cattle, domestic dog and cat). The divergence estimates of these species36, chromosomal location and directionality of the MHC class II region can be found in Supplementary Fig. S6.
We observed a high degree of conservation among the seven species in three distinct locations across the MHC class II region – the region from TAP2 to Eqca-DOA, as well as the two genes BTNL2 and COL11A2 (Fig. 4). In the five laurasiatherian mammals the conserved region of BTNL2 gene also included the genes LOC504295 and LOC525599.
The number of paralogous DR and DQ genes was found to be variable among the compared species. For example, horse has four Eqca-DQA and –DQB loci, while there is a complete absence of DQ genes in cat. In horse, the six identified Eqca-DRB loci were distributed across a region spanning ~0.7 Mbp with intervening genes separating the Eqca-DRB loci. All DRB genes in human, mouse, domestic cat, pig and cattle were clustered together and located on the reverse strand in tail-to-tail orientation to the DRA gene. The phylogenetic analysis of mammalian DQ and DR gene families indicated a species-specific clustering of most of the paralogous loci (Supplementary Fig. S7). In contrast to HLA, we found no evidence for the existence of functional Eqca-DPA or -DPB genes in horse or in the other compared species. All compared species, aside from pig, showed variable number of DP pseudogenes located between the highly conserved Eqca-DOA and COL11A2 loci.
Interestingly, the processed pseudogene RPS27 upstream the Eqca-DOB gene was present in domestic dog, but appears to be absent in all other investigated species. Domestic dog showed additional similarities to horse, such as the existence of only a single functional DRB gene located on the reverse strand in a cluster with the DRA gene and an inverted DRB pseudogene (Eqca-DRB4 and DLA-DRB2) located downstream of a DQ cluster. All of the compared laurasiatherian species had a single inverted DRB gene (DLA-DRB2, Feca-DRB4, SLA-DRB5 and Bota-DSB gene) and horse showed five inverted Eqca-DRB genes. However, no DRB inversions were observed in human and mouse.
To resolve the complex genomic structure of the horse MHC class II region, we sequenced eight BAC clones from the CHORI-241 library using long-read single molecule, real-time (SMRT) sequencing technology from Pacific Bioscience (PacBio)37. For each BAC we produced more than a 100-fold average sequencing coverage, ranging from 108- (CH241-440J07) to 572-fold (CH241-455C07). The PacBio generated errors are of random nature38, and by exceeding a threshold of 80-fold read coverage a perfect consensus sequence can be produced39. Thus, we expected the obtained coverage to be sufficient for the assembly of high quality consensus sequence with low error rate.
We have concluded that in contrast to all other sequenced clones, clone CH241-147K21 represents the genomic sequence from the alternative homologous chromosome 20 (Chromosome B in Fig. 1) with genetic distance of 0.23% (SNP-based distance). Therefore, we hypothesise that on average a SNP should be observed every 440 bp, if two BAC clone insert sequences are derived from different chromosomes. Due to the lack of SNP differences and an extremely low percentage mismatch of small INDELs in the junctions of these clones, we suggest that clones CH241-407K07, -288J19, -441N13, 440J07 and -135M23 are derived from the same chromosome. We also propose that the small INDELs observed in these junctions represent the PacBio INDEL error rate that have persisted the deep coverage and it is estimated to be very low (0.003%).
We were able to bridge the gap of the non-overlapping BAC clones (CH241-135M23 and CH241-455C07) with two long-range PCR amplicons that were full-length sequenced using SMRT sequencing technology. This confirmed the consecutive order of two non-overlapping clones. As we did not observe any base pair substitutions and only one single bp INDEL across the junctions 7 to 9 (4,557 bp in total), we suggest that Contig 1 is possibly derived from the same homologous chromosome 20 as clones CH241-455C07 and -359L18 (Chromosome A in Fig. 1).
The publicly available annotation of the horse MHC class II region in EquCab2 is based on automated gene prediction pipelines, cross-species alignment and the full-length mRNA sequences. As a consequence of the limited availability of equine mRNA sequence data and manual gene annotation, a fraction of previously predicted gene models are inaccurate and lack proper naming. Recently, mRNA sequencing data from several horse breeds and tissues were published33,40. These sequences were used to carefully validate predicted gene models, and, when possible, to extend the 5′ and 3′ UTRs. The annotation resulted in the identification of 35 manually curated genomic loci of which 23 were considered to be functional and 12 to be pseudogenes.
In placental mammals, three types of classical MHC class II antigen presenting molecules (MHC-DR, -DQ and -DP) are known. In most mammals, the DR α chain is encoded by a single Mhc-DRA locus with limited polymorphism. In contrast, the β-chain of the DR molecules is encoded by one of several polymorphic Mhc-DRB loci. Similarly, the DQ and DP molecules are encoded by α- and β-chains from their respective A and B loci. In horse, only two types of antigen presenting molecules, DR and DQ, are known.
Three functional Eqca-DRB genes have previously been identified41 and the IPD-MHC database34 contains 13 alleles derived from these loci. However, the first intron of the Bravo Eqca-DRB3 locus contained a disruptive acceptor splice site mutation in contrast to the corresponding acceptor splice site of the reference genome. The results suggest that the Eqca-DRB3 locus harbours both functional protein coding alleles and at least one null allele. This is similar to the HLA-DRB4 locus where null alleles have been identified42.
Evidence for at least two Eqca-DQB43 and two -DQA18 loci in horse was described more than a decade ago. Currently, the MHC-IPD database includes alleles from three Eqca-DQA and three -DQB loci. The results of our annotation are in concordance with the number of loci found in the MHC-IPD database, implying three functional Eqca-DQA and -DQB genes. However, all Eqca-DQB3 alleles currently present in the MHC-IPD are derived from partially sequenced cDNAs (the 2nd, 3rd and 4th exon). We observed that the conserved protein coding sequence of exon 6 of the Eqca-DQB3 gene has been deleted. In humans, a functional β-chain is encoded by six exons, where exon 1 encodes the signal peptide, exons 2 and 3 encode the 1st and 2nd domain, exon 4 encodes the membrane-spanning portion, and exon 5 and 6 encode the cytoplasmic tail of the β-chain44. Despite the deletion of exon 6, mRNA sequences of the Eqca-DQB3 gene are reported in MHC-IPD database, suggesting that this gene is transcribed in the horse. In addition, a single spliced mRNA read provided evidence for the transcription of the conserved exons 2 to 5 and an alternative exon 6. More analysis is needed to determine whether the predicted alternative exon 6 is being translated into a functional protein.
Based on sequence comparison of each of the eight BAC clones to the EquCab2 assembly, we observed considerable variation in estimated genetic distance (Table 3). The observed variation may be caused by sequencing and assembly errors in either the PacBio assembly or the existing EquCab2 reference genome. However, the true genetic variation that exists between the two individuals used to produce these sequences i.e. the horse Bravo (BAC clone library donor) and Twilight (EquCab2 assembly donor) is unknown. This, together with the fact that the EquCab2 assembly was generated by a combination of whole genome shotgun sequencing of Twilight and end sequencing of more than 300 000 BAC clones from Bravo22, makes it difficult to draw reliable conclusions from the sequence comparison. Nevertheless, the results clearly highlight the increased divergence of the clone CH241-455C07, that is the clone with the deepest PacBio read coverage (572-fold). This clone covers a challenging region of MHC class II that harbours three Eqca-DRB and two -DOB copies, thereby complicating assembly of this region. Considering that both sequenced horses are half siblings derived from an inbred MHC homozygous horse herd45 of a horse breed well known for having high inbreeding coefficient and low breed diversity46 we would not expect to observe high within-breed variation. The PacBio full-length sequenced reads make us confident in the sequence assembly of the two long-range PCR amplicons that show sequence discrepancies involving LINE element L1MA9. Hence we imply that even though rare INDEL errors are present in Bravo MHC class II assembly, the majority of the observed discrepancies between Bravo MHC class II sequence and the reference genome assembly represent combination of both, bona fide sequence variation between these two horses and the assembly as well as sequencing errors of EquCab2.
A comparative genomics analysis of the MHC class II regions from seven different mammalian species was performed. Despite the long divergence time of the compared species (85-100 million years ago)36, we could observe conserved gene order among seven distantly related mammals for the genes BTNL2, Eqca-DRA, TAP2, PSMB8, TAP1, PSMB9, HLA-DMB, HLA-DMA, BRD2, HLA-DOA and COL11A2, suggesting an orthologous relationship among these genes in mammals (Fig. 4). As expected, for the classical MHC class II genes poorly conserved gene order was found, where each of the species showed a different number, order and directionality of genes and pseudogenes. In human, there are five main HLA-DR haplotypes (DR1, DR51, DR52, DR8 and DR53) with variable number of DRB loci47 located in a single cluster in tail-to-tail orientation to a single HLA-DRA gene48. It remains to be seen if the plasticity of the human DR region with several DR haplotypes, is also present in the MHC of the horse. The other compared species in this study diverged from horse at least 85 million years ago. Considering the plasticity of the MHC class II genes49 and the accelerated birth-and-death evolution of multigene families50, it is likely that most of the gene duplications observed in horse occurred within the equid lineage. This was also supported by the topology of the phylogenetic trees of mammalian DR and DQ gene families, where all loci in horse were found in single clusters (Supplementary Fig. S7).
We would like to highlight an observation that all the compared laurasitherian mammals had a single inverted DRB gene, while this was not observed in human and mouse. This may potentially indicate that the inversion event occurred 85–100 million years ago, after the separation of Euarchontoglires but prior to the divergence of laurasitherian species. In horse, this ancestral inversion has evolved into a separate inverted DRB gene cluster, representing both functional genes and pseudogenes.
We have provided the first in-depth long-read sequenced and manually annotated MHC class II sequence of the order Perissodactyla. Long-read sequencing enabled us to resolve all difficult gap regions present in the current EquCab2 reference sequence and to identify the regions with possible errors that warrant further investigation. Manual gene annotation resulted in the discovery of several new loci and, together with comparative analysis, confirmed that locations of the MHC class II classical genes are distinctly different from the MHC class II region of other mammals. A continuous, well-annotated high quality “reference” sequence is essential for better experimental design of equine immune-mediated association studies, such as Insect Bite Hypersensitivity (Summer Eczema), Equine Sarcoids and Retinal Uveitis. The presented results will also be valuable for further evolutionary and genetic studies of the MHC class II region.
Materials and Methods
BAC clone culturing and DNA extraction
The eight Bacterial Artificial Chromosome (BAC) clones derived from the CHORI-241 Horse BAC library were propagated from LB agar stabs in LB-Agar high salt media with 25 μg/mL chloramphenicol. BAC-DNA was extracted with QIAGEN Large-Construct Kit (Qiagen, Hilden, Germany) according to the manufacturers’ protocol for clones CH241-147K21, -455C07, -135M23, -288J19 and -441N13. ATP-dependent nuclease treatment was omitted for clones CH241-440J07, -359L18 and -407K07 to obtain sufficient DNA yield for generating barcoded SMRTbell libraries.
BAC clone sequencing and assembly
SMRTbell libraries were prepared at the Uppsala Genome Center (Science for Life Laboratory, Uppsala University) according to the Pacific Biosciences 1.0 template preparation kit manufacturer’s instructions (Pacific Biosciences 10 kb template preparation protocol). The circular BAC clone DNA (2000 ng) was fragmented randomly using the Hydroshear at speed code 13 for 20 cycles. Sheared DNA was end-repaired and adaptor-ligated. The sequencing libraries were quantified on Qubit fluorometer (Thermo Fisher Scientific) and library size was confirmed using the Bioanalyzer 12000 kit (Agilent Technologies). For BAC clones CH241-147K21, -455C07, -135M23 and -288J19, each library was loaded on a separate SMRT cells and sequenced on a PacBio RSII system with P4/C2 chemistry (single clone sequencing strategy). For BAC clones CH241-440J07, -359L18, -407K07 and -441N13 barcoded adapters were added (pooled sequencing strategy) and equimolar amounts of three clones per SMRT cell were pooled before sequencing on the PacBio RSII system with P6/C4 chemistry. The movie time for sequencing was 240 minutes for all clones. De novo assembly of the generated sequencing reads was made by the hierarchical genome-assembly process (HGAP) algorithm51.
To analyse coverage of the assembly, we aligned all filtered subreads to the final assembly of each sequenced BAC clone construct with bwa v0.7.13 mem algorithm52 followed by SAMtools v1.3 ‘depth’ and ‘stats’53 analysis. Next, we trimmed the ends of all assemblies of clones CH241-147K21, -455C07, -135M23, -288J19 and -441N13, with less than 100-fold coverage, of clones CH241-407K07 and -359L18 less than 30- fold coverage and no trimming for clone CH241-440J07. GC content was analysed with BEDtools v2.25.054.
For each linear BAC construct assembly we identified the EcoRI cloning sites and removed the vector sequence. The two remaining segments of each linear BAC clone construct assembly were then joined using software GAP5 v1.2.14-r55 to generate a continuous insert sequence. The eight obtained insert sequences were further analysed with software GAP5 to detect overlapping regions (junctions) and the proportion of non-matching (mismatched) base pairs in these junctions.
Long-range PCR amplicon sequencing
Two long-range PCR primer sets were designed to amplify the missing genomic region between clone CH241-135M23 and CH241-455C07. PCR primer sequences were designed with Primer3 v.0.4.056 and can be found in Supplementary Table S5. As the template for amplification, 200 ng of genomic DNA from the horse Bravo was used for amplification with Roche Expand Long Template PCR System 3 (Roche Diagnostics, Basel, Switzerland). The final PCR reaction concentrations were: dNTP mix 450 μM, forward and reverse primer 300 nM (Amplicon 1) or 180 nM (Amplicon 2) each, PCR buffer with MgCl2 2.75 mM and 0.81 μl of Expand Long Template Enzyme mix. Thermal cycling was performed on a GeneAmp™ PCR System 9700 (Thermo Fisher Scientific) according to manufacturer’s protocol with 60 °C annealing temperature for Amplicon 1 and 61 °C for Amplicon 2. Both PCR products were pooled in a single sample (1660 ng) and sequenced on a PacBio RSII system (Science for Life Laboratory, Uppsala University) with P6/C4 chemistry. Filtered error corrected reads were assigned to Amplicon 1 or Amplicon 2 based on forward and reverse primer sequences and the full length amplicon sequences were further error corrected with Canu v.1.0.57 using the excess of shorter sequencing reads. Amplicons were used to join clone CH241-135M23 and CH241-455C07 in a single continuous sequence using GAP5 software.
To achieve a higher sensitivity in gene model discovery, we masked repetitive sequences present in the insert sequence of each BAC using the online version of RepeatMasker58. Masked sequences were subjected to automated and manual annotation. We analysed each assembled BAC clone insert sequence using the gene prediction software AUGUSTUS59 with the option of incorporating extrinsic data sources such as cDNAs, ESTs, RNA-seq data and homologous protein sequences. The detailed scheme, software and data set references used in the annotation pipeline are described in the Supplementary Fig. S8. Manual gene annotation was done following the Havana annotation guidelines60 and Drosophila melanogaster annotation review61. In the process of manual gene model annotation we evaluated the reliability and mapping quality of the mRNA evidence, splice site discrepancies, separated merged gene models of paralogous genes, start and stop codon positions, extended 5′ UTR regions and located correct 3′ end coordinates according to mRNA evidence. The final predicted open reading frames were translated with ExPASy (Swiss Institute of Bioinformatics, http://web.expasy.org/translate/) and used to search the human protein database using BLAST. Individual validated gene models were lifted over to the final non-repeat masked Bravo MHC class II assembly with a custom Perl script. Coding exons, 5′ and 3′ UTRs coinciding with the repetitive regions were extended based on to the mapping mRNA evidence.
Coding sequences of MHC class II genes and alleles at MHC-IPD database were aligned using EMBOSS Needle nucleotide alignment tool (EMBL-EBI, http://www.ebi.ac.uk/Tools/psa/emboss_needle/nucleotide.html).
Probe sequences from 670,796 markers of the Axiom Equine Genotyping Array (Affymetrix, Thermo Fisher Scientific) and the Broad Institute SNP collection (available at https://www.broadinstitute.org/ftp/distribution/horse_snp_release/v2/) were aligned to the final assembly with bwa -mem aligner v.0.7.13. Short 71 bp Axiom Equine Genotyping Array probe sequences were subjected to the stringent quality filtering, by removing overlapping probes and probes showing variation other than the tested SNP and sequences with mapping quality lower than 60.
Comparison to the EquCab2 assembly
We extracted the genomic region of EquCab2 chr20:32469145-33632786 for comparative analysis. The region was identified by aligning the two ends of the Bravo MHC class II assembly to the EquCab2 assembly. We aligned each BAC clone insert sequence to the extracted EquCab2 sequence and generated match and mismatch statistics (base pair substitutions, deletions and insertions) using BLASR62 software. We removed mismatching positions in the alignment caused by the unknown nucleotides (gaps) in the EquCab2 assembly. Observed base pair substitutions were used to estimate the genetic distance between the two sequences. Genetic distance was calculated as a proportion of the observed base substitutions to the total number of aligned bases excluding INDELs (p-distance). Per-base identity was estimated as a proportion of matched base pairs in the alignment to the total alignment length (including all bp in INDELs). Per-event identity was estimated as a proportion of matched base pairs in the alignment to the total number of events, where each base pair substitution and each INDEL was regarded as a single event, independent of size. We analysed variation with PipMaker dot-plot63 and Mauve version snapshot_2015-2-2564. Genomic sequence of the small structural discrepancies was compared to the genomic repetitive elements in Repbase Update database65.
Investigation of heterozygous positions
Additional Sanger-sequencing was performed for five regions with dense SNP and INDEL variation at the junctions of the BAC clone CH241-147K21 with clones -455C07 and -359L18. PCR primer sequences were constructed with Primer3 and are provided in Supplementary Table S5. Amplification and sequencing was done using the BigDye Direct Cycle Sequencing Kit (Thermo Fisher Scientific) on a ProFlex PCR System (Thermo Fisher Scientific) and GeneAmp PCR System 9700 (Thermo Fisher Scientific). Genomic DNA of the horse Bravo (4 ng) was used in a 10 μl amplification reaction with an annealing temperature of 62° in accordance with the manufacturers’ protocol. Cycle sequencing was performed with both the forward (5′-TGTAAAACGACGGCCAGT-3′) and reverse (5′-CAGGAAACAGCTATGACC-3′) sequencing primers on an ABI 3500XL DNA Analyzer (Applied Biosystems) and sequences were then analysed using CodonCode Aligner v5.0.2 (CodonCode Corporation).
Heterozygosity of the 247 bp INDEL was investigated by PCR amplification and agarose gel electrophoresis. Amplification was done as described above for Sanger-sequencing and the fragment size was determined using 4% agarose gel to detect the heterozygosity of the INDEL. Genomic sequence of the INDEL was compared to the genomic repetitive elements in Repbase Update database65.
The genetic distance was calculated as described above for the comparison to the EquCab2 assembly and the divergence time was estimated by dividing estimated genetic distance with twice the mutation rate per site per year.
Seven well annotated mammalian MHC class II regions (house mouse, human, horse, domestic dog and cat, pig and cattle) were chosen for comparative analysis. The annotation of the house mouse, human, pig, and dog was obtained from the Vega Genome Browser66 (release 64) and of the domestic cat and cattle from the previous publications67,68 and the RefSeq35 database. Alignments (Supplementary Dataset 4) were generated for mammalian coding sequences of MHC-DRA, -DRB, DQA, and -DQB loci using MUSCLE with the “Align Codon” option implemented in MEGA5.2.269. Neighbor-Joining trees for each locus were then constructed from estimated Jukes-Cantor distances (pairwise deletion) using MEGA5.2.2. The robustness of the trees was tested by 10,000 bootstrap replications.
The raw sequencing data of BAC clone, LR-PCR amplicon and Sanger-sequences (Detailed information in Supplementary Table S6) can be found in European Nucleotide Archive under study number PRJEB15527 at http://www.ebi.ac.uk/ena/data/view/PRJEB15527. The final assembly and annotation of Bravo MHC class II region (LT745777) is available from the ENA browser at http://www.ebi.ac.uk/ena/data/view/LT745777.
How to cite this article: Viļuma, A. et al. Genomic structure of the horse major histocompatibility complex class II region resolved using PacBio long-read sequencing technology. Sci. Rep. 7, 45518; doi: 10.1038/srep45518 (2017).
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The authors would like to acknowledge support of the National Genomics Infrastructure (NGI)/Uppsala Genome Center and UPPMAX for providing assistance in massive parallel sequencing and computational infrastructure. Work performed at NGI/Uppsala Genome Center has been funded by RFI/VR and Science for Life Laboratory, Sweden. The genomic DNA from the CHORI-241 BAC clone library donor horse Bravo was kindly provided by Donald Miller and Doug Antzak (Cornell University, NY, USA) and we would also like to thank Dan Andersson and his research group (Uppsala University, Sweden) for the initial propagation of LB stubs and large construct DNA extraction of the BAC clones. We kindly thank BILS bioinformatics advisory program and Marcel Martin for help with computational analyses. We also would like to acknowledge the masters students who participated in BAC-DNA extraction and design of primers for long-range PCR – Lisa Johansson, Marta Godia Perello, Helmuth Raser and Mathias Tabat. This study was funded by the Swedish-Norwegian Foundation for Equine Research (SSH).
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/