Main

The domestic goat (Capra hircus) is widely reared throughout the world, especially in China, India and other developing countries1. Goats serve as an important source of meat, milk, fiber and pelts, and have also fulfilled agricultural, economic, cultural and even religious roles since very early times in human civilization2. Evidence indicates that the goat might have been domesticated from two wild Capris (Capra aegagrus and Capra falconeri) 10,000 years ago within the Fertile Crescent, and then spread quickly following patterns of human migration and trade3. Today, there are >1,000 goat breeds, and >830 million goats are kept around the world according to a report by the UN Food and Agriculture Organization (http://www.fao.org/corp/statistics/en/). In addition to their value as domestic animals, goats are now used as animal models for biomedical research, to investigate the genetic basis of complex traits and in the transgenic production of peptide medicines4,5. Despite the agricultural and biological importance of goats, breeding and genetics studies have been hindered by the lack of a reference genome sequence. In this work, we combined Illumina next-generation sequencing technology and whole-genome mapping of large DNA molecules to obtain a genome sequence for the domestic goat. We then annotated the genome and identified rapidly evolving genes. Furthermore, based on an annotated set of goat genes, we generated and compared transcriptomic data from secondary hair follicles (which produce the cashmere fiber) with data from primary hair follicles of the Inner Mongolia cashmere goat, shedding light on the genetic basis of the formation of cashmere fibers.

Whole-genome mapping is an improved high-throughput optical mapping technology. Optical mapping has been used to compare the structures of bacterial genomes6,7,8, complete bacterial genome assembly9,10, assist in bacterial artificial chromosome (BAC) assembly11 and correct genome assembly errors12. Two plant genomes13,14 have recently been sequenced using BACs assembled through such optical mapping. However, the traditional process for generating optical mapping data involves mostly manual steps, and as a result the primary applications of optical mapping have been in the assembly of bacterial genomes, integration of BAC end sequencing and BAC clone assembly, and iterative correction of assemblies. Although traditional optical mapping methods have been applied successfully, they are complex and have low throughput, primarily because the required DNA extension, image capture and data analysis steps are inefficient. As a result, it has not been possible to generate and handle the massive quantities of optical mapping data that are required for the assembly of a large and complex genome.

To obtain a whole-genome restriction map of the goat, we used an automated, high-throughput whole-genome mapping instrument and recently developed data processing software. The instrument uses a chip-like channel formation device (CFD) to stretch and immobilize single DNA molecules onto a positively charged glass surface within a disposable cartridge (Fig. 1a and Supplementary Methods). This, combined with automated imaging and data analysis, addresses many of the inefficiencies that have limited the application of optical mapping to large genomes. The instrument automatically produced 100,000 single-molecule restriction maps in 3 h, providing 12× physical coverage of the goat genome. We then used a hybrid assembly approach to generate super-long scaffolds (super-scaffolds) by combining experimentally measured single-molecule maps with in silico restriction maps computed from scaffolds assembled from Illumina sequencing data (Fig. 1b and Supplementary Methods). The long super-scaffolds facilitated the anchoring of scaffolds onto chromosomes.

Figure 1: Whole-genome mapping.

(a) Samples are loaded onto a chip-like, high-density, channel-forming device (CFD). Buffer fluid flowing through the channels stretches high molecular weight DNA onto a positively charged glass surface, which maintains the orientation and integrity of the DNA during subsequent steps. The immobilized single molecules of DNA are digested with a restriction enzyme for 10 min at 37 °C, stained with the dye JOJO-1 and imaged. The images are analyzed channel by channel to filter out nonlinear distorted fragments and small molecules, identify gaps between fragments and measure the size of retained high-quality fragments (colored green) to produce single-molecule restriction maps. (b) Scaffolds derived from de novo assembly of next-generation sequencing data are converted into restriction maps by in silico restriction enzyme digestion. Then, the distances between restriction enzyme sites in the sequencing-derived scaffolds are matched to the lengths of the optical fragments in the single-molecule restriction maps. Matches allow the scaffolds to be extended and linked into super-scaffolds.
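
The in silico digestion step in panel b can be illustrated with a minimal sketch. It assumes a complete digest with SpeI (recognition sequence ACTAGT, cut after the first A) and a plain string as input; the function name `in_silico_digest` and the toy scaffold are hypothetical and are not part of the software used in this study.

```python
# Minimal sketch of in silico restriction digestion (illustrative only).
# Assumption: complete digestion; SpeI recognizes ACTAGT and cuts after the first A.
SPEI_SITE = "ACTAGT"
CUT_OFFSET = 1  # number of bases of the site that stay on the left-hand fragment


def in_silico_digest(sequence, site=SPEI_SITE, cut_offset=CUT_OFFSET):
    """Return the ordered fragment lengths (bp) produced by cutting at every site."""
    seq = sequence.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Hypothetical 26-bp toy scaffold containing two SpeI sites.
toy_scaffold = "GGGG" + "ACTAGT" + "CCCCCCCC" + "ACTAGT" + "TT"
print(in_silico_digest(toy_scaffold))  # [5, 14, 7]
```

The resulting ordered list of fragment lengths is the in silico restriction map. The first and last fragments are bounded by a scaffold end rather than by a restriction site, so only the interior fragments are directly comparable with optical fragments.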

Results

Short-read de novo sequencing and assembly

We sequenced genomic DNA from a 3-year-old female Yunnan black goat. High-quality DNA extracted from liver tissue was used to construct 14 paired-end sequencing libraries with insert sizes of 180 bp, 350 bp, 800 bp, 2 kb, 5 kb, 10 kb or 20 kb (Supplementary Table 1). Using the Illumina sequencing platform, we generated 191.5 Gb of high-quality reads (65.6-fold coverage of the estimated genome size), with read lengths ranging from 45 to 101 bp (Supplementary Fig. 1). These sequences were assembled de novo using SOAPdenovo (version 1.03) software15, resulting in 542,145 contigs and 285,383 scaffolds longer than 100 bp. The contig N50 size (the length at or above which contigs account for half of the total length of the sequence set) was 18.7 kb, and the scaffold N50 size was 2.21 Mb (Table 1). To extend the length of scaffolds, we sequenced the ends of a fosmid library with an average insert size of 40 kb, constructed from DNA of the same goat (Supplementary Methods and Supplementary Fig. 2). A total of 2,041,189 paired unique sequences were generated from the fosmid ends, of which 140,296 pairs matched different scaffolds and were thus usable for joining scaffolds. This process increased the scaffold N50 size to 3.06 Mb (Table 1) and yielded a 2.66-Gb assembly, including 140 million Ns (5.26%) that fill gaps. The assembled genome is 91% of the estimated 2.92-Gb size of the goat genome, based on predictions using the 17-mer method (Supplementary Fig. 3). To validate the quality of this assembly, we mapped onto it the raw reads generated from the small-insert libraries, which had been used for contig assembly and gap filling. Over 89% of the raw paired-end reads could be mapped to the assembled goat genome, of which 95% had the correct orientation and correct distance between the ends, indicating that the assembly is largely correct at the local level (Supplementary Table 2).
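
The N50 statistic used above can be made concrete with a short sketch. The function name `n50` and the example lengths are hypothetical; the definition follows the one given in the text (the length at or above which sequences account for half of the total).

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() requires at least one sequence of nonzero length")


# Hypothetical toy assembly: total length 30, so half is 15.
# The two longest sequences (10 + 6 = 16) already cover half of the total.
print(n50([2, 3, 4, 5, 6, 10]))  # 6
```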

Table 1 Genome assembly statistics

Super-scaffold construction

Information about large linkage groups, such as chromosomes, is important for linkage analysis in animal breeding. Although assembling next-generation sequencing data into a draft genome comprising scaffolds is relatively straightforward, constructing a physical map of the structure of the chromosomes is still difficult and costly. Because a genetic or physical map is not yet available for the goat, we used whole-genome mapping technology to generate a restriction map of the goat genome and then assembled scaffolds into super-scaffolds that were on the order of the length of full chromosomes.

To obtain single-molecule restriction maps, we used large DNA molecules from a fibroblast cell line established from skin from the ear of the sequenced female Yunnan black goat (Supplementary Fig. 4). A total of 3,447,997 single-molecule restriction maps longer than 250 kb each, with an average size of 360 kb, were generated using the SpeI restriction enzyme. The total size of the restriction map data was 1,241 Gb. A hybrid assembly algorithm, which compares the experimentally determined restriction maps with the in silico restriction maps computed from scaffolds assembled from short-read data, was used to identify adjacent scaffolds and determine their relative location and orientation (Supplementary Methods). This process joined 2,090 scaffolds, which had an average length of >1.2 Mb, into 315 super-scaffolds. The final assembly had an N50 of 16.3 Mb and covered 92% of assembled scaffolds. The remaining 8% of scaffolds were too small (average length of 713 bp) to be used for whole-genome mapping. The largest super-scaffold was 56.4 Mb (Table 1).
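
As a consistency check on the figures above, the total map size and the implied physical coverage can be recomputed from the number of maps, their average size and the estimated genome size. This is plain arithmetic on values quoted in the text; the variable names are ours.

```python
# Values quoted in the text.
n_maps = 3_447_997          # single-molecule restriction maps (>250 kb each)
mean_map_kb = 360           # average map size, kb
genome_gb = 2.92            # estimated goat genome size, Gb

total_gb = n_maps * mean_map_kb / 1e6   # kb -> Gb
coverage = total_gb / genome_gb

print(f"total map size: {total_gb:.0f} Gb")      # about 1,241 Gb
print(f"physical coverage: {coverage:.0f}x")     # about 425x

# The per-run figure quoted earlier (100,000 maps in 3 h) works out the same way.
per_run_coverage = 100_000 * mean_map_kb / 1e6 / genome_gb
print(f"coverage per run: {per_run_coverage:.1f}x")  # about 12x
```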

To assess the quality of super-scaffolds, we used them to map goat expressed sequence tag (EST) sequences from the NCBI database (http://www.ncbi.nlm.nih.gov/nucest) and assembled de novo goat transcriptomes that we obtained from ten tissues (56 Mb in total). Among the 38,006 ESTs that were >300 bp (average length 1,006.5 bp), 99.2% had hits covering ≥96.3% of their length as revealed with BLAT16 (version 34, identity >95%) (Supplementary Methods and Supplementary Table 3).

We also used the core eukaryotic genes mapping approach (CEGMA) pipeline to evaluate the goat assembly17. With it, we mapped 97.58% of the core eukaryotic genes (http://korflab.ucdavis.edu/Datasets/cegma/) from six model organisms (Homo sapiens, Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans, Saccharomyces cerevisiae and Schizosaccharomyces pombe) to the goat super-scaffolds with coverage >70% (Supplementary Table 4). This mapping rate is higher than that obtained for the cattle genome18, supporting the completeness and high quality of the goat super-scaffold assembly.

Anchoring super-scaffolds to the chromosomes

The domestic goat has 29 pairs of autosomes and one pair of sex chromosomes (2n = 60)19. Cytogenetic comparisons indicate a high level of colinearity between goat and cattle chromosomes, and all 30 goat chromosomes have been ordered according to the International System for Chromosome Nomenclature for Bovids20. Based on chromosomal colinearity, we used the two cattle genome assemblies (UMD_3.1 and Btau_4.0) to anchor super-scaffolds to goat chromosomes. Specifically, 302 of the 315 super-scaffolds and 140 other scaffolds that were not included in super-scaffolds were assembled into 30 pseudo-chromosomes for the goat. In total, we anchored 2.52 Gb to the 30 pseudo-chromosomes, and assigned 138 Mb of unordered or unoriented small scaffolds or super-scaffolds to an artificial chromosome designated U. This assembly, which we refer to as CHIR_1.0, is publicly available through a genome browser interface and database (http://goat.kiz.ac.cn/GGD/).

To assess the reliability of the chromosome anchoring, we examined 28 goat genes that have been assigned to a specific chromosomal location21 (Supplementary Table 5). The chromosome assignments of all 28 genes were consistent with our results. As another test of the quality of our pseudo-chromosome assembly, we compared pseudo-chromosome 1 with two radiation hybrid maps of goat chromosome 1 that we generated for a male Boer goat from either 1,222 single-nucleotide polymorphism (SNP) markers on the Illumina BovineSNP50 BeadChip (Fig. 2a and Supplementary Methods) or 1,567 SNP markers on the Illumina OvineSNP50 BeadChip (Fig. 2b and Supplementary Methods). We found few rearrangements within super-scaffolds, which were assembled without using cattle colinearity information. In addition, we found few rearrangements between the pseudo-chromosome 1 assembly and the two radiation hybrid maps. Taken together, these results suggest that the assemblies of super-scaffolds and chromosomes are accurate.

Figure 2: Colinearity between super-scaffolds and the radiation hybrid maps (RHMap) of goat chromosome 1.

(a,b) Maps were generated using BovineSNP50 BeadChip (a) and OvineSNP50 BeadChip (b). Goat GenomeMap is the assembled pseudo-chromosome 1 generated by anchoring super-scaffolds and scaffolds (regions between super-scaffolds) using the published bovine genome (UMD_3.1 and Btau_4.0). Super-scaffolds anchored on the chromosome are indicated beside the GenomeMap. Only a few rearrangements (blue lines between RHMap and GenomeMap) exist within super-scaffolds.

We extended our comparison between goat and cattle to all chromosomes. All autosomes were in strong colinearity (Supplementary Fig. 5a). Because most goat super-scaffolds are long (N50 = 16.3 Mb), if the super-scaffolds were of low quality, we would expect to see many rearrangements between goat and cattle, but this is not the case (for an example of the high colinearity between a goat and a cattle chromosome, see Supplementary Fig. 5b).

Notably, we observed large rearrangements between the X chromosomes of goat and cattle (Supplementary Fig. 6), even though the X-chromosome linkage group is usually conserved in placental mammals21. The same rearrangements were observed on the X chromosomes when comparing them to both cattle genome assemblies (UMD_3.1 and Btau_4.0). Even within a single goat super-scaffold there are large rearrangements (Supplementary Fig. 6c). Because the super-scaffolds were assembled without referring to cattle synteny information, these rearrangements are probably not a result of incorrect assembly based on the optical mapping data, but rather are due to the divergence of the two species. In addition, the sheep reference genome, which was generated by integrating dense physical maps and a large BAC sequence data set (Y. Jiang and International Sheep Genome Consortium, unpublished data), is highly congruent with our goat genome and contains the same rearrangements on the X chromosome. These observations suggest that the large rearrangements between bovine and caprine X chromosomes are real, and support the high quality of our goat genome assembly.

Repetitive sequences and transposable elements

Transposable elements make up a substantial fraction of mammalian genomes and contribute to gene and/or genome evolution22. The goat genome has transposable elements similar to those of cattle22 in that the genome contains large numbers of ruminant-specific repeats, which comprise 42.2% of the goat genome (Fig. 3 and Supplementary Table 6). However, the goat genome has about 47% fewer SINE-BovA repeats (971,273 in goat and 1,839,497 in cattle) and >40% more SINE-tRNA repeats (665,366 in goat and 388,920 in cattle) than does the cattle genome, suggesting that the SINE-BovA repeat expanded primarily in the cattle genome22, whereas the SINE-tRNA repeat expanded specifically in the goat.

Figure 3: Summary of goat chromosome assemblies.

(1) Ideograms of the 30 chromosomes of the goat (in Mb scales). The estimated length of each chromosome is indicated in the outermost ruled circle. Boundaries of anchored super-scaffolds and scaffolds are shown as black lines. (2) Gene density represented as the percentage of the sequence encoding genes for nonoverlapping, 1-Mb windows. (3) Percentage of coverage of repetitive sequences for nonoverlapping, 1-Mb windows. (4) Percentage GC content for nonoverlapping, 1-Mb windows. (5) Transcription state. The transcription level for each gene was estimated by averaging the fragments per kb exon model per million mapped reads (FPKM) from different tissues in nonoverlapping 3-Mb windows.
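
The windowed statistics in tracks 2 to 4 share one pattern: split each chromosome into nonoverlapping windows and summarize each window. A minimal sketch for the GC-content track is given below. It assumes the sequence is held as a string and that gap characters (N) are excluded from the denominator, which is our assumption and is not stated in the text; the function name `gc_percent_windows` is hypothetical.

```python
def gc_percent_windows(sequence, window=1_000_000):
    """Percentage GC content for nonoverlapping windows of a sequence.

    Assumption: only A, C, G and T count as called bases, so gap
    characters (N) do not dilute the percentage.
    """
    seq = sequence.upper()
    percentages = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        called = sum(chunk.count(base) for base in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        percentages.append(100.0 * gc / called if called else float("nan"))
    return percentages


# Toy example with a 4-bp window in place of 1 Mb.
print(gc_percent_windows("GGCCAATTNNGC", window=4))  # [100.0, 0.0, 100.0]
```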

We also analyzed the degree of divergence for each type of transposable element in the goat genome, and found few recently diverged transposable element classes (Supplementary Fig. 7). This may be a result of the difficulty of calling repeats with high similarity from short-read sequencing data23. However, the general distribution patterns of transposable element classes across chromosomes are similar to those of other assembled mammalian genomes24,25 (Supplementary Fig. 8).

Gene and gene family annotation

We used three gene-prediction methods (homology-based annotation, ab initio prediction and RNA-seq-/EST-/cDNA-based annotation) to annotate protein-coding genes. We then merged the results from each method to obtain a consensus gene set of 22,175 protein-coding genes (Fig. 3, Supplementary Tables 7 and 8), with a mean coding sequence length of 1,385 bp and an average of eight exons per gene. The average lengths of exons and introns were 168 bp and 3,955 bp, respectively (Supplementary Table 7). In total, 17,927 annotated protein-coding genes were expressed in at least one of the ten tissues examined by transcriptome sequencing (RNA-seq) (Fig. 3, Supplementary Table 9 and http://goat.kiz.ac.cn/GGD/). Because untranslated regions are difficult to annotate, we used the RNA-seq reads to extend the untranslated regions of 4,740 genes. The gene models derived for goat were highly similar to those of closely related species, supporting the quality of the annotation (Supplementary Fig. 9).

We identified 17,129 orthologous gene pairs between goats and cattle, and 16,771 orthologous gene pairs between goats and humans. A phylogenetic tree constructed from 8,325 single-copy orthologs in goats, cattle, horses, dogs, opossums and humans suggests that goats shared a common ancestor with cattle about 23 million years ago (Fig. 4a). We further compared orthologous gene pairs between goat and cattle based on ratios of nonsynonymous (Ka) and synonymous (Ks) substitution rates to identify 44 rapidly evolving genes under positive selection (Supplementary Table 10), of which seven are immune-system genes and three are pituitary hormone or related genes. The rapid evolution of immune-system genes has also been observed in cattle24. Rapid evolution of pituitary hormones may be related to differences between goats and cattle in milk production, development rates of the fetus and/or hair variation, which are traits associated with pituitary hormones26,27 (Supplementary Discussion).
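
To illustrate what a Ka/Ks ratio measures, the sketch below computes it from counts of substitutions and sites using the Jukes-Cantor correction. This is a generic textbook formulation and an assumption on our part; the estimation method actually used for the goat-cattle comparison is not described in this section. The function names and the example counts are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (Ka, Ks, Ka/Ks) from substitution and site counts for one gene pair."""
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    ratio = ka / ks if ks > 0 else float("inf")
    return ka, ks, ratio


# Hypothetical counts for a single ortholog pair.
ka, ks, ratio = ka_ks(nonsyn_diffs=10, nonsyn_sites=700, syn_diffs=30, syn_sites=300)
print(f"Ka={ka:.4f} Ks={ks:.4f} Ka/Ks={ratio:.3f}")
```

A ratio above 1 is conventionally read as a signature of positive selection, a ratio near 1 as neutrality, and a ratio well below 1 as purifying selection.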

Figure 4: Goat gene family analysis and phylogenetic tree of several mammals.

(a) Phylogenetic tree constructed with the fourfold degenerate sites of 8,325 single-copy genes. Estimates of divergence time and its interval based on sequence identity are indicated at each node. (b) Venn diagram showing the number of unique and shared gene families among nine sequenced mammalian species. (c) Dynamic evolution of orthologous gene clusters. The estimated numbers of orthologous groups (16,998) in the most recent common ancestral species (MRCA) are shown at the root node. The numbers of orthologous groups that expanded or contracted in each lineage are shown on the corresponding branch; +, expansion; −, contraction.

The assembly contained 262 rRNA, 829 tRNA and 1,010 small nuclear RNA genes (Supplementary Table 11). We also identified 487 microRNA (miRNA) genes, of which 157 were located in 44 genomic clusters containing from 2 to 46 miRNA genes (Supplementary Fig. 10a). This distribution pattern is similar to that of cattle. However, there are several miRNA gene clusters specific to goats (Supplementary Fig. 10b). The largest miRNA gene cluster is located on goat chromosome 21 (Supplementary Fig. 11), which is a conserved mammalian miRNA cluster. We used miRNA sequences in other species (human, cattle, dog, chimpanzee, mouse and rat) to identify goat-specific miRNA genes, and found six goat-specific miRNA genes in total (Supplementary Table 12), which have typical miRNA structures (Supplementary Fig. 12) and many target genes (Supplementary Fig. 13).

Based on pair-wise protein sequence similarity, we carried out a gene family clustering analysis on all goat genes compared to genes in cattle, horse, dog, mouse, rat, opossum, chimpanzee and human. The 19,607 goat genes could be clustered into 15,628 gene families (Fig. 4b and Supplementary Table 13). We identified 40 goat-specific gene families that contain 106 genes, 43 of which were expressed in the ten sequenced tissues (Supplementary Table 14). Of the 115 genes found within the 90 ruminant-specific gene families, 68 were expressed (Supplementary Table 15). These lineage-specific gene families may have contributed specifically to the evolution of goats or ruminants.

We also analyzed gene-family expansion or contraction events in the goat, cattle, horse, dog and human. In all five genomes, we found a higher frequency of gene-contraction than gene-expansion events (Fig. 4c), which has been noted previously28. We focused on the most significant expansion or contraction events (P < 0.01), and after manually filtering out gene families whose members had different function assignments (Supplementary Table 16), we detected three expanded olfactory receptor gene subfamilies but only one contracted subfamily in the goat compared with the cattle, horse, dog and human. The olfactory receptor expansion events may contribute to the exceptional foraging ability of goats29. We also noticed expansion of the ferritin heavy chain gene (FTH1) family in the goat, with the number of goat FTH1 genes nearly seven times that in human and twice that in cattle. The expansion of FTH1 in goats may account for their unusual detoxification ability and thus their broad forage diet, as ferritin plays a major role in iron sequestration, detoxification and storage30.

Two contigs containing the sheep major histocompatibility (MHC) loci (also designated as ovine lymphocyte antigen or OLA) generated with BAC-by-BAC sequencing31 were used to search the CHIR_1.0 assembly for goat MHC loci. As expected, the goat MHC loci were located on chromosome 23 in our assembly. Similar to the sheep MHC, the goat MHC contains two regions, 2.25 Mb and 360 kb in length, respectively (Fig. 5a). Based on a comparison to the sheep MHC, which contains 177 genes, we annotated 160 protein-coding genes (Supplementary Table 17) using the same method as for the sheep MHC annotation31. Based on the annotation, we also analyzed conserved genes of MHC loci in sheep, goat and human. Even though there are some inversions, which are common for MHC loci, most of the conserved genes show high colinearity among goat, sheep and human (Fig. 5b). These results not only indicate that our assembly of the goat genome is of good quality, but also provide a detailed map for the goat MHC, which will be useful for immunological studies and vaccine development.

Figure 5: Comparative analysis of the goat MHC.

(a) A map of the goat MHC and colinearity of the goat MHC with the sheep MHC (OLA). Green lines show syntenic relationship between the goat MHC and OLA. Only genes within the goat MHC are marked. Also shown, GC content (%, with nonoverlapping, 1 kb windows). (b) Genes conserved between goat MHC (GLA), sheep MHC (OLA) and human MHC (HLA) are connected by black lines.

Transcriptomes of primary and cashmere hair follicles

Mammalian hair is a highly keratinized tissue produced by hair follicles within the skin. There are two kinds of hair follicles: the primary hair follicle produces the coarse coat hair in all mammals, and the secondary hair follicle produces the cashmere or 'fine hair' in certain mammals, including goats and antelopes32 (Supplementary Fig. 14). Characterized by its fine and soft features, cashmere fiber has been obtained mainly from the cashmere goat. Despite a 2,500-year history and enormous raw cashmere production, estimated at some 10,000 tons per year in China, the world's largest producer of cashmere33, little is known about the molecular mechanisms of cashmere formation and development.

We investigated the genetic basis underlying the development of cashmere fibers by sequencing the transcriptomes of primary and secondary hair follicles and mapping the reads to the goat genome assembly and annotated genes. RNA was extracted from 20–50 secondary or primary hair follicles of an Inner Mongolia cashmere goat, yielding 144–588 ng RNA per sample, and the transcriptomes of three pairs of primary and secondary hair follicle samples (three biological replicates) were directly sequenced without amplification, generating 20.3 Gb of sequence data (Supplementary Table 18). The majority (75%) of the total FPKM (fragments per kilobase of exon per million fragments mapped) values in both hair follicles were from keratin and keratin-associated protein genes (Supplementary Table 19). Across all three paired samples, we identified 10,077 genes in the primary hair follicle samples and 7,772 genes in the secondary hair follicle samples with FPKM > 0.1. Of the 2,572 genes in the primary hair follicle samples and 1,947 genes in the secondary hair follicle samples with FPKM > 5, 51 showed a change in expression of at least twofold between all three pairs of secondary and primary hair follicle samples (Supplementary Table 20), with 28 downregulated and 23 upregulated in secondary versus primary follicles.
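
The expression filter described above can be sketched in two steps: compute FPKM from fragment counts, then keep genes whose fold change is at least twofold in the same direction in all three paired samples. The FPKM formula follows the definition given in the text; the function names and example numbers are hypothetical, and the sketch assumes nonzero FPKM values (as is the case after the FPKM > 5 filter).

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


def consistent_twofold(pairs, min_fold=2.0):
    """Classify a gene from (primary_fpkm, secondary_fpkm) pairs.

    Returns "up" or "down" (secondary versus primary) only if every
    pair changes by at least min_fold in the same direction, else None.
    """
    ratios = [secondary / primary for primary, secondary in pairs]
    if all(r >= min_fold for r in ratios):
        return "up"
    if all(r <= 1 / min_fold for r in ratios):
        return "down"
    return None


# 500 fragments on a gene with 2 kb of exon, 10 million mapped fragments.
print(fpkm(500, 2_000, 10_000_000))                       # 25.0

# Hypothetical FPKM values for one gene in three paired samples.
print(consistent_twofold([(10, 25), (8, 20), (12, 30)]))  # up
print(consistent_twofold([(10, 25), (8, 12), (12, 30)]))  # None
```

Requiring agreement across all three biological replicates is a conservative filter: a gene with a large change in two pairs but a small one in the third is excluded.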

Keratin and keratin-associated proteins are the main structural proteins of hair fibers, determining the quality of fiber. Two types (type I and type II) of keratins are paired to form obligatory heteropolymers34, whereas the keratin-associated proteins may be responsible for forming the rigid hair shaft and altering the hair structure and diameter35. In total, we annotated 49 keratin genes (Supplementary Table 21) and 30 keratin-associated protein genes in the goat genome (Supplementary Table 22), of which 29 keratin genes and all 30 keratin-associated protein genes were detected with FPKM > 5 in both types of follicles (Supplementary Table 23). Notably, two of the 29 keratin genes and 10 of the 30 keratin-associated protein genes were consistently differentially expressed between primary and secondary hair follicles in all three sample sets, and all of these were more highly expressed in secondary than in primary follicles (Supplementary Table 24), suggesting that the keratin-associated protein genes may be more important in determining the structure of cashmere fibers. The two differentially expressed keratin genes (keratin 40 and 72) were type I and type II, respectively (Supplementary Fig. 15). Keratin-associated proteins can be divided into three major groups: high sulfur, ultra-high sulfur and high glycine-tyrosine36. The ten differentially expressed keratin-associated proteins were all in the ultra-high sulfur group (Supplementary Fig. 16), suggesting that this group of proteins may be important for the formation of cashmere.

Other upregulated genes in secondary hair follicles include fibroblast growth factor 21 (GOAT_ENSP00000222157), which can promote the transition to catagen37, and casein kinase Iɛ (goat_GLEAN_10015556), an important regulator of β-catenin in the Wnt pathway, which is one of the most important pathways in hair follicle development38.

The downregulated genes in secondary hair follicles included two enzymes of amino acid biosynthesis, asparagine synthetase (goat_GLEAN_10019946) and phosphoserine aminotransferase (GOAT_ENSP00000388939), which are key enzymes in asparagine and serine biosynthesis, suggesting these amino acids could be more intensively involved in primary hair growth. Other downregulated genes included Gap junction alpha-1 protein (goat_GLEAN_10013034) and Desmoglein 1 (GOAT_ENSBTAP00000018382), which have been reported to be involved in hair follicle cell communication and hair follicle morphogenesis38,39, and isopentenyl-diphosphate delta-isomerase 1 (GOAT_ENSBTAP00000005323) and cellular retinoic acid–binding protein 2 (GOAT_ENSBTAP00000007515), which are both related to retinoic acid biosynthesis and can regulate hair growth and the hair life cycle through Wnt signaling40,41. Further analyses of these expression data generated additional hypotheses related to genes and pathways that may underlie cashmere fiber production (Supplementary Discussion, Supplementary Fig. 17 and Supplementary Tables 25–29).

Discussion

The goat genome is, to our knowledge, the first large genome to be sequenced and assembled de novo using whole-genome mapping technology, demonstrating that this approach can be used to obtain a highly contiguous assembly for a large genome without the aid of traditional genetic maps. The long super-scaffolds provide sufficient linkage-group information for gene mapping and marker-assisted breeding, and they are long enough to be anchored onto chromosomes using rough colinearity information of other closely related mammals whose complete genomes are available. We plan to update the goat genome assembly as radiation hybrid maps for all chromosomes become available.

The goat genome sequence will be useful for mapping reads obtained by resequencing more breeds of goats, which will facilitate the identification of SNP markers for genomic marker–assisted breeding. To our knowledge, the goat is the first small ruminant whose genome has been sequenced. The goat genome should be useful for understanding the genomic features that distinguish ruminants from nonruminant species. It will also be useful for improving the utility of the goat as a biomedical model and bioreactor. In addition, the genes we identified that are related to cashmere fiber production could be used as markers for breeding better cashmere goats, or they may be potential targets for genetic or nongenetic manipulation.

Methods

DNA/RNA isolation, library construction and sequencing.

Genomic DNA was isolated from liver tissue of a female Yunnan black goat by standard molecular biology techniques. DNAs were sheared to fragments of 180–800 bp, 2 kb, 5 kb, 10 kb and 20 kb to generate the PE libraries (see Supplementary Methods for details). All these DNA libraries were sequenced on the Illumina Genome Analyzer II platform.

High-quality DNA extracted from liver tissue of the female Yunnan black goat was used for the fosmid library construction (see Supplementary Methods for details). Fosmid end sequencing was done in this order: fragmentation and end repair, size selection and purification, circularization of digested linear DNA, inverse PCR and enrichment of 400–700 bp DNA fragments (see Supplementary Methods for details). Illumina sequencing for short insert libraries was then done.

RNA was purified using TRIzol (Invitrogen). RNA sequencing libraries were constructed using the mRNA-Seq Prep Kit (Illumina, USA). We sequenced 200 bp paired-end libraries of RNA-seq using the paired-end sequencing module (90 bp at each end) of the Illumina HiSeq 2000 platform (see Supplementary Methods for details).

Scaffold construction.

Goat genome scaffolds were constructed using SOAPdenovo software (Release 1.03, http://soap.genomics.org.cn/, parameter “-K 41 -d 1 -M 2 -F,” see Supplementary Methods for details). The end sequences from the fosmid library were used to extend the scaffolds using the procedure described in Supplementary Methods.

Whole-genome mapping (WGM) and construction of super-scaffolds.

We used the new WGM technology of OpGen, comprising the Argus System and the WGM software package (Genome-Builder), to produce large volumes of optical mapping data, process these data and extend the scaffolds fully automatically. The system integrates wet lab chemistry, including digestion and staining, into an automated process using the MapCard and MapCard Processor, followed by automatic collection of over 7,000 fluorescence microscope images per MapCard by the Argus Mapper instrument. A version of Genome-Builder suited to small data sets is embedded in the Argus System, but for large data sets Genome-Builder needs to be installed on a computer server.

High molecular weight DNA from a fibroblast cell line derived from skin tissue of the female Yunnan black goat was passed through a channel-forming device (CFD) to direct and stretch the single DNA molecules onto a positively charged glass surface in the MapCard, which had separate chambers for all reagents to be pre-loaded (see Supplementary Methods for details). DNA was elongated and immobilized on the surface after it flowed down through the microchannels of the CFD. Fixing the DNA to the surface prevented recoiling, ensuring optimal orientation of the DNA for image capture by the CCD camera. The immobilized single molecules of DNA were digested with SpeI for 10 min at 37 °C and subsequently stained with JOJO-1 (Life Technologies) on the MapCard Processor (MCP) (Supplementary Fig. 4a). The MCP automates the sequential restriction enzyme digestion and staining steps.

Individual DNA molecules and corresponding restriction fragments were imaged by laser-illuminated fluorescent microscopy using the Argus Mapper (see Supplementary Methods for details). The restriction enzyme cut sites were detected as gaps in DNA images, and the size of each restriction fragment between adjacent cut sites was determined (Supplementary Fig. 4c). The Mapper analyzes the images channel by channel, filters out nonlinearly distorted fragments and small molecules, identifies gaps between fragments and measures the size of the retained high-quality fragments. For this project, 3,447,997 single-molecule restriction maps (>250 kb) with an average size of 360 kb were generated. The total size of the single-molecule restriction map data was about 1,241 Gb.
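The quoted total follows directly from the map count and the average map size. The short check below reproduces that arithmetic; the variable names are illustrative and are not taken from the Argus or Genome-Builder software.

```python
# Figures quoted in the text above.
n_maps = 3_447_997      # single-molecule restriction maps longer than 250 kb
mean_size_kb = 360      # average map size, in kb

# Total data volume: number of maps times average size.
total_kb = n_maps * mean_size_kb
total_gb = total_kb / 1_000_000   # 1 Gb = 1,000,000 kb

print(f"total map data: {total_gb:,.0f} Gb")
```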

Super-scaffolding with WGM data.

Super-scaffolding with WGM data was performed using Genome-Builder software recently developed at OpGen. This software suite takes a hybrid approach to perform long-range scaffolding of de novo sequence assembly. Briefly, it uses single-molecule maps generated in Argus to extend sequence scaffolds, create overlapping regions between adjacent scaffolds and connect the scaffolds based on pair-wise alignments between them. The input sequence scaffolds were based on the de novo assembly and were first converted into restriction maps by in silico restriction enzyme digestion. The resulting in silico maps were used as initial seed maps for an iterative extension process. The details of the algorithm are further described in Supplementary Figure 4 and Supplementary Methods.
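The in silico digestion step can be illustrated with a minimal sketch. It assumes a linear sequence and the SpeI recognition site ACTAGT, cut after the first base (A^CTAGT). The function name and the toy sequence are hypothetical, and the code shows only the conversion of a sequence into an ordered list of fragment sizes, not the Genome-Builder algorithm itself.

```python
SPEI_SITE = "ACTAGT"    # SpeI recognition sequence
SPEI_CUT_OFFSET = 1     # SpeI cuts after the first base: A^CTAGT


def in_silico_digest(seq, site=SPEI_SITE, cut_offset=SPEI_CUT_OFFSET):
    """Return the ordered fragment lengths (bp) of a linear sequence
    after complete digestion at every occurrence of `site`."""
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)   # coordinate of the cut itself
        pos = seq.find(site, pos + 1)
    # Fragment boundaries: sequence start, each cut, sequence end.
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Toy scaffold (hypothetical) containing two SpeI sites.
toy_scaffold = "GGGG" + "ACTAGT" + "CCCCCCCC" + "ACTAGT" + "TT"
fragments = in_silico_digest(toy_scaffold)
print(fragments)
```

The ordered fragment sizes form the restriction map of the scaffold, which is the representation that can be aligned against single-molecule maps.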

The super-scaffolds were evaluated using CEGMA17, ESTs downloaded from NCBI (13,849 records) and ESTs assembled de novo from RNA-seq reads of ten tissues (99,707 records, see Supplementary Methods for details).

Pseudo-chromosomes assembly and evaluation.

A set of 108,850 source sequences for SNP probes from the OvineSNP50 BeadChip and BovineSNP50 BeadChip were compared for similarity to the goat super-scaffolds/scaffolds and the cattle genome (UMD 3.1) with BLASTN, to place the super-scaffolds, and other scaffolds left unmapped by whole-genome mapping, into the 30 pseudo-chromosomes (see Supplementary Methods for details). The WGM data were then used to double-check the order of anchored super-scaffolds or scaffolds. Dislocations confirmed by our experimental WGM data were candidates for true alignment differences caused by noncolinearity between goat and cattle. Files of scaffolds, super-scaffolds and their corresponding .agp instruction files are available at http://goat.kiz.ac.cn/GGD/.
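The anchoring idea can be sketched as follows, assuming that each probe has already been placed on both a goat scaffold and a cattle chromosome by BLASTN. The majority-vote assignment and the median-position ordering are illustrative assumptions, as are the function name and the toy data; the procedure actually used is described in Supplementary Methods.

```python
from collections import Counter, defaultdict
from statistics import median


def anchor_scaffolds(hits):
    """hits: iterable of (scaffold_id, cattle_chr, cattle_pos), one per probe
    placed on both genomes. Returns {cattle_chr: [scaffold_id, ...]} with
    scaffolds ordered by the median cattle position of their probes."""
    by_scaffold = defaultdict(list)
    for scaffold, chrom, pos in hits:
        by_scaffold[scaffold].append((chrom, pos))

    layout = defaultdict(list)
    for scaffold, placed in by_scaffold.items():
        # Assumption: assign the scaffold to the chromosome hit by most probes.
        chrom = Counter(c for c, _ in placed).most_common(1)[0][0]
        # Assumption: order scaffolds by the median position on that chromosome.
        mid = median(p for c, p in placed if c == chrom)
        layout[chrom].append((mid, scaffold))

    return {chrom: [s for _, s in sorted(items)]
            for chrom, items in layout.items()}


# Toy probe placements (hypothetical).
toy_hits = [
    ("s2", "1", 500), ("s2", "1", 600),
    ("s1", "1", 100), ("s1", "2", 50), ("s1", "1", 200),
    ("s3", "2", 10),
]
layout = anchor_scaffolds(toy_hits)
print(layout)
```

An order derived this way inherits any rearrangement between the two species, which is why an independent check against the WGM data is useful.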

To evaluate the chromosomal assembly of the goat genome, we compared pseudo-chromosome 1 with two radiation hybrid maps of goat chromosome 1 that we generated for a male Boer goat. We genotyped 1,222 SNP markers on the Illumina BovineSNP50 BeadChip and 1,567 SNP markers on the Illumina OvineSNP50 BeadChip that could be represented in the goat genome across a 5,000 rad goat-hamster radiation hybrid panel that contained 93 cell lines of a Boer goat. We also conducted synteny-based chromosomal comparisons for all chromosomes between the goat CHIR_1.0 assembly and both the bovine Btau_4.0 and UMD3.1 assemblies (see Supplementary Methods for details).

Genome annotations and analyses.

Genome annotations include the annotation of repeat elements (transposable elements), protein-coding genes, non-protein-coding genes and gene families. Based on the genome annotations, genome analyses focused on genes under positive selection and gene family evolution. The methods are summarized here, and details are fully described in Supplementary Methods.

Tandem repeats in the genome assembly were identified using Tandem Repeat Finder42. Noninterspersed repeats in the genome were detected using RepeatMasker43. Transposable elements in the genome assembly were identified at the DNA and protein levels. At the DNA level, RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html) and LTR_FINDER44 software were used to build de novo repeat libraries. RepeatMasker (version 3.2.9) was run against the de novo library and Repbase45 separately to identify homologous repeats, which were classified into known classes of repeats46. At the protein level, RM-BLASTX within RepeatProteinMask in the RepeatMasker software package was used against the transposable element protein database.

To predict protein-coding genes, information was integrated from three different methods: ab initio prediction, homology-based annotation and RNA-seq-/EST-/cDNA-based annotation. RNA-seq data were used to extend the gene sequences, especially 5′ UTRs where high GC content usually hinders Illumina sequencing.

InterProScan47 (version 4.5) was used to screen goat proteins against five databases (Pfam48, release 24.0; PRINTS49, release 40.0; PROSITE50, release 20.52; ProDom51, 2006.1; SMART, release 6.0). The KEGG52,53 (Release 58), UniProt/SwissProt54 (Release 2011.6) and UniProt/TrEMBL55 (Release 2011.6) databases were searched for homology-based function assignments (GO assignments).

The tRNAscan-SE56 (version 1.23) software with default parameters for eukaryotes was used for tRNA annotation. rRNA annotation was based on homology information of human rRNA collections using BLASTN (version 2.2.21). The miRNA and small nuclear RNA genes were predicted by INFERNAL57 software against the Rfam database58.

The Treefam59 methodology was used to define a gene family as a group of genes descending from a single gene in the common ancestor of goat, cattle, horse, dog, mouse, rat, opossum, chimpanzee and human (Ensembl release 56).

Single-copy genes defined as orthologous genes by Treefam pipelines were chosen for phylogenetic analysis with MrBayes software60. The Bayesian Relaxed Molecular Clock (BRMC) approach was used to estimate the species divergence time using the program MCMCTREE (version 4), which is part of the PAML package61. The divergence time of human and dog from the TimeTree database (http://www.timetree.org/) was used as the calibration time.

CAFE (computational analysis of gene family evolution, version 2.1)62 was used to detect gene family expansion and contraction in human, dog, horse, cattle and goat. Ka/Ks ratios were calculated for 14,906 orthologous pairs among goat, cattle and human using KaKs_Calculator63 software, and positive selection was further tested in these gene pairs.

Comparison between primary hair follicle and secondary hair follicle transcriptomes.

After filtering out low-quality, contaminated and PCR-artifact reads, reads from RNA-seq data of primary hair follicle (PHF) and secondary hair follicle (SHF) paired samples were mapped against the goat assembly using Tophat64. The FPKM value was calculated for each protein-coding gene by Cuffdiff (http://cufflinks.cbcb.umd.edu). The significance level (P-value) of differentially expressed genes between two samples was calculated with Cuffdiff using default parameters. FPKM > 5 was used as the stringent cutoff to identify expressed genes. Differentially expressed genes are those with at least a twofold FPKM change between PHF and SHF samples in all three PHF/SHF comparisons.
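The fold-change criterion can be expressed as a small filter over per-gene FPKM values. The sketch below makes two assumptions that the text does not state: the change must be in the same direction in all three comparisons, and the FPKM > 5 cutoff is applied to the higher of the two samples in each pair. The function names and example values are hypothetical, and the Cuffdiff P-value is not modeled.

```python
import math


def log2_fold_change(phf, shf):
    """Signed log2(SHF/PHF); infinite when exactly one value is zero."""
    if phf == 0 and shf == 0:
        return 0.0
    if phf == 0:
        return math.inf
    if shf == 0:
        return -math.inf
    return math.log2(shf / phf)


def is_differential(pairs, min_fold=2.0, min_fpkm=5.0):
    """pairs: list of (PHF FPKM, SHF FPKM), one per PHF/SHF comparison."""
    if not pairs:
        return False
    threshold = math.log2(min_fold)
    changes = []
    for phf, shf in pairs:
        # Assumption: the gene must pass the expression cutoff in at least
        # one sample of every pair.
        if max(phf, shf) <= min_fpkm:
            return False
        changes.append(log2_fold_change(phf, shf))
    # Assumption: the change must have the same direction in every pair.
    higher_in_shf = all(c >= threshold for c in changes)
    higher_in_phf = all(c <= -threshold for c in changes)
    return higher_in_shf or higher_in_phf


# Hypothetical FPKM values for one gene in three PHF/SHF pairs.
print(is_differential([(10, 25), (8, 20), (6, 30)]))   # at least twofold in all pairs
print(is_differential([(10, 25), (8, 12), (6, 30)]))   # second pair is only 1.5-fold
```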

Accession code.

The goat whole-genome shotgun project, DDBJ/EMBL/GenBank: AJPT00000000. DNA sequencing short reads, SRA: SRA051557. RNA sequencing short reads, GEO: GSE37456. Because there is no data bank for storing optical mapping data yet, the whole-genome mapping data of goat can be obtained from http://goat.kiz.ac.cn/GGD/. The radiation hybrid map data of chromosome 1 are also available from http://goat.kiz.ac.cn/GGD/.