Recovery of mitogenomes from whole genome sequences to infer maternal diversity in 1883 modern taurine and indicine cattle

Maternal diversity based on a sub-region of mitochondrial genome or variants were commonly used to understand past demographic events in livestock. Additionally, there is growing evidence of direct association of mitochondrial genetic variants with a range of phenotypes. Therefore, this study used complete bovine mitogenomes from a large sequence database to explore the full spectrum of maternal diversity. Mitogenome diversity was evaluated among 1883 animals representing 156 globally important cattle breeds. Overall, the mitogenomes were diverse: presenting 11 major haplogroups, expanding to 1309 unique haplotypes, with nucleotide diversity 0.011 and haplotype diversity 0.999. A small proportion of African taurine (3.5%) and indicine (1.3%) haplogroups were found among the European taurine breeds and composites. The haplogrouping was largely consistent with the population structure derived from alternate clustering methods (e.g. PCA and hierarchical clustering). Further, we present evidence confirming a new indicine subgroup (I1a, 64 animals) mainly consisting of breeds originating from China and characterised by two private mutations within the I1 haplogroup. The total genetic variation was attributed mainly to within-breed variance (96.9%). The accuracy of the imputation of missing genotypes was high (99.8%) except for the relatively rare heteroplasmic genotypes, suggesting the potential for trait association studies within a breed.


Run 8 1000 bulls GATK fastq to GVCF guidelines (GATKv3.8)
Version: 17/10/2019 These specifications describe the software and steps to process fastq files into bam and GVCF files for the 1000 Bull Genomes Project.

Data accepted
Data accepted was only from ILLUMINA sequencers, not from PacBio or Oxford Nanopore instruments at this time.

Starting from raw fastq versus bam file extracted fastq
We recommend that partners process their sequence starting from raw fastq format. If you do extract reads from a bam file you can do so with Picard SamToFastq tool (https://broadinstitute.github.io/picard/command-line-overview.html#SamToFastq ), though other tools are available. This tool can output reads based on read groups, so if read groups are specified correctly (see https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups for definitions) then this will re-create the original fastq files. Please make sure the original per base quality score (OQ) is associated with reads, Picards RevertSam tool (https://software.broadinstitute.org/gatk/documentation/tooldocs/4.beta.6/picard_sam_RevertSa m.php ) can revert previously recalibrated qscores. If you don't know the quality control that was used on the read data in the bam file and you have raw fastq files available, please start from fastq as per our guidelines in this document.

Trim and filter fastq
Trim paired reads of adapter, low quality bases (qscore <20) at the beginning and end, then filter out reads with mean qscore less than 20 or length less than 35bp. We recommend Trimmomatic because it is well documented and actively maintained. However there are many programs capable of performing this task, ie qualityTrim ( Alternatively, you can use seqtk (https://github.com/lh3/seqtk) to convert qscores.
NOTE: If fastq files contain reads that fail Illumina chastity these should also be removed.
NOTE: Should you have Illumina two color chemistry e.g. NovaSeq or NextSeq data you should also trim strings of G from the end of reads, these strings have normal qscores and so most trimming scripts will not trim them, they are however artefacts of the sequencing chemistry. Such sequences may be flagged by FastQC as over represented sequences or kmers. See https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-gbases/ We highly recommend checking raw and filtered sequence reads with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). NOTE: the -summary ${Fastq}.summary option is not documented in the V0.32 pdf as it was added in V0.38. At the end of processing, the program prints to screen the results of the trimming. Thesummary {file_name} option will also print these results to {file_name}. An example output is below. These data can be useful for evaluating the overall quality and levels of readthrough.

Reference Genome
ARS-UCD1.2_Btau5.0.1Y is the reference genome to be used in this project. This reference has the Btau5.0.1 Y chromosome assembly from Baylor College (Bellott, Hughes et al. 2014) added to ARS-UCD1.2 (Rosen, Bickhart et al. 2018). It can be downloaded from the 1000 bull genomes website and https://sites.ualberta.ca/~stothard/1000_bull_genomes/ There are several files including the .fa.gz file (assembly) and checksums to ensure the download has not altered the files. This exact copy of the reference genome must be used to ensure your bam and GVCF files are compatible with the 1000 Bull Genomes Project pipeline. Non-conforming files will be excluded from the run.

Map fastq
Map trimmed reads (pairs and singles that pass above QC) to the reference using bwa mem (https://github.com/lh3/bwa) specifying read groups (see https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups for definitions) with the following options where RGID is the sequencer lane (this is often within the fastq file name and is important for the base quality score recalibration steps later on), RGPL is the sequencing platform (ILLUMINA, SOLID or 454), RGLB is the library name and RGSM must be the international ID of the animal. Other read group tags can be populated but RGID, RGPL and RGSM are required. If your animal does not have an international ID, you should create one that conforms to Interbull standards, ie 3 character breed code + 3 character country code + sex code (M or F) + 12 character animal ID, eg HOLCANM000000352790. See http://www.interbull.org/ib/icarbreedcodes To perform the subsequent steps you will need to use samtools sort to sort your bam and samtools index to index your sorted bam file (http://www.htslib.org/doc/samtools.html). Using the correct reference will ensure the bam files are sorted correctly, i.e. 1, 2, 3, …, 29, X, Y, MT, other contigs.
Where multiple bam files are generated for an individual you should use Picard MergeSamFiles (https://broadinstitute.github.io/picard/command-line-overview.html#MergeSamFiles) to merge them. Please note that samtools merge is not appropriate for individuals with multiple libraries as it doesn't handle the read groups properly, so Picard MergeSamFiles is our chosen tool for this task. The correct handling of libraries is important for downstream base quality score recalibration in GATK which is read group aware.

Base Quality Recalibration
Base quality recalibration should be performed according to GATK best practises guidelines (https://software.broadinstitute.org/gatk/guide/article?id=44). This task uses the GATK BaseRecalibrator (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php ) and PrintReads (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_readutils_PrintReads.php ) tools. Briefly, first GATK BaseRecalibrator is run to build a model of covariation based on the data and a set of known variants, to produce a recalibration table. Secondly, GATK PrintReads is run to adjust the base quality scores in the data based on the recalibration table, this produces a recalibrated bam. An optional third step runs GATK BaseRecalibrator on the recalibrated bam producing an "after recalibration"

Create GVCF file
Appendix C explains a known issue with HaplotypeCaller and threading for the unmapped contigs.
Create a GVCF file using GATK HaplotypeCaller (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php ). Follow the described "Single-sample GVCF calling on DNAseq (for `-ERC GVCF` cohort analysis workflow)" with the following options Where INTERNATIONALID is the international ID of the animal, NOTE: this INTERNATIONALID must match the international ID in the RGSM field in the read groups, which were added in the mapping steps above.
It is essential that GVCF files are gzipped and indexed (.tbi file). GATK will do both of these if you specify as above.

Calculate read coverage
It is important to know the coverage for a few reasons, one being to ensure compliance with the coverage requirements. Calculating coverage from the raw read numbers and length has been shown to be highly inaccurate of final coverage. Calculate the average read coverage using GATK DepthOfCoverage tool (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_coverage_DepthOfCoverage.php ). Include coverage statistic in the checklist (described below).