GDC 2: Compression of large collections of genomes

The fall of prices of the high-throughput genome sequencing changes the landscape of modern genomics. A number of large scale projects aimed at sequencing many human genomes are in progress. Genome sequencing also becomes an important aid in the personalized medicine. One of the significant side effects of this change is a necessity of storage and transfer of huge amounts of genomic data. In this paper we deal with the problem of compression of large collections of complete genomic sequences. We propose an algorithm that is able to compress the collection of 1092 human diploid genomes about 9,500 times. This result is about 4 times better than what is offered by the other existing compressors. Moreover, our algorithm is very fast as it processes the data with speed 200 MB/s on a modern workstation. In a consequence the proposed algorithm allows storing the complete genomic collections at low cost, e.g., the examined collection of 1092 human genomes needs only about 700 MB when compressed, what can be compared to about 6.7 TB of uncompressed FASTA files. The source code is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=gdc&subpage=about.


1000 Genomes Project
The 1000 Genomes Project (1000GP), "A Deep Catalog of Human Genetic Variation", describes the whole-genome sequence variation in human individuals. In out research we used data from Phase 1 of the 1000GP containing information about genomes of 1092 people, 525 males and 567 females. As the provided genotypes are phased, there are 2184 sequences in total (1092 diploid genomes).
The FASTA sequences were retrieved based on the VCF (Variant Call Format) files, which can be found at the 1000GP anonymous FTP servers. All used VCF files can be downloaded from the directory containing the final variant calls for the phase 1 datasets. It can be either the EBI or NCBI FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ ftp://ftp.ncbi.nih.gov//1000genomes/ftp/phase1/analysis_results/integrated_call_sets/ For all individuals, variant calls for chromosomes 1-22 (autosomes) ale diploid and genotypes are phased. Therefore, the information about variants found on both chromosomes of chromosome pairs 1-22 is available for all individuals and it is possible to obtain 2184 sequences (2 for each individual). The data contain also information about pairs of chromosomes X for females (denoted by X-fem in the presented results). The situation is, however, more complex for male individuals due to two pseudoautosomal regions (PAR1 and PAR2) that are common between X and Y chromosomes. These pseudoautosomal regions are treated as autosomes in the description of the results of the 1000GP. Precisely, the diploid genotype calls for variants in two pseudoautosomal regions of X and Y chromosomes are stored in the available VCF file for X chromosome, while haploid genotype calls for non-pseudoautosomal regions (nonPAR, between PAR1 and PAR2) of X and Y chromosomes are stored in the corresponding VCF files (nonPAR of X in the VCF file for X chromosome, nonPAR of Y in the VCF file for Y chromosome).
Therefore, processing of X and Y chromosomes is divided into processing of five separate groups (which are treated as separate chromosomes): • X-fem-whole chromosome X of females, with diploid genotype calls (2 sequences per individual), • X-mal-nonPAR region of chromosome X of males, with haploid genotype calls (1 sequence per individual), • Y-mal-nonPAR region of chromosome Y of males, with haploid genotype calls (1 sequence per individual), • XY-mal1-PAR1 region of X and Y chromosomes of males, with diploid genotype calls (2 sequences per individual), • XY-mal2-PAR2 region of X and Y chromosomes of males, with diploid genotype calls (2 sequences per individual). Note that the above split is only according to the phasing status of the PAR regions in the 1000GP data. We had to use such split as we needed genome sequences to examine the existing compressors.
The scripts to generate FASTA sequences using VCF files from the Phase 1 of the 1000GP are available together with the distribution of GDC 2 (see Section 2.2).

The A.thaliana dataset
The A.thaliana dataset contains 775 strains from the 1001 Genomes Project. They are compressed against single reference sequence.

1001 Genomes Project
The 1001 Genomes Project (1001GP),"A Catalog of A.thaliana Genetic Variation", contains the whole-genome sequence variation in strains accessions of the plant A.thaliana.
The 775 FASTA sequences were retrieved based on the available information about variants found in related individuals. Described variants were found on the 5 Arabidopsis chromosomes (chr1, chr2, chr3, chr4, chr5) and on Chloroplast and Mitochondria chromosome (chrC, chrM). As most of the original data is still in waiting period status, because of legal regulations it cannot be redistributed or repackaged. Thus we are not allowed to make the resultant FASTA sequences (or scripts to produce them) publicly available.
Variants used: -all 1-, 2-and 3-symbol insertions (accessible in insertion.txt file for each strain, no quality data available), -filtered SNPs and 1-symbol deletions (accessible in filtered_variant.txt file for each strain, variants with quality greater or equal to 25 were processed).

The H.sapiens alternate assemblies dataset
The H.sapiens alternate assemblies dataset contains 11 different assemblies of the human genome. They are compressed against single reference sequence (other than other genomes in the set).

Reference sequence
The patch release 13 of Genome Reference Consortium Human Build 37 (GRCh37.p13, also referred to as hg19, released in 2013) was used as a reference sequence. It can be found at the GenBank and downloaded from the NCBI's anonymous FTP server:

Assembled genomes
The 11 FASTA sequences are different assemblies of the H.sapiens genome. The following genome assemblies were used: • hg17 (also referred to as NCBI35.

GDC package description
Genome Differential Compressor 2 (GDC 2), can be download in a gdc2.tar.gz package, which is is publicly available under a free license at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects& project=gdc&subpage=about. It includes all programs and scripts presented in this section.
The following commands should be used to decompress the package and go to the main package directory: tar -xzf gdc2.tar.gz cd gdc2 The main package directory contains two folders: 1. gdc_2, containing main GDC 2 program (see Section 2.1), 2. vcf2fasta, containing scripts to retrieve FASTA sequences from Phase 1 of the 1000GP (see Section 2.2).

GDC 2
The GDC 2 program is in the gdc_2 folder. It was implemented in C++11 language. Apart from proper compiler (e.g. gcc) and basic libraries, the compilation requires two specific libraries: • asmlib, "A multi-platform library of highly optimized functions for C and C++" 1 .
For linux and mac operating systems, the listed libraries are provided in the libs directory. However, the local versions may be used as well. The source codes and makefiles are placed in Gdc2 folder. Two makefiles, makefile.linux and makefile.mac, can be used to build a linux or a mac os application, respectively. The source of the appropriate libraries can be changed there.
Instruction to build GDC 2 on linux: make -f makefile.linux Instruction to build GDC 2 on mac os: make -f makefile.mac Created gdc2 executable is the GDC 2 program.

Scripts to retrieve FASTA sequences from the 1000GP
The vcf2fasta folder of the gdc2 package contains scripts and tools to download Phase 1 1000GP data and generate FASTA sequences for all chromosomes of each individual in the set. The generated sequences are identical to the sequences used in the experiments described in the main paper. As explained in Section 1.1.2, because of the two pseudoautosomal regions (PAR1 and PAR2) that are shared between X and Y chromosomes in male individuals and method of storing information about variants in 1000GP VCF files, instead of processing chromosomes X and Y, we generate (and process) five groups of sequences, each treated as separate chromosome (X-fem, X-mal, Y-mal, XY-mal1, XY-mal2).
The minimal steps to retrieve FASTA sequences (for chromosomes specified by CHROM variable, which is set in config.ini file) in a unix environment with gcc compiler and basic utilities available, are: cd src make cd .. ./run -abc The whole processing requires about 9 TB of disk space (most of which is swallowed by 6.7 TB of consensus data and 1.2 TB of original VCF data). It should also be noted that the whole processing for all sequences may take a lot of time (several days).
Two other variable in config.ini, $FTP, specifies the ftp site for download (EBI by default).
The available run switches:

• -h
Presentation of possible switches of the run script. All options are briefly described.
• -a Download (wget utility) and decompression (gzip utility) of the reference sequences and VCF files. The VCF files are placed in vcf folder. References of chromosomes 1-22 are placed in the separate folder, dedicated to the specific chromosome (chr1, chr2, chr3, ...). References of chromosomes X and Y are placed in the main directory, as they require preprocessing (see option -b) before the main treatment. It should be noticed that the download of the whole genome requires about 1.2 TB of disk space.

• -b
Preprocessing of the reference sequences and VCF files for X and Y chromosomes (which should be available in advance, see option -a). The reference FASTA sequences for PAR1, PAR2 and nonPAR regions are created with cut-ref (chrX-mal.fa, chrXY-mal1.fa, chrXY-mal2.fa, chrY-mal.fa). The processX program is used to create four VCF files, each describing variants corresponding to one of the X chromosome groups (X-fem, X-mal, X-mal1, X-mal2). The cut utility is used to remove data for the NA21313 (additional male individual, which does not occur in other VCF files) from the VCF file for Y chromosome.
• -c Creation of FASTA consensus sequences (extension *.fa) for all individuals from the reference sequence and VCF file (with use of the VCF2FASTA-d and/or VCF2FASTA-h programs). This is a very time consuming process and consensus sequences for all chromosomes require about 6.7 TB of disk space. The sequences are placed in the folders dedicated to the specific chromosome (chr1, chr2, chr3, ...).
Detailed description of all programs used by run script, together with their command line specifications: • cut-ref The program creates a new FASTA file by cutting a piece of the input FASTA sequence and adding range information to the header. It is used to produce reference sequences for nonPAR, PAR1 and PAR2 regions of chromosomes X and Y, for male individuals. Usage: cut-ref <input_name> <output_name> <start_pos> <end_pos> input_name -name of the input file output_name -name of the output file start_pos -start position of the cut sequence end_pos -end position of the cut sequence

• processX
The program processes the VCF file for X chromosome to create four VCF files: one for female individuals and three for male individuals. Each VCF file for male individuals corresponds to one of the region of interest of the chromosome X (PAR1, PAR2, nonPAR). Usage: processX <vcf> vcf -name of the VCF file for X chromosome • VCF2FASTA-h and VCF2FASTA-d The programs process the input reference sequence of a chromosome and corresponding VCF file with haploid (VCF2FASTA-h) or diploid (VCF2FASTA-d) genotype calls to create one (VCF2FASTA-h) or two (VCF2FASTA-d) FASTA consensus sequence(s) for each individual described in the VCF file. Names of the output consensus sequences are made by adding to the name of the VCF file: individual's ID, chromosome indicator in case for autosomal chromosomes ('1' or '2', VCF2FASTA-d only) and the ".fa" extension. All variants are taken into account, regardless of the value of the FILTER field (only a warning is outputted if it's different than "PASS" or "."). Usage: VCF2FASTA-h <vcf> <ref> [start_pos] VCF2FASTA-d <vcf> <ref> [start_pos] vcf -name of the VCF file ref -name of the reference sequence start_pos -start position of the reference sequence (optional, 1 by default)

Parameters of programs used in the experiments
The parameters used for tools compressing FASTQ files: • 7z a -md1024m -all EOL characters from input data were removed prior to compression • RLZ files -l 1 • GReEn • ABRC files ref file 1 1 • GDC -ma1000000000 -rn1 -denoted as GDC-normal • GDC -ma1000000000 -rn40 -denoted as GDC-ultra Supplementary Figure S1: Influence of the percent of 2nd level references on compression ratio, decompression (access) time of a single sequence, compression speed, decompression speed and RAM usage.