Genome-scale DNA variant analysis and functional validation of a SNP underlying yellow fruit color in wild strawberry

Fragaria vesca is a species of diploid strawberry being developed as a model for the octoploid garden strawberry. This work sequenced and compared the genomes of three F. vesca accessions: ‘Hawaii 4’, ‘Rügen’, and ‘Yellow Wonder’. Genome-scale analyses of shared and distinct SNPs among these three accessions revealed that ‘Rügen’ and ‘Yellow Wonder’ are more similar to each other than they are to ‘Hawaii 4’. Though all three accessions were inbred for seven generations, each accession still possesses extensive heterozygosity, highlighting the inherent differences between individual plants even of the same accession. The identification of the impact of each SNP, as well as the large number of indel markers, provides a foundation for locating candidate mutations underlying phenotypic variations among these F. vesca accessions and for mapping new mutations generated through forward genetics screens. Through systematic analysis of SNP variants affecting genes in anthocyanin biosynthesis and regulation, a candidate SNP in FveMYB10 was identified and then functionally confirmed to be responsible for the yellow fruit color of many F. vesca accessions. As a whole, this study provides further resources for F. vesca and establishes a foundation for linking traits of economic importance to specific genes and variants.

Bin size is 500 kb; the Y-axis is the number of total variants or variety-unique variants.

Figure S4. Enriched GO terms among genes (test set) affected by high-impact variants in all three accessions. The reference set was derived from Phytozome, representing the percentage of genes belonging to each GO category.

Supplementary Methods
The following is a listing of the various scripts written to perform the data analysis presented in this paper. These scripts were run on Mac OS X for this paper and all should work on most Linux systems without modification provided the necessary tools are installed (Awk, Perl, Zsh, and R). Running them on Windows systems may require modification or the installation of a Unix environment.

VCF Filtering Scripts
The scripts in this category take .vcf files and extract only the lines matching the script's criteria. Most of them assume that exactly three cultivars are being analyzed and may not work for a different number of samples. They all use awk. Arguments are generally given to awk scripts using -v variable=value. All of them work with .eff.vcf files from snpEff. Most of these scripts take one or two sample numbers (from 0 to 2) to indicate which cultivar(s) they should act on. In our .vcf files the cultivars are numbered as follows:
0 = Hawaii 4
1 = Rügen
2 = Yellow Wonder

uniqueVariants.awk
This script extracts .vcf lines denoting loci where the given cultivar has no allele that is also found in either of the other two varieties. The variety to analyze may be specified by setting the "sample" variable.
For example, a command such as ./uniqueVariants.awk -v sample=0 infile.vcf > outfile.vcf will take the lines from infile.vcf where there is an allele unique to Hawaii 4 (sample 0) and save them in outfile.vcf.
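The selection rule described above can be sketched as follows. This is an illustration only, not the uniqueVariants.awk script used for the paper: the function name unique_variants is hypothetical, and the sketch assumes the standard VCF layout (nine fixed tab-separated columns followed by samples 0 to 2, with the genotype as the first colon-separated subfield). It also assumes that a locus with a missing call (".") in the chosen sample should be skipped.

```sh
#!/bin/sh
# Hypothetical sketch: keep loci where none of the alleles of the chosen
# sample also occurs in either of the other two samples.
unique_variants() {
    awk -F '\t' -v sample="$1" '
    BEGIN { sample += 0 }                # force a numeric sample index
    /^#/ { print; next }                 # pass header lines through
    {
        # Collect the allele indices seen in the two other samples.
        split("", others)
        for (s = 0; s < 3; s++) {
            if (s == sample) continue
            split($(10 + s), f, ":")
            n = split(f[1], gt, "[/|]")
            for (i = 1; i <= n; i++) others[gt[i]] = 1
        }
        # Reject the locus if any allele of the chosen sample is shared
        # or missing.
        split($(10 + sample), f, ":")
        n = split(f[1], gt, "[/|]")
        unique = (n > 0)
        for (i = 1; i <= n; i++)
            if (gt[i] == "." || (gt[i] in others)) unique = 0
        if (unique) print
    }'
}
```

Under these assumptions, unique_variants 0 < infile.vcf > outfile.vcf would select the Hawaii 4 loci.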

heterozygousVariants.awk
This script extracts .vcf lines denoting loci where the given variety is heterozygous (has two or more alleles). The variety to analyze may be specified by setting the "first" variable.
For example, a command such as ./heterozygousVariants.awk -v first=2 infile.vcf > outfile.vcf will take the lines from infile.vcf where Yellow Wonder (sample 2) is heterozygous and save them in outfile.vcf.

homozygousVariants.awk
This script extracts .vcf lines denoting loci where the given variety is homozygous (has only one allele). The variety to analyze may be specified by setting the "first" variable.
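The heterozygous and homozygous tests described above differ only in how many distinct alleles the chosen sample carries, so both can be sketched with one helper. This is an illustration under assumptions, not the authors' scripts: the function name zygosity_filter and its "het"/"hom" mode argument are hypothetical, the standard VCF column layout is assumed (samples 0 to 2 in columns 10 to 12, genotype first in each sample field), and loci with a missing call (".") are skipped.

```sh
#!/bin/sh
# Hypothetical sketch: zygosity_filter MODE SAMPLE, where MODE is "het"
# (two or more distinct alleles) or "hom" (exactly one distinct allele).
zygosity_filter() {
    awk -F '\t' -v mode="$1" -v first="$2" '
    /^#/ { print; next }                 # pass header lines through
    {
        split($(10 + first), f, ":")     # genotype is the first subfield
        n = split(f[1], gt, "[/|]")      # alleles split on / or |
        split("", seen)
        distinct = 0
        for (i = 1; i <= n; i++) {
            if (gt[i] == ".") next       # skip loci with a missing call
            if (!(gt[i] in seen)) { seen[gt[i]] = 1; distinct++ }
        }
        if (mode == "het" && distinct >= 2) print
        if (mode == "hom" && distinct == 1) print
    }'
}
```

With this sketch, zygosity_filter het 2 < infile.vcf would keep the loci where sample 2 (Yellow Wonder) is heterozygous, and zygosity_filter hom 2 < infile.vcf the loci where it is homozygous.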

differingVariants.awk
This script extracts .vcf lines denoting loci where the two given varieties differ (do not share any alleles). The varieties to analyze may be specified by setting the "first" and "second" variables.
For example, a command such as ./differingVariants.awk -v first=2 -v second=1 infile.vcf > outfile.vcf will take the lines from infile.vcf where Yellow Wonder (sample 2) and Rügen (sample 1) do not share any alleles and save them in outfile.vcf.
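The "do not share any alleles" test can be sketched as a set-disjointness check on the two genotypes. As before, this is only an illustration and not the authors' differingVariants.awk: the function name differing_variants is hypothetical, the standard VCF column layout is assumed, and a missing call (".") in either sample is assumed to exclude the locus.

```sh
#!/bin/sh
# Hypothetical sketch: keep loci where the two chosen samples have no
# allele in common.
differing_variants() {
    awk -F '\t' -v first="$1" -v second="$2" '
    /^#/ { print; next }                 # pass header lines through
    {
        split($(10 + first), f, ":")
        na = split(f[1], a, "[/|]")
        split($(10 + second), g, ":")
        nb = split(g[1], b, "[/|]")
        # Record the alleles of the first sample.
        split("", inFirst)
        for (i = 1; i <= na; i++) {
            if (a[i] == ".") next        # missing call: skip the locus
            inFirst[a[i]] = 1
        }
        # Drop the locus if the second sample shares any of them.
        for (j = 1; j <= nb; j++)
            if (b[j] == "." || (b[j] in inFirst)) next
        print
    }'
}
```

Under these assumptions, differing_variants 2 1 < infile.vcf > outfile.vcf would compare Yellow Wonder with Rügen.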

commonVariants.awk
This script extracts .vcf lines denoting loci where the two given cultivars are the same, meaning that they are both homozygous for the same allele and do not share this allele with the remaining cultivar. The varieties to analyze may be specified by setting the "first" and "second" variables.
For example, a command such as ./commonVariants.awk -v first=0 -v second=1 infile.vcf > outfile.vcf will take the lines from infile.vcf where Hawaii 4 (sample 0) and Rügen (sample 1) are homozygous for the same allele and this allele is not found in Yellow Wonder. It will save these lines in outfile.vcf.
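The three conditions above (both samples homozygous, same allele, allele absent from the remaining sample) can be sketched as follows. This is an illustration, not the authors' commonVariants.awk: the names common_variants and hom_allele are hypothetical, the standard VCF column layout with exactly three samples is assumed, and the remaining sample is derived as 3 - first - second, which only holds for sample numbers 0 to 2.

```sh
#!/bin/sh
# Hypothetical sketch: keep loci where two samples are homozygous for the
# same allele and the third sample does not carry that allele.
common_variants() {
    awk -F '\t' -v first="$1" -v second="$2" '
    # Return the single allele of a homozygous genotype, or "" when the
    # genotype is heterozygous, missing, or empty.
    function hom_allele(field,    f, gt, n, i) {
        split(field, f, ":")
        n = split(f[1], gt, "[/|]")
        if (n == 0) return ""
        for (i = 1; i <= n; i++)
            if (gt[i] == "." || gt[i] != gt[1]) return ""
        return gt[1]
    }
    /^#/ { print; next }                 # pass header lines through
    {
        third = 3 - first - second       # index of the remaining sample
        a = hom_allele($(10 + first))
        if (a == "" || a != hom_allele($(10 + second))) next
        split($(10 + third), f, ":")
        n = split(f[1], gt, "[/|]")
        for (i = 1; i <= n; i++) if (gt[i] == a) next
        print
    }'
}
```

With this sketch, common_variants 0 1 < infile.vcf > outfile.vcf would select loci shared by Hawaii 4 and Rügen but not Yellow Wonder.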

VCF to CSV
The following scripts both extract information from a .vcf file and save it to a .csv file.

vcf2csv.awk
This extracts the chromosome, location, and variant list for each cultivar for each variant and saves them to a .csv file that can be opened in a spreadsheet program. Each cultivar has a column in the .csv file that lists the sequences of the alleles it has, separated with a vertical bar ("|") if more than one is present.
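The conversion described above can be sketched as follows, to show how allele indices in the genotype map back to sequences. This is an illustration, not the authors' vcf2csv.awk: the function name vcf_to_csv and the column headers are hypothetical, and the sketch assumes the standard VCF convention that allele 0 is the REF sequence and alleles 1, 2, ... are the comma-separated ALT sequences. Missing calls (".") are assumed to produce an empty cell.

```sh
#!/bin/sh
# Hypothetical sketch: one CSV row per variant with chromosome, position,
# and the distinct allele sequences of each of the three samples.
vcf_to_csv() {
    awk -F '\t' '
    BEGIN {
        OFS = ","
        print "chromosome", "position", "sample0", "sample1", "sample2"
    }
    /^#/ { next }                        # header lines carry no variants
    {
        # Allele 0 is REF; alleles 1..k are the ALT entries.
        split("", seq)
        seq[0] = $4
        k = split($5, alt, ",")
        for (i = 1; i <= k; i++) seq[i] = alt[i]
        row = $1 OFS $2
        for (s = 0; s < 3; s++) {
            split($(10 + s), f, ":")
            n = split(f[1], gt, "[/|]")
            split("", seen)
            cell = ""
            for (i = 1; i <= n; i++) {
                if (gt[i] == "." || (gt[i] in seen)) continue
                seen[gt[i]] = 1
                sep = (cell == "") ? "" : "|"
                cell = cell sep seq[gt[i]]
            }
            row = row OFS cell
        }
        print row
    }'
}
```

Under these assumptions, vcf_to_csv < infile.vcf > outfile.csv would write one row per variant line.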
The vcf2csv.awk script was used in the generation of the supplemental Excel files in the paper.

extractLocs
This script is similar to the above but will only extract the chromosome and location of each .vcf line. Its output is intended for use in generating the histograms using the R scripts below. It requires Zsh to be installed (which it is by default in Mac OS but may not be in some Linux systems).
./extractLocs infile.vcf > outfile.csv

Fixing a GFF file for use with snpEff

fix-gff.awk
This script will fix a Phytozome .gff3 file so that snpEff can read it. The Fvesca_226_genes.gff3 file taken from Phytozome had the frame offset field for each exon calculated in a way different from what snpEff expects, and so snpEff did not interpret it correctly. This script was written to process the .gff3 file so as to correct this incompatibility. It is not guaranteed to work with other .gff files. The input and output should not be the same file.
./fix-gff.awk infile.gff > outfile.gff

Removing variants in high-read regions
The scripts in this category are used to remove variants that are in regions where the read count is too high.

extractHCRanges.awk
This script is the first stage of the process of cutting out variants that are in regions where the read count is too high (any such region likely represents a sequencing anomaly and all variants found there should therefore be considered unreliable). The task of this first script is to mark the regions of the genome within which variants should be excluded. It takes as input a coverage file produced by the genomeCoverageBed utility found in BedTools (http://bedtools.readthedocs.org). This utility should be run on the final .bam file, the one that is also fed into the variant calling software. Run it with the -bg and -ibam options, as in this example for Hawaii 4:
genomeCoverageBed -bg -ibam Hawaii4Recal.bam > Hawaii4.bg
This will produce the file Hawaii4.bg, which can be fed into extractHCRanges.awk. A read cutoff may be specified to extractHCRanges.awk with -v cutoff=50 (50 was the cutoff used for the analysis described in this paper). To continue the same Hawaii 4 example: