Accounting for diverse evolutionary forces reveals mosaic patterns of selection on human preterm birth loci

Currently, there is no comprehensive framework to evaluate the evolutionary forces acting on genomic regions associated with human complex traits and contextualize the relationship between evolution and molecular function. Here, we develop an approach to test for signatures of diverse evolutionary forces on trait-associated genomic regions. We apply our method to regions associated with spontaneous preterm birth (sPTB), a complex disorder of global health concern. We find that sPTB-associated regions harbor diverse evolutionary signatures including conservation, excess population differentiation, accelerated evolution, and balanced polymorphism. Furthermore, we integrate evolutionary context with molecular evidence to hypothesize how these regions contribute to sPTB risk. Finally, we observe enrichment in signatures of diverse evolutionary forces in sPTB-associated regions compared to genomic background. By quantifying multiple evolutionary forces acting on sPTB-associated regions, our approach improves understanding of both functional roles and the mosaic of evolutionary forces acting on loci. Our work provides a blueprint for investigating evolutionary pressures on complex traits.


nature research | reporting summary
April 2020

For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences
Behavioural & social sciences
Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.
Sample size

Data exclusions
Replication
Randomization
Blinding

xmfa_parsimony.pl - Calculates the ancestral reconstruction for each variant.
calc_r2.py - Calculates pairwise LD for input GWAS SNPs using plink.
calc_summary_stat.py - Calculates z-scores for each input locus per annotation across control SNPs.
clump_snps.py - Takes a GWAS summary statistics file, then 1) runs plink clump and 2) bins all GWAS variants by different levels of LD.
combine_annotations.py - Combines many output files from extract_from_bed.py.
combine_control_sets.pl - Combines all the control SNPs with LD SNPs into one text file, one column per control set.
expand_control_set.py - Takes GWAS clumped input SNPs and their corresponding SNPSNAP control SNPs and adds control SNPs in LD with lead SNPs.
extract_all_snps_from_bed.py - Takes all control SNPs, splits them by chromosome, runs bedtools, and intersects on the annotation file.
extract_from_bed.py - Runs bedtools intersect on control set SNPs against a given annotation file.
get_ld_partners.py - Calculates LD partners.
get_rsID_from_input_gwas.py - Gets the rsID from the input GWAS for given chr:pos coordinates.
plot_annotation.py - Creates violin plots comparing the GWAS input locus value for an annotation to the control sets, and also calculates an empirical p-value per locus.
python_slurm_script.slurm - Runs clump_snps.py.
qc_sets.py - Run after running expand_control_set.py. Searches through all files in log\, matching_sets\, and *_annotate\, then extracts the min and max value of the control sets from the file names and reports missing sets between the min and max values.
snpsnap_overlap_with_annotation.py - Runs bedtools intersect on SNPSNAP database SNPs and user-specified annotation files.
summarize_sets.py - Run after running expand_control_set.py. Concatenates the matching_summary.tsv file for each control set into one file, then summarizes by various quality metrics and creates plots.
updated_control_sets.py - Run after combine_control_sets.py. Adds columns to the combine_control_sets.tsv file if that control set does not exist.
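
The two statistics named in the script descriptions, a z-score for each input locus relative to control SNPs and an empirical p-value per locus, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the use of the sample standard deviation, the one-sided direction of the test, and the add-one correction are all assumptions made for illustration, and the control values are invented.

```python
import statistics


def locus_z_score(observed, control_values):
    """Z-score of one locus value relative to its matched control sets.

    Assumption: the sample standard deviation of the controls is used.
    """
    mean = statistics.fmean(control_values)
    sd = statistics.stdev(control_values)
    if sd == 0:
        raise ValueError("control values have zero variance")
    return (observed - mean) / sd


def empirical_p_value(observed, control_values):
    """One-sided empirical p-value for an unusually high locus value.

    Assumption: counts controls at least as large as the observed value
    and adds one to numerator and denominator so that p is never zero.
    """
    n_extreme = sum(1 for v in control_values if v >= observed)
    return (n_extreme + 1) / (len(control_values) + 1)


# Hypothetical annotation values for five matched control sets.
controls = [0.10, 0.12, 0.08, 0.11, 0.09]
z = locus_z_score(0.20, controls)
p = empirical_p_value(0.20, controls)
print(f"z = {z:.3f}, empirical p = {p:.3f}")
```

With only five control sets the smallest attainable p-value under this correction is 1/6, which shows why the resolution of an empirical p-value depends on the number of control sets.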
All the data used in this study were obtained from the public domain (see the URLs below) or deposited in a figshare repository at DOI: 10.6084/m9.figshare.c.4602905.
The original dataset of 10,000 variants was filtered for lead variants with a p-value of 10E-4 before any downstream analysis. This was done to filter out the variants least likely to be associated with sPTB.
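
A filtering step of this kind might look like the following sketch. It rests on stated assumptions rather than on the authors' pipeline: "10E-4" is read here as a cutoff of 1e-4, variants strictly below the cutoff are retained, and the column names `snp` and `p_value` and the example rows are hypothetical.

```python
import csv
import io

# Assumption: the reported "10E-4" is interpreted as a cutoff of 1e-4.
P_THRESHOLD = 1e-4


def filter_lead_variants(rows, threshold=P_THRESHOLD):
    """Keep rows whose GWAS p-value is strictly below the threshold."""
    return [row for row in rows if float(row["p_value"]) < threshold]


# Hypothetical tab-separated summary statistics with invented values.
example = io.StringIO(
    "snp\tp_value\n"
    "rs_a\t3e-6\n"
    "rs_b\t0.02\n"
    "rs_c\t9e-5\n"
)
rows = list(csv.DictReader(example, delimiter="\t"))
kept = filter_lead_variants(rows)
kept_ids = [row["snp"] for row in kept]
print(kept_ids)
```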
Biological or technical replicates do not apply to our computational analysis of sPTB-associated variants.

NA
Blinding was not applicable for this analysis because sPTB-associated variants were selected based on GWAS P-value before any downstream analysis. No treatments or interventions are used in this study. Results reported are based on an unbiased quantitation based on