A polygenic risk score for nasopharyngeal carcinoma shows potential for risk stratification and personalized screening

Polygenic risk scores (PRS) have the potential to identify individuals at risk of disease, optimize treatment, and predict survival outcomes. Here, we construct and validate a genome-wide association study (GWAS)-derived PRS for nasopharyngeal carcinoma (NPC), using a multi-center study of six populations (6 059 NPC cases and 7 582 controls), and evaluate its utility in a nested case-control study. We show that the PRS enables effective identification of individuals at high risk of NPC (AUC = 0.65) and that risk rises with increasing PRS decile in each population (P_trend ranging from 2.79 × 10^-7 to 4.79 × 10^-44). Incorporating the PRS into EBV-serology-based NPC screening increases the test's positive predictive value (PPV) from an average of 4.84% to 8.38% and 11.91% in the top 10% and top 5% of the PRS, respectively. In summary, the GWAS-derived PRS, together with the EBV test, significantly improves NPC risk stratification and informs personalized screening.
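As a rough illustration of the two quantities named in the abstract, the sketch below computes a PRS as a weighted sum of risk-allele dosages and an AUC as the probability that a randomly chosen case scores higher than a randomly chosen control. This is a minimal sketch under stated assumptions, not the pipeline used in the study: the SNP identifiers, weights, and scores are hypothetical, and the function names are introduced only for this example.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1 or 2 per SNP).

    `weights` maps SNP id -> per-allele effect size (e.g. a log odds ratio
    taken from a GWAS); `dosages` maps SNP id -> allele count for one person.
    """
    return sum(weight * dosages[snp] for snp, weight in weights.items())


def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (case, control) pairs in which the case has the
    higher score; ties count as half a win.
    """
    wins = 0.0
    for case_score in case_scores:
        for control_score in control_scores:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Hypothetical SNP ids and effect sizes, for illustration only.
    weights = {"snp_a": 0.26, "snp_b": 0.10, "snp_c": -0.05}
    person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
    print("PRS:", polygenic_risk_score(person, weights))
    print("AUC:", auc([0.9, 0.4, 0.7], [0.3, 0.5, 0.2]))
```

The pairwise AUC is quadratic in the number of subjects, which is acceptable for a sketch; a rank-based computation would be preferable for cohorts of the size described here.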

Four GWAS populations were used for the PRS construction: EPI-NPC-2005 sample (N=3433), NPCGEE sample (N=2089), SYSUNPC sample (N=4227) and Hong Kong sample (N=999). Two independent samples were used for the PRS validation: Guangdong sample (N=2181) and Xinjiang sample (N=701). A total of 1207 individuals from a prospective cohort (PRO-NPC-001) were used for evaluating the discriminatory power of the PRS.
For the sample-size calculation, we assumed an allele frequency of 0.15, an odds ratio of 1.3, an NPC prevalence of 20 per 100 000, and a significance level of α = 5e-8. Under these assumptions, a sample size of about 10 000 is needed to achieve 90% power to identify significant variants. The sample size in this study is about 15 000, which is sufficient to identify SNPs significantly associated with NPC risk.
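The sample-size reasoning above can be checked approximately with a short calculation. The sketch below is an assumption-laden approximation, not the method the authors used: it assumes an allelic (two-proportion) test, a multiplicative per-allele odds ratio, equal numbers of cases and controls, and that the disease is rare enough (20 per 100 000) for the control allele frequency to equal the population frequency. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_frequency(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def allelic_test_power(n_cases, n_controls, control_freq, odds_ratio, alpha):
    """Approximate power of a two-sided allelic test (normal approximation).

    Each subject contributes two alleles, so the two proportions are
    estimated from 2 * n_cases and 2 * n_controls observations.
    """
    p0 = control_freq
    p1 = case_allele_frequency(p0, odds_ratio)
    se = sqrt(p1 * (1.0 - p1) / (2 * n_cases) + p0 * (1.0 - p0) / (2 * n_controls))
    z_effect = abs(p1 - p0) / se
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return NormalDist().cdf(z_effect - z_crit)


if __name__ == "__main__":
    # Assumed 1:1 case-control split; the stated design values are used as-is.
    for n_total in (6000, 10000, 15000):
        half = n_total // 2
        power = allelic_test_power(half, half, 0.15, 1.3, 5e-8)
        print(f"N = {n_total}: power = {power:.3f}")
```

Dedicated tools model the genetic architecture and case-control ratio more carefully, so exact figures may differ from this approximation.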
We conducted standard quality control at the subject and SNP levels. In brief, low-quality variants were removed, and subjects were excluded for the following reasons: (1) unintended technical errors or low genotyping quality; (2) being estimated as biologically related to another subject and having the lower call rate; (3) having an ancestral structure that deviated from that of the underlying study population.
The PRS was well replicated in two independent case-control samples from NPC-endemic (Guangdong sample: N=2181) and non-endemic areas (Xinjiang sample: N=701), as well as in the prospective NPC screening cohort (PRO-NPC-001: N=1207).
In the prospective screening cohort, with 178 960 person-years of follow-up, a total of 89 incident cases were histologically confirmed as NPC. To evaluate the discriminatory power of the PRS in the screening cohort, we randomly selected non-NPC controls matched to the 89 cases by sex and age at a ratio of 15:1, using the propensity score method (R package: "MatchIt"). Of the selected controls, 1 118 had available samples and were used for PRS validation together with the 89 cases.
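To make the 15:1 matching step concrete, the sketch below uses a deliberately simplified stand-in for propensity score matching: exact matching on sex, then greedy nearest-neighbour matching on age, without replacement. It is not the MatchIt procedure used in the study; the record layout (`id`, `sex`, `age`) and the function name are assumptions made for this example.

```python
def match_controls(cases, controls, ratio=15):
    """Greedy k:1 matching: exact on sex, nearest on age, without replacement.

    `cases` and `controls` are lists of dicts with keys "id", "sex", "age".
    Returns a dict mapping each case id to its list of matched controls.
    """
    available = list(controls)
    matched = {}
    for case in cases:
        # Candidates share the case's sex; closest ages come first.
        candidates = [c for c in available if c["sex"] == case["sex"]]
        candidates.sort(key=lambda c: abs(c["age"] - case["age"]))
        chosen = candidates[:ratio]
        matched[case["id"]] = chosen
        # Remove chosen controls so that no control is matched twice.
        chosen_ids = {c["id"] for c in chosen}
        available = [c for c in available if c["id"] not in chosen_ids]
    return matched


if __name__ == "__main__":
    # Synthetic records for illustration only.
    controls = [
        {"id": i, "sex": "M" if i % 2 else "F", "age": 30 + i % 30}
        for i in range(400)
    ]
    cases = [
        {"id": "case1", "sex": "M", "age": 45},
        {"id": "case2", "sex": "F", "age": 52},
    ]
    result = match_controls(cases, controls)
    for case_id, matches in result.items():
        print(case_id, [m["age"] for m in matches])
```

Greedy matching depends on the order in which cases are processed; propensity score matching instead reduces sex and age to a single estimated probability before pairing, which scales better when more covariates are involved.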
No blinding was performed. This was deemed unnecessary since no clinical trials were conducted in this study and none of the analyses reported involved procedures that could be influenced by investigator bias.

March 2021

Human research participants
Policy information about studies involving human research participants

Population characteristics
In EPIC-NPC-2005, 2 472 males and 961 females were included, and the ages of the cases and controls were 49 (42-57) and 49 (42-57) years, respectively. A total of 4 835 244 SNPs were included in the GWAS analysis. In NPCGEE, 1 525 males and 567 females were included, and the ages of the cases and controls were 47 (41-56) and 46 (40-55) years. A total of 5 066 825 SNPs were included in the GWAS analysis. In SYSUNPC, 2 819 males and 1 408 females were included, and the ages of the cases and controls were 43 (37-51) and 61 (58-68) years. A total of 5 099 736 SNPs were included in the GWAS analysis. In the Hong Kong study, 672 males and 327 females were included, and the ages of the cases and controls were 53 (45-60) and 51 (40-58) years. A total of 4 902 485 SNPs were included in the GWAS analysis. In the replication stage, the Guangdong study included 1 479 males and 713 females, and the ages of the cases and controls were 51 (43-57) and 39 (34-47) years. The Xinjiang study included 392 males and 309 females, and the ages of the cases and controls were 48 (42-57) and 36 (29-47) years.

Recruitment
For the six populations used for PRS development and replication, all NPC cases were newly diagnosed and histologically confirmed by at least two pathologists according to the World Health Organization classification criteria for NPC. The controls in the case-control populations were self-reported cancer-free individuals who were frequency matched to cases by geographical region and ancestry. For PRO-NPC-001, the inclusion criteria were 1) being aged 30-59 years; 2) being Cantonese; 3) having no prevalent NPC; 4) having an Eastern Cooperative Oncology Group score of 0-2; and 5) having a good physical or psychological condition and consciousness. The exclusion criteria were 1) having severe cardiovascular, liver, or kidney disease or 2) having prevalent NPC. Written informed consent was obtained from each participant before the study. Participants received no compensation for participation.

Ethics oversight
The Institutional Review Board of Sun Yat-Sen University Cancer Center approved this study.