An unbiased index to quantify participant’s phenotypic contribution to an open-access cohort

The Personal Genome Project (PGP) is an effort to enroll many participants to create an open-access repository of genome, health and trait data for research. However, PGP participants are not enrolled to study any specific trait, and participants choose which phenotypes to disclose. To measure the extent of and willingness to contribute phenotypes, and to encourage and guide participants in doing so, we developed an algorithm to score and rank the phenotypes and participants of the PGP. The scoring algorithm calculates the participation index (P-index) for every participant, where 0 indicates no reported phenotypes and 100 indicates complete phenotype reporting. We calculated the P-index for all 5,015 participants in the PGP; the values ranged from 0 to 96.7. We found that participants mainly have either high scores (P-index > 90, 29.5%) or low scores (P-index < 10, 57.8%). While there are significantly more male than female participants (1,793 versus 1,271), females tend to have higher P-indexes on average (P = 0.015). We also report the P-indexes of participants by demographics; states such as Missouri and Massachusetts have better P-indexes than states such as Utah and Minnesota. The P-index can therefore be used as an unbiased way to measure and rank participants' phenotypic contribution to the PGP.


Supplementary Tables Description
Table S1 -All Data
This table shows a data dump of every participant of the Harvard Personal Genome Project (PGP) (https://my.pgp-hms.org/users) after the survey normalization process (See Materials and Methods). The data dump was performed on July 6th, 2016.
HUID -Harvard University Identity, the primary key for every participant of the PGP. The key begins with the letters "hu" followed by 6 alphanumeric characters. This key is unique for every PGP participant.
Fieldname -The field or phenotype recorded on the participant's public PGP profile.
Value -The value of the above field or phenotype.
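
The HUID format described above can be checked with a short pattern match. The following is a minimal sketch only: the function name is hypothetical, and accepting both upper- and lower-case characters after the "hu" prefix is an assumption, since the description says only "alphanumeric".

```python
import re

# Pattern taken from the description: "hu" followed by exactly six
# alphanumeric characters. Accepting both letter cases is an assumption.
HUID_PATTERN = re.compile(r"hu[0-9A-Za-z]{6}")


def is_valid_huid(value: str) -> bool:
    """Return True if the whole string matches the described HUID format."""
    return HUID_PATTERN.fullmatch(value) is not None
```

Such a check could be applied to the HUID column before any further processing, so that malformed keys are caught early.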

Table S2 -Phenotype Score
This table shows the phenotype score of all the phenotypes determined from processing Table S1.
Phenotype -All phenotypes formatted to lower case.
Phenotype Score -The Phenotype Score, which is the number of participants with a valid value for that phenotype.
Status -The status indicating whether the phenotype is included in the P-index calculation, which can be "Include" or "Exclude".
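
The Phenotype Score defined above is a per-phenotype count of participants, which can be sketched from the long-format rows of Table S1 (HUID, Fieldname, Value). This is an illustration under stated assumptions, not the published implementation: the function names are hypothetical, and treating any non-empty value as "valid" is an assumption, because the validity rules belong to the survey normalization process.

```python
from collections import defaultdict


def is_valid_value(value) -> bool:
    # Assumption: a value counts as valid when it is present and non-empty.
    return value is not None and str(value).strip() != ""


def phenotype_scores(rows):
    """Count distinct participants with a valid value for each phenotype.

    rows: iterable of (huid, fieldname, value) tuples, as in Table S1.
    Phenotype names are lower-cased, matching the Phenotype column.
    """
    participants = defaultdict(set)
    for huid, fieldname, value in rows:
        if is_valid_value(value):
            participants[fieldname.lower()].add(huid)
    return {name: len(huids) for name, huids in participants.items()}
```

Collecting HUIDs in a set means a participant who reports the same phenotype more than once is counted only once.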

Table S3 -Disease Surveys
This table shows the phenotypes from the 12 disease surveys available to participants. The results on this table aggregate the phenotypes present in the surveys as well as medical record data that participants upload.
Phenotype -The phenotypes listed for each of the surveys.
No. of Yes -The number of participants indicating that they are affected by the phenotype.
No. of No -The number of participants indicating that they are unaffected by the phenotype.
% Yes -The percentage of participants affected by the phenotype.
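
The % Yes column can be derived from the two count columns. The sketch below assumes the denominator is the sum of the Yes and No counts, which the description does not state explicitly; the function name is hypothetical.

```python
def percent_yes(n_yes: int, n_no: int) -> float:
    """Percentage of responding participants affected by a phenotype.

    Assumption: the denominator is n_yes + n_no, i.e. only participants
    who answered either way are counted.
    """
    total = n_yes + n_no
    if total == 0:
        return 0.0  # no responses recorded for this phenotype
    return 100.0 * n_yes / total
```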

Table S4 -Disease prevalence
Like Table S3, this table shows the phenotypes from the 12 disease surveys available, sorted by disease prevalence.
Phenotype -The phenotypes listed for each of the surveys.
No. of Yes -The number of participants indicating that they are affected by the phenotype.
No. of No -The number of participants indicating that they are unaffected by the phenotype.
% Yes -The percentage of participants affected by the phenotype.

Table S5 -P-index ranking
The rankings of all participants using the P-index.
HUID -Harvard University Identity (See Table S1).
Points -The total number of points awarded to each participant. Used as a measure to calculate the P-index.
P-index -The index used to rank every participant, which is a normalized score of the number of points for each participant.
Rank -The ranking for each participant ranging from 1 to T3667. The letter "T" preceding the rank indicates 2 or more ties, i.e. 2 or more participants have the same P-index and therefore are tied in ranking.
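
The normalization and the tie-prefixed ranks described above can be sketched as follows. Two assumptions are made and should not be read as the published method: the P-index is taken to be a linear scaling of points onto 0 to 100 against a hypothetical maximum number of points, and tied participants are assumed to share the position of the first member of their group, with the next rank skipping ahead (competition ranking). All function names are hypothetical.

```python
from collections import Counter


def p_index(points: float, max_points: float) -> float:
    # Assumption: linear scaling so that max_points maps to 100.
    return 100.0 * points / max_points


def rank_with_ties(scores):
    """Map each HUID to a rank label, prefixing tied ranks with "T".

    scores: dict of huid -> P-index. Higher scores rank first.
    Assumption: competition ranking, e.g. 1, T2, T2, 4.
    """
    counts = Counter(scores.values())
    labels = {}
    next_rank = 1
    for score in sorted(counts, reverse=True):
        group_size = counts[score]
        labels[score] = f"T{next_rank}" if group_size > 1 else str(next_rank)
        next_rank += group_size  # skip the positions used by the tied group
    return {huid: labels[score] for huid, score in scores.items()}
```

Under this convention the last label in a cohort is the position at which the lowest-scoring tied group begins, not the total number of participants.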

Table S6 -P-index gender
The P-index of all participants that have indicated their gender.
HUID -Harvard University Identity (See Table S1).
P-Index -The index used to rank every participant (See Table S5).
Gender -The gender/sex of the participant.

Table S7 -QtPhenotypeScore
The phenotype scores for the 88 quantitative traits/phenotypes used for the QtP-index (See Table S8).

Phenotype -The quantitative phenotype.
Phenotype Score -The Phenotype Score for the quantitative phenotype.

Table S8 -QtP-index ranking
The rankings of all participants using the quantitative P-index (QtP-index).
HUID -Harvard University Identity (See Table S1).
Points -The total number of points awarded to each participant from the quantitative trait phenotypes. Used as a measure to calculate the QtP-index.
QtP-index -The index used to rank every participant using the QtP-index.
Rank -The ranking for each participant ranging from 1 to T1642. The letter "T" preceding the rank indicates 2 or more ties, i.e. 2 or more participants have the same QtP-index and therefore are tied in ranking.

Table S9 -P-index state
The P-indexes of participants who contributed information about which state within the United States they reside in.
HUID -Harvard University Identity (See Table S1).
P-Index -The index used to rank every participant (See Table S5).
State -The state of residence of that participant.

Table S10 -median P-index state
The median P-index of every state where there are at least 30 participants.
State -The state of residence of that participant.
No. of Participants -The number of participants from that state.
Median P-index -The median of all the P-index values of participants from that state.
% P-index > 90 -The percentage of participants with P-index greater than 90 from that state.
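
The per-state summary described above (at least 30 participants, median P-index, and percentage above 90) can be sketched from (state, P-index) pairs such as those in Table S9. The function name, the dictionary keys and the input layout are hypothetical; the thresholds come from the table description.

```python
from collections import defaultdict
from statistics import median


def summarize_states(records, min_participants=30, high_cutoff=90.0):
    """Summarize P-indexes per state.

    records: iterable of (state, p_index) pairs.
    States with fewer than min_participants participants are omitted.
    """
    by_state = defaultdict(list)
    for state, value in records:
        by_state[state].append(value)

    summary = {}
    for state, values in by_state.items():
        if len(values) < min_participants:
            continue
        n_high = sum(1 for v in values if v > high_cutoff)
        summary[state] = {
            "n": len(values),
            "median": median(values),
            "pct_high": 100.0 * n_high / len(values),
        }
    return summary
```

Because the scores cluster at the two extremes, the median and the percentage above 90 describe a state better than a mean would.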

Table S11 -P-index zip code
The data used for mapping each participant to a more specific location using their submitted postal zip codes.
HUID -Harvard University Identity (See Table S1).
P-index -The index used to rank every participant (See Table S5).
Provided zip code -The zip code provided by participants.
Formatted zip code -The actual zip code used.
City -The city located using the formatted zip code.
State -The state located using the formatted zip code.
Latitude, Longitude -The map coordinates to pinpoint the location of the participant.
Group ID -The local grouping of each participant. Participants in the vicinity are given the same Group ID.