Pantheon 1.0, a manually verified dataset of globally famous biographies

We present the Pantheon 1.0 dataset: a manually verified dataset of individuals that have transcended linguistic, temporal, and geographic boundaries. The Pantheon 1.0 dataset includes the 11,341 biographies present in more than 25 languages in Wikipedia and is enriched with: (i) manually verified demographic information (place and date of birth, gender), (ii) a taxonomy of occupations classifying each biography at three levels of aggregation, and (iii) two measures of global popularity: the number of languages in which a biography is present in Wikipedia (L), and the Historical Popularity Index (HPI), a metric that combines information on L, time since birth, and page-views (2008–2013). We compare the Pantheon 1.0 dataset to data from the 2003 book, Human Accomplishments, and also to external measures of accomplishment in individual games and sports: Tennis, Swimming, Car Racing, and Chess. In all of these cases we find that measures of popularity (L and HPI) correlate highly with individual accomplishment, suggesting that measures of global popularity proxy the historical impact of individuals.

three levels of aggregation (e.g., Arts/Fine Arts/Painting), and (iii) measures of global visibility (fame) including the number of languages in which a biography is present in Wikipedia, the monthly page-views received by a biography (2008-2013), and a global visibility metric we name the Historical Popularity Index (HPI). We validate our measures of global visibility (HPI and Wikipedia language editions) using external measures of accomplishment in several cultural domains: Tennis, Swimming, Car Racing, and Chess.
In all of these cases we find that measures of accomplishment and fame (HPI) correlate with an R² ≥ 50%, suggesting that measures of global fame are appropriate proxies for measures of accomplishment.

Previous efforts
Past studies to quantify culture have focused on either cultural behavior (norms, expectations, values, beliefs), or cultural production, but not both. Efforts to quantify cultural behavior, such as the World Value Survey 6, have focused primarily on survey methods and self-reported questionnaires 7. These methods have advanced important representations of cultural behavior, such as the well-known Inglehart-Welzel cultural map, which organizes cultures along two axes (secular versus traditional values, and survival versus self-expression values) 8.
Past efforts to quantify cultural production and human accomplishments include Charles Murray's Human Accomplishments book and catalog, which contributed an inventory of 4,002 significant individuals within the domains of arts and sciences 9; the digitized text study self-branded by its authors as Culturomics 10; efforts focused on structuring Wikipedia data 11; and efforts quantifying the impact of individuals across a more diverse set of cultural domains 12, 13. Most efforts, however, have looked only at the popularity of individuals in a few languages (predominantly in English) and lack a controlled vocabulary of cultural domains that can be used to categorize individuals and their cultural products. The latter is an essential contribution of our dataset, since without a categorization of cultural domains it is not possible to study differences between the types of cultural products generated at different time periods and in different geographies.
Table 1 provides a non-exhaustive comparison of a sample of the datasets previously constructed for quantifying culture with the Pantheon 1.0 dataset that we document here. All of the aforementioned datasets capture and quantify specific aspects of culture or measure cultural production at a limited scope. Each of these datasets tracks a subset of five dimensions (cultural domains, occupations, time period, language and country), but not all dimensions. Pantheon builds upon the previous work on characterizing the production of cultural accomplishments by introducing a new dataset that contributes a controlled vocabulary to classify biographies into cultural domains, manually curated time periods, geographies, and languages, and full coverage of the five dimensions listed above. This allows us to link time, geography, gender, languages, and cultural domains.

Data Collection
To measure the impact and development of cultural expressions, it would be ideal to use data that encompasses all cultural expressions, across all languages, with high spatiotemporal resolution and disaggregated across a standardized cultural classification schema. Since no such dataset at this scale currently exists, we create a narrower dataset using data from Freebase and 277 language editions of Wikipedia. Since people are the source of cultural information, we use globally known individuals as a proxy for cultural production. Figure 1 summarizes the main components of the workflow used to create the Pantheon dataset. We derive our dataset of cultural production from Freebase's entity knowledge graph and add metadata from Wikipedia accessible through its API. Freebase organizes information as uniquely identified entities with associated types and properties defined by a structured (but uncontrolled) data ontology. Therefore, to identify globally known historical and current cultural figures, we first determined a list of individuals through Freebase's database of all entities classified as Persons 18. The individuals are filtered based on the availability of personal and demographic details: we considered only entities that included birthdate, birthplace, gender, and occupation. Next, we linked each individual to their English Wikipedia page using their unique Wikipedia article id, and from there we obtained information about additional language editions using the Wikipedia API. The set of individual ids represents a snapshot of Freebase's Persons entity as of May 2013, since the dataset of individuals was obtained in early 2013. We supplement the data with monthly page view data for each individual, across all language editions, taken from the Wikipedia data dumps 19 and covering Jan. 2008 through Dec. 2013.
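The completeness filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (`birthdate`, `birthplace`, `gender`, `occupation`) and the toy records are hypothetical and do not reproduce the Freebase schema or the authors' pipeline.

```python
# Hypothetical field names; the real Freebase property names are not given in the text.
REQUIRED_FIELDS = ("birthdate", "birthplace", "gender", "occupation")


def has_required_fields(entity):
    """Return True when every required field is present and non-empty."""
    return all(entity.get(field) not in (None, "", []) for field in REQUIRED_FIELDS)


def filter_entities(entities):
    """Keep only the person records that carry all four demographic fields."""
    return [entity for entity in entities if has_required_fields(entity)]


# Toy records, invented purely for illustration.
sample = [
    {"name": "Person A", "birthdate": "1900-01-01", "birthplace": "City X",
     "gender": "female", "occupation": ["Writer"]},
    {"name": "Person B", "birthdate": "1950-06-15", "birthplace": None,
     "gender": "male", "occupation": ["Actor"]},
]
kept = filter_entities(sample)
print([entity["name"] for entity in kept])
```

Person B is dropped because the birthplace is missing, which mirrors the rule that only entities with all four attributes are considered.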
Since no globally standardized classification system currently exists for cultural production, we introduce a normalized hierarchy of cultural domains, industries and occupations. This new classification system allows us to classify biographies at three levels of aggregation. Figure 2 shows the entire cultural domain classification, with detail on all three levels of the classification hierarchy. To create this classification hierarchy, we use raw data on individual occupations from Freebase to create a normalized listing of occupations; for example, we map "Entrepreneur", "Business magnate", and "Business development" to the normalized occupation of "Businessperson". We grouped normalized occupations into a second-tier classification (called industries) and into top-level cultural domains. We associate individuals within the dataset to a cultural domain based on the occupation that best encompasses their primary area of cultural contribution. Thus, we link all individuals to exactly one occupation; for example, Barack Obama is a politician (although he is also listed as a writer on Freebase), and Shaquille O'Neal is a basketball player (although he is also listed as an actor on Freebase). By assigning individuals to only one primary cultural domain we avoid double-counting individuals within the dataset.
Yet we also introduce the limitation of restricting the contribution of polymaths to a single domain. The challenge of fairly distributing the cultural impact of polymaths is left for future consideration.
From left to right: domain (e.g., Sports), industry (e.g., Team Sports) and occupation (e.g., Soccer Player).
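The two-step classification (normalize the raw occupation label, then look up its industry and domain) can be sketched as a pair of lookup tables. The mapping entries below are the examples quoted in the text; the function names and the table layout are assumptions made for illustration, and the taxonomy table is deliberately incomplete.

```python
# Raw Freebase-style labels mapped to a normalized occupation (examples from the text).
RAW_TO_NORMALIZED = {
    "Entrepreneur": "Businessperson",
    "Business magnate": "Businessperson",
    "Business development": "Businessperson",
}

# Normalized occupation -> (domain, industry). Only the example named in the text
# is filled in; a real table would cover every occupation in the hierarchy.
TAXONOMY = {
    "Soccer Player": ("Sports", "Team Sports"),
}


def normalize(raw_occupation):
    """Map a raw label to its normalized occupation, leaving unknown labels unchanged."""
    return RAW_TO_NORMALIZED.get(raw_occupation, raw_occupation)


def classify(primary_occupation):
    """Return the three-level classification for one primary occupation."""
    occupation = normalize(primary_occupation)
    domain, industry = TAXONOMY.get(occupation, (None, None))
    return {"domain": domain, "industry": industry, "occupation": occupation}


print(normalize("Business magnate"))
print(classify("Soccer Player"))
```

Because `classify` takes a single primary occupation, each individual ends up in exactly one cell of the hierarchy, which is the property that avoids double-counting.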
In terms of location assignment, we attribute individuals to a place of birth by country, based on current political boundaries. We use present-day political boundaries because of the lack of a historical geocoding API to attribute geographic boundaries using latitude, longitude, and time. Birthplaces were obtained by scraping both Freebase and Wikipedia, and further refined by using fuzzy location matching and geocoding within the Yahoo Placemaker 20 and Google Maps geocoding APIs 21, and by manual curation. First, we normalize the raw data from Freebase indicating the city of birth by latitude and longitude using the fuzzy location matching available within the geocoding APIs. Using the coordinates obtained through the APIs, individuals are then mapped to countries based on present-day geographic boundaries using the reverse geocoding API available on geonames.org. For example, individuals born in Moscow during the Soviet Union era are associated with Russia. Using present-day boundaries provides a consistent basis for matching individuals to countries, and mitigates the technological limitation of the lack of existing historical geocoding APIs for attributing geographic boundaries using latitude, longitude, and time. Historically, birthplace is a fairly suitable way of associating individuals to countries; however, given the increase of human mobility over time 13 and the net migration gains experienced by developed regions 22, future refinement of the dataset may include improving the attribution of individuals to the most relevant geographic area from which their cultural contribution emerges.

Cultural Production, Visibility, and Fame
We introduce metrics of popularity that help us capture the relative visibility of each biography in our dataset. The fame, or visibility, of historical characters is estimated using two measures. The simpler of the two measures, denoted as L, is the number of different Wikipedia language editions that have an article about a historical character.
The documentation of an individual in multiple languages is a good first approximation for their global fame because it points to individuals associated with accomplishments or events that have been noted globally. The use of languages as a criterion for inclusion in our dataset helps us differentiate between global culture (information that has broken linguistic and geographic barriers) and local culture (information that is contained primarily in a few language groups).
We also introduce the Historical Popularity Index (HPI), a more nuanced metric for global cultural impact that also includes: the time elapsed since the birth of the historical character, or his/her Age in the dataset (A), calculated as 2013 − birth year; an L* measure that adjusts L by accounting for the concentration of pageviews among different languages (to discount characters with pageviews mostly in a few languages); the coefficient of variation (CV) in pageviews across time (to discount characters that have short periods of popularity); and the number of non-English Wikipedia pageviews (v^NE), to further reduce any English bias. In addition, to dampen the recency bias of the data, HPI is adjusted for individuals known for less than 70 years.
Equation 1 provides the full calculation for HPI. A detailed breakdown of HPI is provided in the Appendix.
Tables 2 and 3 show the ten people with the highest L and HPI, respectively, for a few selected periods. An individual is assigned to a period according to his or her date of birth. Here we see that the most notable biographies for each period are associated primarily with well-known historical characters. The dataset is restricted to the 11,341 biographies with a presence in more than 25 different languages in Wikipedia (L>25). The choice of the L>25 threshold is guided by a combination of criteria, based on the structure of the data and the limits of manual data curation. Figure 3 shows the cumulative distribution of biographies on a semi-log plot, as a function of the number of languages in which each of these biographies has a presence. Most of the 997,276 biographies surveyed have a presence in a few languages, such that the L>25 threshold is a high mark that can help filter the most visible of these

Biases & Limitations
As with all large data collection efforts, Pantheon is coupled with limitations and biases, which should be considered carefully when interpreting the dataset. This dataset should be interpreted narrowly, as a view of global cultural production that emerges from the multilingual expression of historical figures in Wikipedia as of May 2013. The main biases and limitations of the dataset come from:
1. The use of Wikipedia as a data source.
2. The use of place of birth to assign locations.
3. The use of biographies as proxies for cultural production.

The use of Wikipedia as a data source
The data is limited by the set of people who contribute to Wikipedia. Wikipedia editors are not considered to be a representative sample of the world population, but a sample of publicly-minded knowledge specialists who are willing and able to dedicate time and effort to contribute to the online documentation of knowledge. Wikipedia editors have an English bias, a Western bias, and a gender bias towards males, and they tend to be highly educated and technically inclined. They are also more prevalent among developed countries with Internet access 24. Wikipedia also has a considerable bias in the inclusion of people from different categories. This bias could be the result of differences in the notability criteria in Wikipedia for biographies from different domains 25, or of systematic biases within the Wikipedia community 26.
By using data from all Wikipedia language editions we are effectively reducing a bias that would favor the local culture of English speakers. As an example, we note that there is only one American Football Player in the dataset: O.J. Simpson. Certainly, his global notoriety is not purely from his football career, showing that the use of many languages reduces the English bias of the dataset (famous American Football players, such as Peyton Manning, Tom Brady and Joe Montana, all have a large presence in the English Wikipedia, but fail to meet the L>25 threshold). In comparison, the dataset contains over 1,000 soccer players, showing that soccer is a sport that is integrated within global culture.

The use of place of birth to assign locations
Individuals were assigned to geographic locations using their place of birth, based on present-day political boundaries. Country assignments were complemented with geocoding APIs for normalization and manual curation (to correct for errors in API and completeness). Place of birth is one way of assigning a location to an individual that allows us to assign locations in a comprehensive and consistent manner. Yet, there are biases and limitations that need to be considered when using this location assignment method. An important limitation is the inability to account for individuals who became globally known after immigrating to another country. Would Neruda, Picasso or Hemingway be as famous if they had not participated in the Parisian art scene? The place where an individual was born is often different from the place where that individual made his or her most important contributions. In some cases, the contributions are made in a number of different places, and the use of birthplace is unable to capture where the contributions were made. This is particularly true for athletes who migrate to the world's most competitive leagues, or artists who move to the artistic centers of their time. In this dataset, such moves are not represented, since programmatically geocoding birthplaces is more consistent than registering the place where each individual made his or her most significant contribution, which can only be found through the unstructured data buried in historical narratives.

The use of biographies as proxies for cultural production
Using biographies to approximate cultural production excludes accomplishments where a clear connection between a cultural expression and its creator is not apparent, or where a cultural accomplishment cannot be attributed to a single individual. For example, consider collective enterprises where the accomplishments are the results of teams and not isolated individuals. Examples of accomplishments that are likely to be excluded include the works of music bands or orchestras, or the products of a firm, where the accolades collected from accomplishments are connected to a firm, or brand, rather than to an individual.

Other Technical Limitations
Other biases and limitations include the volatility of Wikipedia and other online resources, which makes the results presented here imperfectly reproducible. For example, the Yahoo Placemaker API, which was used for mapping individuals to countries by birthplace, has recently been deprecated and is no longer publicly available. Also, the set of included individuals is static and does not reflect events after early 2013; as such, culturally impactful individuals who only recently rose to global prominence, including Pope Francis and Narendra Modi, are excluded from this dataset.

Data Records
The Pantheon dataset is publicly available on the Harvard Dataverse Network and can be accessed directly at: http://thedata.harvard.edu/dvn/dv/pantheon. The dataset is visualized at http://pantheon.media.mit.edu, a data visualization engine that allows users to dynamically explore the dataset through interactive visualizations.

2013.tsv (Data Citation 1).
The first file, pantheon.tsv, is a flattened tab-delimited table, where each row of the table represents a unique biography. Each row contains the following variable fields:
• name - name of the historical character (in English)
• en_curid - unique identifier for each individual biography, maps to the pageid from Wikipedia *. To map to an individual's biography in Wikipedia, use the en_curid field as an input parameter to the following URL: http://en.wikipedia.org/?curid=[en_curid].
• countryCode - ISO 3166-1 alpha2 country code (based on present-day political boundaries)
• countryCode3 - ISO 3166-1 alpha3 country code (based on present-day political boundaries)
• countryName - commonly accepted name of country
• continentName - name of continent
• birthyear - birthyear of individual
• birthcity - given birthcity of individual
• occupation - lowest level cultural domain (by their given occupation)
• industry - second level classification of cultural domains
• domain - highest level aggregation of cultural domains
• gender - male or female
• 2008-01 through 2013-12 - total pageviews for the given month (denoted by the column header)
* We use the English curid as the unique identifier in the Pantheon dataset; we confirmed that all biographies with L > 25 as of May 2013 had an entry in the English Wikipedia.
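As an illustration of how the fields above fit together, the following sketch reads a tab-delimited table with Python's standard `csv` module, builds the Wikipedia URL from `en_curid`, and sums the monthly pageview columns. The inline sample row is a made-up placeholder rather than a record from the dataset, only a subset of the columns is included, and the helper names are assumptions.

```python
import csv
import io
import re

# Monthly pageview columns are named like "2008-01" ... "2013-12".
MONTH_COLUMN = re.compile(r"^\d{4}-\d{2}$")


def wikipedia_url(en_curid):
    """Build the English Wikipedia URL for a biography from its en_curid."""
    return f"http://en.wikipedia.org/?curid={en_curid}"


def total_pageviews(row):
    """Sum every monthly pageview column present in one row."""
    return sum(int(float(value)) for key, value in row.items()
               if MONTH_COLUMN.match(key) and value)


def read_rows(handle):
    """Parse a tab-delimited file object into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Placeholder content, invented for illustration; not a record from pantheon.tsv.
SAMPLE_TSV = (
    "name\ten_curid\tcountryCode\tbirthyear\t2008-01\t2008-02\n"
    "Example Person\t12345\tXX\t1900\t10\t20\n"
)

rows = read_rows(io.StringIO(SAMPLE_TSV))
print(wikipedia_url(rows[0]["en_curid"]), total_pageviews(rows[0]))
```

To read the real file, the `io.StringIO` object would be replaced by an opened file handle, for example `open("pantheon.tsv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.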

Technical Validation
Following an approach similar to that of Murray (2003) 9, we use external measures of individual accomplishments to validate the Pantheon dataset. Unfortunately, many cultural domains are not characterized by external metrics of accomplishment that we can associate to individuals, so we restrict our validation to cultural domains where measures of individual accomplishment are available, namely individual sports. The achievements of individual sportsmen and women can be quantitatively expressed through measures such as the number of championship titles won or points scored. Here, we focus on Formula-1 drivers, tennis players, swimmers and chess players as independent case studies that we can use to validate our metrics of global visibility.

Formula One Racecar Drivers
First we examine the subset of the dataset containing the top 56 Formula-1 drivers, according to the number of languages in which they have a presence in Wikipedia.
For each of these drivers we created an additional dataset with the number of Grand Prix where x_1 is the number of podiums, x_2 is the number of starts, and x_3 is an indicator for whether the individual was killed in action.

Tennis Players
Next, we conduct a similar analysis for Tennis Players. The Tennis player subset focuses on the top 52 Tennis players according to the number of languages in Wikipedia, augmented by additional data on each individual: the number of weeks he/she spent as number one in the ATP or WTA, the number of Grand Slam wins, the top rank ever obtained, and the player's gender (Female = 1, Male = 0). We link the fame of biographies for Tennis Players to the aforementioned variables using an exponential function of these variables, where x_1 is the number of weeks at number one, x_2 is the number of Grand Slam wins, x_3 is the highest rank obtained, and x_4 is the variable for gender.
For the number of language presences in Wikipedia (L), we construct a model which explains 34% of the variance in the multilingual presence of each of these individuals in Wikipedia (Figure 5a). This shows that once again, the number of languages in Wikipedia is a good proxy for individual accomplishments. When we consider HPI, we find an improved model that explains 63% of the variation in HPI.
This further supports the use of HPI as an appropriate proxy for accomplishment, since HPI tracks the degree of achievement for tennis players better than L.
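The exponential link used in these validation models can be sketched as a log-linear least-squares fit. The functional form y = exp(b_0 + b_1 x_1 + ... + b_k x_k), the estimation procedure, the R² computed on the log scale, and all function names below are assumptions made for illustration; the text does not state the exact equation or fitting method, and the data here are synthetic.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_exponential(xs, ys):
    """Fit ln(y) = b0 + sum_k b_k * x_k by ordinary least squares (normal equations)."""
    design = [[1.0] + list(row) for row in xs]
    log_y = [math.log(y) for y in ys]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, log_y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, row):
    """Evaluate exp(b0 + sum_k b_k * x_k); the result is always positive."""
    return math.exp(coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row)))


def r_squared(coeffs, xs, ys):
    """Share of variance in ln(y) explained by the fitted model."""
    log_y = [math.log(y) for y in ys]
    fitted = [math.log(predict(coeffs, row)) for row in xs]
    mean = sum(log_y) / len(log_y)
    ss_res = sum((t - f) ** 2 for t, f in zip(log_y, fitted))
    ss_tot = sum((t - mean) ** 2 for t in log_y)
    return 1.0 - ss_res / ss_tot


# Synthetic, noise-free data generated from known coefficients.
TRUE_COEFFS = [0.5, 0.1, 0.2]
xs = [(0, 1), (1, 0), (2, 3), (3, 1), (4, 4), (5, 2)]
ys = [predict(TRUE_COEFFS, row) for row in xs]
coeffs = fit_exponential(xs, ys)
print(coeffs, r_squared(coeffs, xs, ys))
```

An exponential link is a natural choice when the response (L or HPI) cannot be negative, since the fitted values are positive for any combination of predictors.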

Chess Players
Finally, we perform another analysis using all of the individuals classified as chess players in the Pantheon dataset. In this case, we use data on each individual's highest ELO ranking attained, gender, total games played, and percentage of wins, losses, and draws. We link the fame of chess players to these variables using an exponential function of these variables, where x_1 is the highest ELO ranking attained, x_2 indicates gender, x_3 is the total games played, x_4 is the percentage of wins, x_5 is the percentage of losses, and x_6 is the percentage of draws.
For the number of language presences in Wikipedia (L), we construct a model that explains 37% of the variance in the multilingual presence of each of these individuals in Wikipedia (Figure 7a). This further supports using the number of languages in Wikipedia as a proxy for individual accomplishments. Using HPI (Figure 7b), we find a model that explains 53% of the variation in HPI, demonstrating that HPI is an appropriate proxy for accomplishment, with an improved fit for tracking an individual's achievements.

Discussion
We introduced a dataset on historical cultural production based on biographies that have a presence in more than 25 language editions of Wikipedia. This manually curated dataset allowed us to link cultural domains (based on a classification we also introduced) to places and time. To distinguish between biographies with different levels of visibility we introduced two measures of fame: the number of languages in which an individual has a presence in Wikipedia (L), and the Historical Popularity Index (HPI). We validated these measures of global fame and visibility using external data on the accomplishments of Formula One racecar drivers, tennis players, swimmers, and chess players. In all these cases we find a good match between L, HPI, and the external measures of accomplishment, demonstrating the validity of the measures developed within Pantheon. While these case studies are not exhaustive across all cultural domains, they show that the measures introduced are effective metrics for characterizing cultural production across diverse sets of domains, time, and geography. Consider a Formula One racecar driver. Certainly, for a Formula One racer the number of Grand Prix won, or Championships, would be a better metric of accomplishment than the number of languages in Wikipedia. Yet, since Grand Prix won is a metric that applies only to Formula-1 drivers, it cannot be used for basketball players, swimmers, musicians or scientists. While imperfect, the measures based on the online presence of characters in diverse languages are appropriate proxies for accomplishment and provide metrics that we can use to compare individuals from different cultural domains.
Additional Information (Appendix)

Figure 2: Cultural Domain Classification Hierarchy

biographies. For example, a sampling of the individuals above the L>25 threshold includes globally known individuals such as Charles Darwin, Che Guevara, and Nefertiti. Below the threshold, we find individuals that are locally famous, such as Heather Fargo, who is the former Mayor of Sacramento, California. Also, 95% of individuals passing this threshold have an article in at least 6 of the top 10 spoken languages worldwide (top 10 spoken languages by number of speakers worldwide: Chinese, English, Hindi, Spanish, Russian, Arabic, Portuguese, Bengali, French, Bahasa 23), demonstrating that the Pantheon dataset has good coverage of non-Western languages.

Figure 3: Cumulative Number of Individuals with at least N Wikipedia Language Editions

Wins, Championships Won, Podiums (number of times in the top 3), Starts, and a dummy variable for Killed in Action. These variables are used to construct a statistical model explaining the multilingual presence of each driver within Wikipedia as well as each driver's Historical Popularity Index. Since Grand Prix Wins, Championships and Podiums are highly collinear, and hence not statistically significant when used together, only Podiums are used in the final model. Since neither L nor HPI can be negative, we link the fame of biographies to the aforementioned variables using an exponential function of the form:

Figure 5: Validation Using Tennis Players

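For each biography $i$, we define:

$L_i$ = number of different language editions of Wikipedia for biography $i$

$L^*_i = \exp(H_i)$ = effective number of language editions for biography $i$, where $H_i = -\sum_j \frac{v_{ij}}{\sum_k v_{ik}} \ln\left(\frac{v_{ij}}{\sum_k v_{ik}}\right)$ is the entropy in terms of page views and $v_{ij}$ = total page views of individual $i$ in language $j$

$A_i = 2013 - \text{year of birth}$

$CV_i = \sigma_i / \mu_i$ = coefficient of variation in page views, where $\sigma_i$ = standard deviation in pageviews across all languages and $\mu_i$ = average monthly pageviews

$v^{NE}_i$ = total pageviews in non-English editions of Wikipedia

Using the above, the Historical Popularity Index (HPI) of an individual $i$ is defined as:

$$
\mathrm{HPI}_i =
\begin{cases}
\ln(L_i) + \ln(L^*_i) + \log_4(A_i) + \ln(v^{NE}_i) - \ln(CV_i), & A_i \geq 70 \\
\ln(L_i) + \ln(L^*_i) + \log_4(A_i) + \ln(v^{NE}_i) - \ln(CV_i) - \dfrac{70 - A_i}{7}, & A_i < 70
\end{cases}
$$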
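To make the definitions concrete, here is a minimal Python sketch of the HPI calculation. It assumes the form HPI = ln(L) + ln(L*) + log_4(A) + ln(v^NE) − ln(CV), with an additional penalty of (70 − A)/7 when A < 70, and L* = exp(H) with H the Shannon entropy of the pageview shares across languages. The function names, the use of a population standard deviation over a monthly series for CV, and the input values are assumptions made for illustration, not the authors' code.

```python
import math


def effective_languages(pageviews_by_language):
    """L* = exp(H), where H is the entropy of the pageview shares across languages."""
    total = sum(pageviews_by_language.values())
    shares = [v / total for v in pageviews_by_language.values() if v > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return math.exp(entropy)


def coefficient_of_variation(pageviews):
    """CV = standard deviation / mean (population standard deviation assumed)."""
    n = len(pageviews)
    mean = sum(pageviews) / n
    variance = sum((v - mean) ** 2 for v in pageviews) / n
    return math.sqrt(variance) / mean


def hpi(num_languages, l_star, age, non_english_views, cv):
    """Historical Popularity Index under the assumed functional form."""
    score = (math.log(num_languages) + math.log(l_star) + math.log(age, 4)
             + math.log(non_english_views) - math.log(cv))
    if age < 70:
        # Recency adjustment for individuals known for less than 70 years.
        score -= (70 - age) / 7
    return score


# Illustrative inputs only; these are not values from the dataset.
l_star = effective_languages({"en": 10, "fr": 10, "de": 10, "es": 10})
cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
print(l_star, cv, hpi(30, l_star, 63, 1000, cv))
```

With pageviews spread evenly over four languages, L* equals 4; concentrating the same pageviews in one language drives L* toward 1, which is how the index discounts characters whose attention comes mostly from a few languages.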

Table 1: Comparison Chart of Quantitative Datasets for Studying Culture

and had not yet accumulated a database as robust as what was available within Freebase.
Both Freebase and Wikipedia are open-source, collaborative, multi-lingual knowledge bases freely available online to the general public. While previous efforts have also produced structured datasets based on Wikipedia 11, 12, Pantheon introduces a new dataset including all language editions (not just English), and manually curated time periods, cultural

Table 2: Top 10 Biographies for each Time Period by Number of Language Editions, L

Table 3: Top 10 Biographies for each Time Period by Historical Popularity Index, HPI