This page has been archived and is no longer updated

 
June 04, 2012 | By:  Paige Brown
What Statistics Can Teach Us About History (and how not to be afraid of math in a paper)

Guest blog post by Rachael Bloom.

If you're like me, you may have made it through college without a statistics class. I graduated with a biology degree that required lots of calculus, physics, and a variety of chemistry courses. Then I went off to do research in a biology lab, where the only formula I've ever really needed is C1V1=C2V2. That, and statistics. Lots and lots of statistics.

Changing whole curriculums aside - and I still have never found a use for calculus - how do you learn statistics on the job? And more importantly, how do you go about learning the statistics that are relevant to you in your science or day-to-day life?

What any professor will tell you is to go read a paper, and unsurprisingly, they're not wrong. This paper, published in Current Biology in 2008, is a great example. It isn't that recent, but it's fun, it involves Principal Components Analysis (PCA), and, even more importantly, it shows PCA in a way that is easy to understand.

If you've never heard of Principal Components Analysis, don't worry: for our purposes it's not that complicated. In short, the analysis finds patterns in high-dimensional data sets. Most of us like to look at our data, but once you get data that exists on more than three axes (x, y, and z), it's hard to visualize. The analysis generates "principal components" for your data set of interest, and each principal component explains a certain amount of the variance in your data set. Variance is the spread of the data in a data set. Imagine a cloud of points. A small cloud with all the points close together will have less variance than a large cloud with the same number of points spread out. The principal components of a data set show you the pattern in that cloud: the directions along which the points are most spread out.
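To make the "cloud of points" idea concrete, here is a minimal sketch in Python with NumPy. The two clouds are made-up random data, and `total_variance` is just a helper name I chose for illustration; nothing here comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable

# Two made-up clouds with the same number of points (200) in two dimensions.
tight_cloud = rng.normal(loc=0.0, scale=0.5, size=(200, 2))  # points close together
wide_cloud = rng.normal(loc=0.0, scale=3.0, size=(200, 2))   # points spread out

def total_variance(points):
    """Add up the variance along each axis: one number for the cloud's spread."""
    return points.var(axis=0, ddof=1).sum()

tight_var = total_variance(tight_cloud)
wide_var = total_variance(wide_cloud)
print("tight cloud:", tight_var)
print("wide cloud: ", wide_var)
```

Same number of points, very different spread: the wide cloud's variance comes out much larger than the tight cloud's.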

The authors of the paper, Lao et al., wanted to look at the genetic structure of European populations. They looked at 309,790 single-nucleotide polymorphisms (SNPs) in 2,457 different people from 23 different places (which the authors then go on to refer to as "subpopulations"). They first looked for SNPs that would provide information on an individual's subpopulation and then performed a principal-component analysis on the data set.

What they found was that the first two principal components (PCs), which accounted for 31.6% and 17.3% of the total variation respectively, arranged the subpopulations in a way that echoes a geographic map of Europe.

Remember, the basic concept of principal components is that they are calculated (with some fancy matrix algebra) to explain as much of the variation in the data as possible. From what we know about genetic drift in populations (how the frequencies of gene variants change through populations, due to natural migration patterns and general changes over time), we would assume that most of the genetic variation we see would be due to the physical location of the subpopulations. Therefore, we would expect our principal components to capture that variation.
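For the curious, the "fancy matrix algebra" can be sketched in a few lines of Python with NumPy. This is one textbook way to compute principal components (an eigen-decomposition of the covariance matrix), run on made-up data. I am not claiming it is the exact procedure Lao et al. used, and the function name `pca` and the toy data set are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 300 "individuals" measured on 6 variables.
# Two hidden factors drive most of the variation, plus a little noise.
hidden = rng.normal(size=(300, 2)) * np.array([3.0, 1.5])
mixing = rng.normal(size=(2, 6))
data = hidden @ mixing + rng.normal(scale=0.3, size=(300, 6))

def pca(data):
    """Return explained-variance fractions, components, and per-row scores."""
    centered = data - data.mean(axis=0)            # move the cloud to the origin
    cov = np.cov(centered, rowvar=False)           # 6 x 6 covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]          # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    explained = eigenvalues / eigenvalues.sum()    # fraction of total variance per PC
    scores = centered @ eigenvectors               # each row's position along each PC
    return explained, eigenvectors, scores

explained, components, scores = pca(data)
print(np.round(explained, 3))
```

The printed fractions are the same kind of number as the 31.6% and 17.3% quoted above: each one says how much of the total spread a single principal component accounts for, and together they add up to 1.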

And they do! The first PC (Eigenvector 1, which is just the fancy math term for the principal component) separated the subpopulations by latitude, with the Finnish subpopulation at the top and the two Italian subpopulations at the bottom. The second PC (Eigenvector 2) tended to separate the subpopulations by longitude.
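You can watch the same effect appear in a simulation. The sketch below, in Python with NumPy, invents 300 people at random locations and 500 SNPs whose allele frequencies drift gently with latitude and longitude. Every number in it (sample sizes, the 0.15 gradient strength, the clipping range) is an assumption chosen for illustration, not a value from Lao et al., and the PCA here is done with a singular value decomposition rather than whatever software the authors used.

```python
import numpy as np

rng = np.random.default_rng(2)

n_people, n_snps = 300, 500

# Hypothetical sampling locations, rescaled to the range [-1, 1].
latitude = rng.uniform(-1, 1, size=n_people)
longitude = rng.uniform(-1, 1, size=n_people)

# Assumption: each SNP's allele frequency changes gently with geography.
lat_effect = rng.normal(size=n_snps)
lon_effect = rng.normal(size=n_snps)
freq = 0.5 + 0.15 * (np.outer(latitude, lat_effect) + np.outer(longitude, lon_effect))
freq = np.clip(freq, 0.05, 0.95)

# Genotype = number of copies (0, 1 or 2) of one allele at each SNP.
genotypes = rng.binomial(2, freq)

# PCA via singular value decomposition of the centered genotype matrix.
centered = genotypes - genotypes.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = u[:, :2] * s[:2]  # each person's coordinates on PC1 and PC2

def r_squared(target, predictors):
    """Fraction of the target's variance explained by a linear fit."""
    design = np.column_stack([predictors, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return 1 - residual.var() / target.var()

lat_r2 = r_squared(latitude, pcs)
lon_r2 = r_squared(longitude, pcs)
print("latitude explained by PC1+PC2: ", lat_r2)
print("longitude explained by PC1+PC2:", lon_r2)
```

In this toy setup the first two PCs together recover the made-up map, although a single PC need not line up with exactly north-south or east-west. The PCA never sees the coordinates; it only sees genotypes, and geography falls out because geography is what drives most of the variation.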

Lao et al. used statistical analysis to confirm what we know about human migration history in Europe. And just by using the principal components of this genetic data, we can draw conclusions about more than just the genetic differences of subpopulations.

For example, Finnish (one of the official languages of Finland) is a Uralic language. The Uralic languages are a branch of languages that are spoken mostly around the Baltic Sea. This is in contrast with Swedish, the language spoken in Finland's neighboring country, which is a Northern Germanic language. These are completely independent languages, hinting at the migration patterns and origins of these two subpopulations: modern day Swedish populations moved up from central Europe (via the Netherlands) while most of Finland's inhabitants most likely moved east to arrive there.

You can actually see this on the plot of the principal components. Swedish, Norwegian, and Danish are a lot closer together literally and linguistically. While looking at a map of Europe you might be tempted to lump Finnish into this group, the genetic differences between those populations provide some insight into the history (linguistic and otherwise) of these very diverse populations.

PCA is usually one of the first analyses that statistical geneticists use for analyzing high-dimensional genome-wide data (like SNP data from genome-wide association studies). It's the quickest way to identify and address population stratification, that is, the inherent genetic differences between populations of different ancestral origins. It's also used for a number of other applications, facial recognition programs being just one example. Again, PCA can be useful for anything that involves large amounts of data and trying to find a pattern in the variance. (To better understand "data": for a picture, each pixel would be a data point.) But still, as a student, my favorite example is this one, where a pattern created by the vectors of variance serves as a proof of principle for European history.

References:

Lao, O., T. T. Lu, et al. (2008). "Correlation between Genetic and Geographic Structure in Europe." Current Biology 18(16): 1241-1248.

And the Wikipedia articles on Uralic and Northern Germanic languages.

Rachael is a recent biology graduate who is biding her time as a technician in a human genetics lab and contemplating grad school. Science interests include (but are not limited to): new tech, genetics, pretty pictures, microRNAs, and responsible journalism.

2 Comments
June 07, 2012 | 06:27 PM
Posted By:  Rachael Bloom
Hey Tom, thanks for your comments.

The principal components analysis really only focuses on the variants that occur in the data. The graph shown only indicates the Finnish population as an outlier from the first two principal components. Perhaps the next principal components would have further distinguished other subpopulations. Furthermore, the extrapolation of linguistic origins was purely my own. It was a handy example that fit well with this particular data set.

As for the Russians, you're probably correct, which is most likely the reason the authors chose European populations as the focus of this analysis.

June 06, 2012 | 02:51 AM
Posted By:  Tom Stasinski
This analysis is at best misleading if not useless. The fact that it correctly pinpoints the Finns as being genetically and linguistically different is coincidental since it completely fails to separate the Hungarians (of the same Uralic origins as the Finns and Estonians) from their Indo-European neighbors. By the way, the homeland of the Uralic languages is most likely near the Ural mountains, so the Finns most likely moved west and not east.
The graph also fails to confirm the genetic affinity between the Celtic populations of Ireland and the UK on one side and coastal Iberia, from which they descended, on the other. Neither does it show any linguistic link between English and other Germanic languages, since inexplicably the Irish population stands in the way.
Finally, the Russian subpopulation (and many others) is not represented at all, possibly because its geographical location would throw a monkey wrench into the equations and ruin the neat picture.