Nature | Toolbox: Q&A

My digital toolbox: Ecologist Christie Bahlai talks data hygiene

Data wrangler recommends OpenRefine, DMPTool and Morpho.

Christine Bahlai

Christie Bahlai uses a variety of software packages to analyse data on invasive species that can damage crops.

In the first of a regular series, Christie Bahlai, an ecologist at Michigan State University in East Lansing, discusses the software and tools she finds most useful in her research.

How would you describe your research?

I’m interested in how insects that are important in agricultural systems respond to disturbances like invasive species and environmental change. For example, I recently completed a study examining the responses of several different native species of lady beetle to the arrival of an invasive lady beetle over a 24-year period in the US Midwest. I use collaboratively generated data in my own work, so a lot of my time is spent seeking out data sources and compiling them into usable formats — and doing analyses.

Which software, websites or tools do you use on a regular basis, and why?

The single greatest data management tool I’ve come across in the past year is OpenRefine. It is a fantastic web-based tool that streamlines the process of cleaning up messy data. And it is, to my knowledge, the only tool of its kind with a user-friendly graphical interface. When you’re dealing with large data sets, it’s inevitable that errors will creep in over time, especially if data entry is performed by multiple researchers. These tiny errors can lead to big problems down the road. The problem is that a lot of these errors are hard to see with the human eye, so they don’t get caught until much later in the scientific process — if they get caught at all. If you have a large data set or complex data, it is even harder to see and catch these errors.

For example, I first used OpenRefine on a very large, very messy data set. This particular data set had been collected over 9 years, and documented the abundance of just over 100 species of aphid throughout the study period, leading to a data set that contained almost 700,000 observations. Because species names are not easy to spell, a lot of small errors crept in, not to mention several taxonomic revisions where species names changed, merged or split over the course of the study. When I first loaded this data set into my statistics software, it told me we had over 800 'species': each typo was being counted as one species!
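The inflation described here is easy to reproduce on a small scale. The following is a minimal Python sketch, using made-up records (the list and its misspellings are illustrative assumptions, not taken from the data set discussed in the interview), of how every variant spelling registers as a separate 'species' when names are compared as raw text:

```python
# Hypothetical records for illustration only: three real aphid species,
# with a few entry errors of the kind described in the interview.
records = [
    "Aphis glycines",
    "Aphis glycines ",       # trailing space
    "aphis glycines",        # case difference
    "Aphis glycnes",         # misspelling
    "Rhopalosiphum padi",
    "Rhopalosiphum maidis",
]


def count_distinct(names):
    """Count distinct text strings, the way naive software counts 'species'."""
    return len(set(names))


raw_species = count_distinct(records)
print(f"Distinct strings: {raw_species} (true number of species: 3)")
```

Any software that groups by exact string match would behave this way, which is why the errors tend to surface only when the counts look implausible.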

OpenRefine has a wide variety of applications, but the function I use most is called 'faceting', which allows you to quickly see all the different text strings that occur for each type of data in your database. In my data set, I could visually examine the list of species and replace any obvious typos, or use the 'cluster' function, where OpenRefine uses a variety of algorithms to locate similar text strings and suggests they be combined. It’s very efficient and has saved me hours of work that would have been spent manually cleaning data.
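For readers who want a feel for what these two functions do, here is a rough Python sketch of the underlying ideas. It is not OpenRefine's code: `facet`, `fingerprint` and `cluster` are hypothetical helper names, the sample values are invented, and the clustering shown is a simplified key-collision approach that only catches differences in case, spacing and punctuation. True misspellings would need a similarity-based method instead.

```python
from collections import Counter, defaultdict
import string


def facet(values):
    """List each distinct text string with its count, most frequent first."""
    return Counter(values).most_common()


def fingerprint(value):
    """Reduce a string to a normalized key so trivial variants collide."""
    cleaned = value.strip().lower()
    # Drop ASCII punctuation, then sort the unique whitespace-separated tokens.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group distinct strings that share a fingerprint; keep groups of 2+."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Invented sample column, for illustration only.
names = [
    "Aphis glycines",
    "aphis glycines",
    "Aphis glycines.",
    "Aphis  glycines",
    "Rhopalosiphum padi",
    "Aphis glycines",
]

facets = facet(names)
clusters = cluster(names)

for text, count in facets:
    print(f"{count:>3}  {text!r}")
for group in clusters:
    print("Suggested merge:", group)
```

As in the workflow described above, the suggestions are only candidates: a person still decides which spelling is correct and whether the grouped strings really refer to the same species.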

I found that OpenRefine was really intuitive. It guides you through the steps of importing your data, and its interface is clean and logical. The best part is that you can explore its functions without worrying about messing up your data, because it makes no changes to your original data file.

Which emerging tools do you have your eyes on?

I find it very exciting that people are taking a lot more interest in preserving their data for the future. I haven’t had a chance to use them yet, but I think DMPTool, a tool for helping scientists to develop data-management plans, and Morpho, a tool for helping scientists to develop informative metadata to accompany their data sets, will be big players in the coming years.

Would you recommend any websites, training courses or books for learning about scientific software?

I’m an instructor with Software Carpentry, a project of the Mozilla Science Lab, devoted to teaching basic scientific computing skills to scientists (see 'Boot camps teach scientists computing skills'). I’m also a member of a newer, related group, Data Carpentry. Our goal is to develop data-handling skills in scientists, from spreadsheet use to database management to basics of data visualization. Our lesson plans are available online. And I maintain a blog on data management, targeted at organismal biologists and ecologists, called Practical Data Management for Bug Counters.

Journal name: Nature
DOI: 10.1038/nature.2014.15896
