When they think about big data, most researchers probably imagine genomics, neuroscience or particle physics. Kelly Robinson’s data challenge involves plankton.
“A lot of things that we enjoy seafood-wise — from fish to oysters to mussels to shrimp — almost everything starts their lives as plankton,” says Robinson, who studies marine ecosystems at the University of Louisiana at Lafayette. In photographs, the plankton look like floating specks of dust, and her research involves quantifying and mapping their distribution and predator–prey interactions. The problem is, she must do so in millions upon millions of images.
Robinson collects data by towing a remote-camera platform called ISIIS — the In Situ Ichthyoplankton Imaging System — behind a boat. ISIIS captures about 80 photos per second, or 288,000 images (660 gigabytes) per hour. For one project in the Straits of Florida, carried out when Robinson was a postdoc, she generated 340 million pictures; a colleague working in the Gulf of Mexico generated billions.
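Those quoted rates imply a per-image size that the article doesn't state directly. A quick back-of-the-envelope check (the megabytes-per-image figure below is derived, not reported):

```python
# Sanity check on the ISIIS data rates quoted above. The 80 frames/s and
# 660 GB/h figures come from the article; the per-image size is derived.

FRAMES_PER_SECOND = 80
images_per_hour = FRAMES_PER_SECOND * 3600                   # 288,000 images/h
gigabytes_per_hour = 660
mb_per_image = gigabytes_per_hour * 1024 / images_per_hour   # ~2.3 MB per image

print(images_per_hour)          # 288000
print(round(mb_per_image, 1))   # 2.3
```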
“You start to learn about things that you never thought you would learn,” Robinson says, “like the number of files that you can store on an individual computer. It’s 30 million, by the way, on your regular PC.” On her most recent cruise, Robinson sailed with 52 two-terabyte hard drives, which a student had to monitor and replace as they filled up. Someone must then transport the drives to the university, convert the files to Linux formatting and upload them to a server — a process that takes 24 hours per drive.
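Taken together, the cruise figures give a sense of scale. A rough sketch, in which the towing-hours and serial-upload estimates are derived from the article's numbers rather than reported in it:

```python
# Rough arithmetic on the cruise storage figures in the article; the
# hours-of-towing and total-upload values are derived estimates, assuming
# the drives are filled at the full 660 GB/h rate and uploaded one at a time.

drives = 52
tb_per_drive = 2
capacity_gb = drives * tb_per_drive * 1024   # 106,496 GB total on board
hours_of_towing = capacity_gb / 660          # ~161 h of continuous imaging
upload_days = drives * 24 / 24               # at 24 h per drive: 52 days, serially

print(round(hours_of_towing))   # 161
print(upload_days)              # 52.0
```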
The team uses machine-learning software to automatically pick out and identify objects in the images. But the algorithms must be taught what to look for — this is a starfish, that is a prawn. Such organisms are relatively rare in the water, so finding pictures for the training set takes time. Over two months, Robinson and her team manually sorted through 2 million images to find enough to feed the algorithm. “It’s a little mind-numbing, but if you’re under the gun you can do it,” she says.
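The article doesn't say how rare the target organisms are in the image stream. Assuming, purely for illustration, that roughly 1 frame in 1,000 contains something usable, a toy simulation shows why screening millions of images yields only a few thousand training examples:

```python
import random

# Toy illustration of why building the training set is slow: if target
# organisms appear in only a small fraction of frames (the 0.1% rate here
# is an assumption, not a figure from the article), most screened images
# turn out to be empty water.

random.seed(0)
TARGET_RATE = 0.001   # assumed fraction of frames containing a usable organism

def screen(n_images):
    """Simulate manually screening n_images; return the count of usable frames."""
    return sum(random.random() < TARGET_RATE for _ in range(n_images))

hits = screen(2_000_000)
print(hits)   # roughly 2,000 usable frames out of 2 million screened
```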
Naturally, the team is looking to optimize the process. Working with colleagues at Oregon State University in Corvallis, where she was a postdoc, Robinson is testing whether she could accelerate her work by processing the images on multiple graphics processing units (GPUs) running in parallel. She is also looking into cloud computing as an alternative to Earth-bound clusters.
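The article names neither the GPUs nor the software stack involved. This CPU-based Python sketch simply illustrates the same embarrassingly parallel pattern: each worker handles its own shard of images independently, with classify() as a hypothetical stand-in for the real detector.

```python
from multiprocessing import Pool

def classify(image_id):
    """Hypothetical stand-in for running the organism detector on one image.
    The real pipeline would load and score an actual frame here."""
    return (image_id, "plankton" if image_id % 2 else "empty")

if __name__ == "__main__":
    # Image classification is independent per frame, so it parallelizes
    # trivially across workers (or, in the team's case, across GPUs).
    with Pool(4) as pool:
        results = pool.map(classify, range(8))
    print(len(results))   # 8
```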
But infrastructure goes only so far; what the team really needs, she says, is more people to crunch the numbers. Unfortunately, data scientists are in high demand, and industry jobs are lucrative. “We have a lot of turnover,” she says.