September 07, 2011 | By: Nick Morris

SOLO11: Science online London 2011 (#solo11) - day 2, panel 1 - "Dealing with Data"

The second and final day of Science online London 2011 kicked off with a panel session on "Dealing with Data". The conference information described the session as:

"This panel will look at how data is transforming and affecting scientific research and communication, from data visualisation in the media to data-intensive science in various fields. We'll hear from leading scientists, media organisations such as The Guardian, and experts in infrastructure about their take on working with data in the digital age and how the way we interact with information is changing.

Moderated by Kaitlin Thaney - Digital Science

Panelists:

  • Tim Hubbard - Wellcome Trust Sanger Institute
  • Alastair Dant - Lead Interactive Technologist, The Guardian
  • Kristi Holmes - VIVO
  • Kaitlin Thaney - Digital Science"

How are people interacting with data?

This was the question that really kicked off the session.

Tim highlighted three key areas he was concerned about with data at the Wellcome Trust Sanger Institute: Data growth, data sharing, and data security.

Growth - The Sanger currently holds 12 petabytes of data*, and as genome sequencing gets faster and cheaper (the cost is dropping by an order of magnitude every 2 years) the Sanger is seeing the data it holds double every 6 months. Hence there is now a serious need for data compression, so that they maximise the storage space they have available. The Sanger is now spending almost as much on data storage and handling as on the reagents to do the experiments!
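As a back-of-the-envelope check on those figures, here is a short Python sketch projecting what 'doubling every 6 months' means from a 12 PB starting point (the numbers are the ones quoted in the session, not an official forecast):

```python
# Project data growth under the figures quoted in the session:
# 12 PB today, doubling every 6 months.

start_pb = 12          # current holdings, in petabytes
doubling_months = 6    # reported doubling period

for years in range(6):
    months = years * 12
    size_pb = start_pb * 2 ** (months / doubling_months)
    print(f"after {years} year(s): {size_pb:,.0f} PB")
```

Even over five years that compounds to more than 12,000 PB, which makes the emphasis on compression easy to understand.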

Sharing - The large amounts of data being produced also need annotating with the appropriate metadata to make sharing meaningful. At the Sanger you now can't start a project until you have an accession number for the data, and you can't get an accession number without completing the necessary metadata for the experiment. This seems like a great way to get data properly annotated so that it can be easily shared between researchers.
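To illustrate the gatekeeping rule described above - no accession number without complete metadata - here is a minimal Python sketch; the required field names and the accession format are invented for illustration:

```python
import itertools

# Hypothetical required metadata fields; the real list would be
# defined by the repository, not by this sketch.
REQUIRED_FIELDS = {"organism", "platform", "library_type", "submitter"}

_counter = itertools.count(1)

def issue_accession(metadata: dict) -> str:
    """Issue an accession number only if all required metadata is present."""
    missing = REQUIRED_FIELDS - {k for k, v in metadata.items() if v}
    if missing:
        raise ValueError(f"no accession number - missing metadata: {sorted(missing)}")
    return f"ACC{next(_counter):06d}"

# A project can only start once this call succeeds:
print(issue_accession({"organism": "H. sapiens", "platform": "Illumina",
                       "library_type": "paired-end", "submitter": "nm"}))
```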

Security - This is becoming more relevant as the centre collects more human genome sequences. So, how do you let researchers have access while protecting the privacy of the patient? Maybe the answer, as Tim suggested, is to upload the query to the Sanger so that researchers never see, or have access to, the raw data. The research can then be done while the privacy of the individual is protected.
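A minimal sketch of that 'send the query to the data' pattern, with made-up data and names - the researcher submits a query and receives only a derived, aggregate answer, never the sequences themselves:

```python
# Stand-in for the sequence data held privately at the centre.
PRIVATE_GENOMES = {
    "patient_1": "ACGTACGTGA",
    "patient_2": "ACGTTCGTGA",
    "patient_3": "ACGTACGAGA",
}

def run_query(motif: str) -> dict:
    """Run the researcher's query server-side; return only aggregates."""
    hits = sum(motif in seq for seq in PRIVATE_GENOMES.values())
    return {"motif": motif, "genomes_matching": hits,
            "genomes_total": len(PRIVATE_GENOMES)}

print(run_query("CGT"))   # the researcher sees counts, not raw data
```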

Tim also noted an interesting consequence of large data projects being online: funding bodies can track the data in 'real time' and check whether a project is on track - this happened, for example, during the human genome project. Researchers can no longer hide what they are doing (or what they are not doing!), and research can be followed almost in real time.

Alastair gave a great presentation titled "Humble Pie - charting the stories in big data". It kicked off with a quote from Edward Tufte: "Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space". That was the general thrust of the presentation: let the data tell the story through graphics and interactive graphics. You can either give people a passive experience, where they just read the story, or you can also allow the reader to drill down and explore the data behind the story, and possibly come to their own conclusions. However, Alastair did admit to 'smoothing' some data so that it told a slightly more obvious story!
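For the curious, 'smoothing' often just means something like a moving average. This generic Python sketch shows the idea (it is not The Guardian's actual pipeline):

```python
def moving_average(values, window=3):
    """Smooth a series by averaging each point with its neighbours."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

noisy = [3, 9, 4, 10, 5, 11, 6]
print(moving_average(noisy))   # the jagged series becomes a gentler trend
```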

Kristi talked about the VIVO project, which is "enabling national networking for scientists" - though she said it should really be 'international' rather than 'national', and that they should add the tag line "and facilitating scholarly discovery". VIVO is an open-source, semantic-web project that is building a database of scientists by data-mining a number of sources. The idea is to spot connections between researchers that may not be immediately obvious. For more information see http://sourceforge.net/p/vivo/home/VIVO/ and http://vivo.library.cornell.edu/.
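The core idea - mining sources to surface non-obvious links between researchers - can be sketched as building a co-authorship graph. The records below are invented purely for illustration:

```python
from collections import defaultdict
from itertools import combinations

papers = [
    {"authors": ["Smith", "Jones"]},
    {"authors": ["Jones", "Patel", "Li"]},
    {"authors": ["Smith", "Li"]},
]

# Link every pair of co-authors on each paper.
graph = defaultdict(set)
for paper in papers:
    for a, b in combinations(paper["authors"], 2):
        graph[a].add(b)
        graph[b].add(a)

# Smith and Patel never co-authored, but the graph reveals shared contacts:
shared = graph["Smith"] & graph["Patel"]
print(f"Smith and Patel are both connected to: {sorted(shared)}")
```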

The question section also provided some interesting points.

Peter Murray-Rust asked how VIVO managed to get the data on who interacts with whom. Kristi replied that the project is very open and will take data from anywhere: some institutions were pulling data from PubMed, and others were using Web of Science, Scopus, etc. via their APIs. Peter expressed surprise that Web of Science and Scopus allowed such access, and Kristi admitted that they were still looking into things and that it was all at an early stage.
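PubMed, at least, is openly queryable via the NCBI E-utilities API. Here is a minimal sketch of pulling publication IDs for an author; error handling and NCBI's rate-limiting requirements are omitted for brevity:

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_ids(author: str, max_results: int = 5) -> list:
    """Fetch up to max_results PubMed IDs for the given author."""
    query = urlencode({"db": "pubmed", "term": f"{author}[Author]",
                       "retmax": max_results, "retmode": "json"})
    with urlopen(f"{BASE}?{query}") as resp:
        return json.load(resp)["esearchresult"]["idlist"]

print(pubmed_ids("Hubbard T"))
```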

The questions also touched on the subject of data storage.

Rosie Redfield (who was on the first panel of SOLO11, which discussed the arsenic story) asked if genome data would ever become cheaper to generate than to store. Tim responded that he was not sure, but doubted that would ever be the case, particularly when you consider the need for rapid retrieval: re-generating the data would introduce an unacceptable time lag. There would also be the risk of losing historical data, and with it the ability to follow changes over an extended period of time.

Phil Lord (Newcastle) asked about the lossy compression that Tim had mentioned, and what data was going to be thrown away. Tim replied that the idea was to make the stored data a thing that points at the real data rather than a copy of it; that is, there is no need to store all of the data. For example, most humans have huge stretches of DNA with exactly the same sequence as other humans, so you only need to store the reference sequence once and then 'point' at it as the data. This is essentially the same idea as normalising a database.
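Tim's answer is essentially reference-based encoding: keep one copy of the reference and store each genome only as its differences from it. A toy Python sketch of the pointer idea (this version is lossless; in real pipelines the 'lossy' part typically concerns discarding things like per-base quality scores):

```python
REFERENCE = "ACGTACGTACGT"   # stored once, shared by everyone

def compress(sequence: str) -> list:
    """Encode a same-length sequence as (position, base) differences."""
    return [(i, b) for i, (r, b) in enumerate(zip(REFERENCE, sequence)) if r != b]

def decompress(diffs: list) -> str:
    """Rebuild the full sequence from the reference plus the differences."""
    seq = list(REFERENCE)
    for pos, base in diffs:
        seq[pos] = base
    return "".join(seq)

sample = "ACGTTCGTACGA"
diffs = compress(sample)
print(diffs)                          # [(4, 'T'), (11, 'A')]
assert decompress(diffs) == sample    # round-trips exactly
```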

The final question of the session was 'Why are data visualisations in science so boring?'. To this Tim replied "because they tell the truth?", and Alastair thought that answer might in fact be correct! Does that mean we should always view interesting, 'non-boring' representations of data as possibly not the truth?

Summary

Data scientists/wranglers are becoming more important, as we need people who can understand and interpret the data.

(* 1 byte is typically made up of 8 bits, where a bit is a 1 or a 0. 1 kB is 1,000 bytes, or 1 × 10³ bytes. 1 MB is 10⁶ bytes, and 1 GB is 10⁹ bytes. 1 petabyte (PB) is 10¹⁵ bytes. The laptop I am currently using has a 500 GB hard drive, or 500 × 10⁹ bytes. Therefore, to store the same amount of data as the Sanger I would need (12 × 10¹⁵) / (0.5 × 10¹²) = 24,000 laptops.)
