Julie Gould:
Hello, I’m Julie Gould and this is Working Scientist, a Nature Careers podcast. In this fourth episode of our technology series, I really want to look at data management and reproducibility, as the two are so intricately linked. And when I went to Dublin in March earlier this year, I met with Brian MacNamee. He is a lecturer in computer science at University College Dublin, and is the Director for the SFI Centre for Research Training in Machine Learning. Now, we actually heard from Brian earlier in the series when he talked about how important it is to learn to code. Now, he’s actually a lecturer on a minor degree for computer science at UCD, and data management is a core topic of this course. And it’s not hard to see why because being good at managing your data is such a key skill to have, especially in modern scientific projects where the datasets can be so large that they often break Excel. Now, he said machine learning is actually becoming a very useful tool for scientists in data collection and management, and he gave an example of the large sky-scanning telescopes used by astrophysicists.
Brian MacNamee:
And one of the problems with the large sky-scanning telescope projects is – although it’s brilliant – they’re generating colossal amounts of data, so scanning large portions of the sky. There’s new ones coming up all the time and they’re generating more and more data. For scientists to be able to use that data, that data needs to be categorised. So, as an astrophysicist, what they want to do is be able to go to the large collections of data that have been generated by these telescopes and look for the bits that they’re interested in. Now, categorising the volume of data that’s there isn’t possible to do by hand, so one of the interesting things that people are doing is putting machine learning systems into those pipelines. The job of the machine learning system is basically to categorise and prepare the data for the people to do their science further down the line, and I think that’s a really interesting way where machine learning is being used and it’s not where machine learning is the end result in its own right, it’s where machine learning is sitting in part of a pipeline to allow scientists basically to use the data that we can gather now in a much better way. So, now the scientist can say I want all the examples our telescope has captured of a particular kind of galaxy with particular degrees of redshift, and they’ll get suddenly a nice trawled-through database and have a thousand examples of exactly what you’re looking for, and they don’t have to look through the hundred thousand examples of all the things that they’re not looking for. I think it’s a really interesting use of machine learning that we’re seeing as basically a tool for scientists to use as part of their pipeline and part of their process.
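The pipeline Brian describes, where a classifier labels the data first and scientists query the labelled catalogue afterwards, can be sketched in a few lines of Python. This is a minimal illustration only, not any telescope project's actual code: the `Observation` fields, the made-up `concentration` feature and the threshold rule in `classify` are all assumptions standing in for real image features and a trained model.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    object_id: str
    concentration: float  # made-up image feature, for illustration only
    redshift: float
    label: str = "unlabelled"


def classify(obs):
    """Placeholder for a trained model: a toy threshold rule."""
    return "elliptical" if obs.concentration > 0.5 else "spiral"


def categorise(observations):
    """Pipeline stage: attach a predicted label to every observation."""
    for obs in observations:
        obs.label = classify(obs)
    return observations


def select(catalogue, label, z_min, z_max):
    """Downstream query: one category within a redshift range."""
    return [o for o in catalogue
            if o.label == label and z_min <= o.redshift <= z_max]


raw = [
    Observation("obj-001", 0.2, 0.12),
    Observation("obj-002", 0.8, 0.45),
    Observation("obj-003", 0.3, 0.80),
    Observation("obj-004", 0.1, 0.30),
]
catalogue = categorise(raw)
matches = select(catalogue, "spiral", 0.1, 0.5)
print([o.object_id for o in matches])  # ['obj-001', 'obj-004']
```

The point of the split is that the classifier is not the end result: it only prepares the catalogue, and the scientific question is asked in `select`.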
Julie Gould:
Why is it important to learn to manage data well as a researcher?
Brian MacNamee:
I think there’s two ways to think about that. One is that data is massively valuable. So, once we do anything to collect data, those data resources, in their own right, are really, really interesting. So, by managing those data resources properly, we can reuse them again and again. We can also then share them. A good dataset is a great citation attractor – if you put a really interesting dataset into the world, that’s a great way to build your profile as a researcher and to build your profile within a domain. So, there’s real value in those datasets, and not just in the analysis and the science that we do with those datasets, but in those datasets in their own right, because the dataset that you collect can spread out into the world and be used in ways that maybe you’d never even imagined, so your dataset can be the seed for that. So, it’s really important, from that point of view, that you’re very careful about how you manage the collection of that data, how you manage the integrity of that data and the cleanliness of that data, that you don’t allow errors to creep into it or you don’t mix up time periods in those datasets. So, being able to do that is really, really important. The other thing that I think is key that we train our researchers to do is to think about data from a reproducibility point of view and manage the data properly from a reproducibility point of view. So, more and more in data science and computer science, having papers with accompanying data and accompanying code is becoming crucial, and there’s certain venues where that’s mandatory but more and more venues are encouraging that. So, where you can say, well, here’s my paper and the findings that I have in this paper are based on this dataset and here’s the code for the analysis that I ran on that dataset and that drives everything that you see in this paper.
That’s brilliant because it means that that uncertainty about reproducibility disappears to a large extent, so other researchers are then able to run exactly your experiments, they’re able to understand exactly what you’ve done, then they’re able to tweak and change that and see what happens and build on top of that research with that dataset. So, researchers being able to, I think, one, see data as valuable and then have the tools, techniques and the skills to manage that data so that they can make it available to the world, they can furnish it properly, they can clean it properly, they can make it clear to other researchers what they’ve done in order to get the data from maybe the raw state in which it was collected to the state where they wanted to start analysing it, and that really helps them. It also makes their research much more transparent, brings much more trust to that research and makes, I think, it much easier for other researchers to build on top of it so those other researchers are not reading a paper and trying to figure out where are the gremlins inside this dataset – they can open it up and find them themselves.
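One way to make clear what was done to get data from its raw state to the state in which it was analysed is to run every cleaning step through code and log what each step did. The Python sketch below is an assumption-laden illustration, not a method Brian describes: the step functions, the `sample` and `value` field names and the toy data are all hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """SHA-256 of the data, so others can confirm they hold the same version."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def drop_missing(rows):
    """Remove records with no measured value."""
    return [r for r in rows if r["value"] is not None]


def drop_duplicates(rows):
    """Keep the first occurrence of each (sample, value) pair."""
    seen, unique = set(), []
    for r in rows:
        key = (r["sample"], r["value"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def clean(rows, steps):
    """Apply each step in order and log row counts and hashes along the way."""
    log = [{"step": "raw", "rows": len(rows), "sha256": fingerprint(rows)}]
    for step in steps:
        rows = step(rows)
        log.append({"step": step.__name__, "rows": len(rows),
                    "sha256": fingerprint(rows)})
    return rows, log


raw = [
    {"sample": "A", "value": 1.5},
    {"sample": "B", "value": None},
    {"sample": "A", "value": 1.5},
    {"sample": "C", "value": 2.0},
]
cleaned, log = clean(raw, [drop_missing, drop_duplicates])
for entry in log:
    print(entry["step"], entry["rows"], entry["sha256"][:12])
```

Publishing the log alongside the raw and cleaned files lets another researcher rerun the same steps and check that the hashes match.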
Julie Gould:
Now, you’ve just heard Brian talk about the importance of reproducibility there, and when he mentioned this, I thought back to an article in Nature in 2016 called ‘Scientists lift the lid on reproducibility’. The piece highlights the results from a survey run by Nature about how scientists feel about reproducibility in science, and the results weren’t positive. To quote: ‘More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments’, wrote Monya Baker in her opening sentence. Jake Schofield is the CEO and founder of Labstep. Now, this is a new tool for scientists that allows them to record their work accurately, in real-time and then share it when they’re ready to. Having an accurate record strikes me as one way to reduce the level of the reproducibility crisis, so I asked Jake what some of the pain points were that he had experienced with traditional lab notebooks, whether electronic or paper, and how this led to Labstep and how he’s hoping it will impact reproducibility.
Jake Schofield:
They all had the same problem which was that if you’re at that lab bench 8am through till 10pm doing a lot of labour-intensive experimental research, writing up all of your results, the last thing you’re going to want to do is spend hours typing up a diary of what you were doing. And so, we kind of tried to take this very different approach which was can we pull together, okay, well this is the protocol that I’m executing, this is the sample that I’m using, this is how I’m deviating, this is the data that’s directly from that microscope or that device in the lab, and then this is the script that I’m writing in Python, the parameters, and then here is the kind of output file, and have all of that automatically captured so that you save time.
Julie Gould:
Let’s use an example: we have someone who is working at a lab bench, for example in a genomics lab, and they are familiar with electronic lab books, they’ve also got their paper notebooks but they’re finding it very frustrating to keep track of everything they’re doing. So, how would something like Labstep make it much easier for them to keep track of everything and to automate a lot of the data collection and, I guess, the diary or notetaking that a typical scientist needs to do.
Jake Schofield:
It will deliver a fundamentally different experience. So, they would go into the fume hood, they’d start their experiment but they’d be able to pull that up on a tablet or mobile device, they’d be able to walk through step by step that kind of interactive procedure, that protocol, that SOP. At any point where they’re deviating, they can click and change that sample and it will tie to their inventory. They can take photos of experimental setups. We will automatically capture if you leave that in the water bath for an additional five minutes because the timer has overrun. And then we pair that progress data as to what version of what experimental procedure this user at this timestamp has carried out and we tie that to the datasets that are being produced, which means that you’ll have this kind of record created for you automatically where Jake did this version of this experimental procedure. Here is how he deviated, here’s the dataset produced from that sequencing device and then from there I can share that record with my PI or I can go and carry out my computational analysis on that DNA sequencing dataset. I can go and do some bioinformatics and any time I’m running that bioinformatics script, that will be synced and mirrored in Labstep. So, I can still use my same analytical environment – I don’t have to log in to the Labstep platform – but every time I execute that script it will then mirror it and create this record in this environment in Labstep. So, it will say Jake used this dataset, and here were the parameters, here is the script, here’s the input file, here’s the output file, so you end up with this great step-by-step audit trail of everything that I’ve done.
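The kind of record Jake describes for a script run (who ran it, with which parameters, on which input file, producing which output file) can be sketched in plain Python. This does not use or represent Labstep's own API; `run_and_record`, the JSON Lines trail file and the toy `filter_short_reads` analysis are all hypothetical names chosen for illustration.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Hash a file so the exact input and output can be identified later."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_and_record(user, script, func, input_path, output_path, params, trail_path):
    """Run one analysis step and append a record of it to a JSON Lines trail."""
    func(input_path, output_path, **params)
    record = {
        "user": user,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "script": script,
        "params": params,
        "input": {"file": Path(input_path).name, "sha256": sha256_of(input_path)},
        "output": {"file": Path(output_path).name, "sha256": sha256_of(output_path)},
    }
    with open(trail_path, "a", encoding="utf-8") as trail:
        trail.write(json.dumps(record) + "\n")
    return record


def filter_short_reads(input_path, output_path, min_length):
    """Toy analysis: keep only sequences at least min_length long."""
    reads = Path(input_path).read_text().split()
    kept = [r for r in reads if len(r) >= min_length]
    Path(output_path).write_text("\n".join(kept))


workdir = Path(tempfile.mkdtemp())
(workdir / "reads.txt").write_text("ACGT\nAC\nACGTACGT\n")
record = run_and_record(
    user="jake",
    script="filter_short_reads",
    func=filter_short_reads,
    input_path=workdir / "reads.txt",
    output_path=workdir / "filtered.txt",
    params={"min_length": 4},
    trail_path=workdir / "audit.jsonl",
)
print(json.dumps(record, indent=2))
```

Because each run appends one line to the trail, the file accumulates into a step-by-step history without the researcher writing anything up by hand.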
Julie Gould:
As I’m sure you’re aware, there’s what has been coined the reproducibility crisis, and in 2016, Nature published an article which shed light on this crisis. But how can Labstep make sure that these sorts of things don’t happen and that reproducibility becomes something that is no longer in crisis?
Jake Schofield:
Being able to very easily have this library of version-controlled protocols, so that as you’re tweaking and changing and optimising those SOPs, that’s all tracked like GitHub would with code. It’s then, when you’re at the lab bench, making it as easy as possible for you to simply mark okay, well I used that batch of that antibody or I changed the volume here or I left it in the water bath for longer, and making that streamlined process of being able, with a single click, to mark those points of deviations, those are the things that at the end of a long day you don’t necessarily write up. And tying those experimental deviations to the samples that are being used, to the inventory, to the specific batch numbers, all of that in a packaged-up way means that it’s going to fundamentally change the way in which people can reproduce findings. It’s going to fundamentally change the way in which scientists go about debugging and working out where things went wrong.
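The two ideas in Jake's answer, protocols tracked version by version and deviations tied to a specific run and batch, can be sketched as follows. This is an illustrative Python sketch rather than how Labstep is built: `ProtocolLibrary`, `Run`, the protocol steps and the batch number `LOT-0042` are all invented for the example.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class ProtocolLibrary:
    """Stores every version of a protocol; each version is a list of steps."""
    versions: list = field(default_factory=list)

    def commit(self, steps):
        self.versions.append(list(steps))
        return len(self.versions)  # 1-based version number

    def diff(self, old, new):
        """Show what changed between two versions, as a unified diff."""
        return list(difflib.unified_diff(
            self.versions[old - 1], self.versions[new - 1],
            fromfile=f"v{old}", tofile=f"v{new}", lineterm="",
        ))


@dataclass
class Run:
    """One execution of a given protocol version, plus any deviations."""
    protocol_version: int
    deviations: list = field(default_factory=list)

    def deviate(self, step_index, note, batch=None):
        self.deviations.append({"step": step_index, "note": note, "batch": batch})


library = ProtocolLibrary()
v1 = library.commit(["Add 10 uL antibody", "Incubate in water bath for 30 min"])
v2 = library.commit(["Add 12 uL antibody", "Incubate in water bath for 30 min"])

run = Run(protocol_version=v2)
run.deviate(1, "Left in water bath for 35 min")
run.deviate(0, "Used a different antibody batch", batch="LOT-0042")

print("\n".join(library.diff(v1, v2)))
print(run)
```

Keeping the run separate from the protocol means a failed experiment can be checked against both the exact version that was followed and the points where the bench work departed from it.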
Julie Gould:
Jake, you went into setting up Labstep with zero experience of setting up a business. How did you do it with having zero history in that sort of space?
Jake Schofield:
We started and we had no experience, and that’s a big part of why this whole journey has been so fun, that every chapter, every step that we’ve taken, it’s been a huge learning experience. So, we started with let’s just try and build something that just meets the requirements and then once we had a basic prototype, it was okay, well, now let’s go and try and talk some professors at Oxford into using this. And we managed to get some people from the DPAG – the Department for Physiology, Anatomy and Genetics – to start using it as a teaching tool, and then that meant that we had groups of 50 undergrad students all running the same version of the same protocol all at once, but it was this great validation to see, okay, well actually we’ve got some traction here. Actually, there is something that we could do and it was this great testing bed where we suddenly started to see okay, well this is how we need to build and improve the product. It then came to this point where we’re probably going to need to go and raise investment, and that was something that was completely new to me. Starting out, you have no idea that there are even these kinds of stages of investment from pre-seed, to seed, to series A. There are different types of investors, from angel investors to VCs to high net worths, and how do you navigate that whole minefield. We’ve been very lucky in that we’ve had some very high-profile mentors and some people that have really guided us along the way.
Julie Gould:
One of the things that we are really keen to support at Nature Careers is this concept of mentors. You mentioned yourself that you had some very useful mentors. How did you find them and what were some of the best pieces of advice that they gave you?
Jake Schofield:
The mentors that we had, it was Seedcamp. It was then, when we got a little bit more success, we got on the Google Residency Program, and we got amazing advice from this Google network. The advice that they gave us was all very specific and all very task-orientated, and I think that that’s what’s kind of key. Sometimes you can get lost and daunted by the very big picture – okay, this is far too big a problem and it’s far too big for us to possibly solve – but actually breaking it down into what is the next key hurdle that we need to go and solve, what advice do we need to go and do that, and then just tackling that problem. I think, also, scientists have limited exposure to entrepreneurship. You think that the only pathway is, if you want to remain in science, okay, well I need to go and do a PhD, I need to go and do a postdoc, I then want to go and fight for those very limited spaces in academia. Actually, I kind of wish that there was a lot more exposure and a lot more opportunity for people to go around and try some of these entrepreneurial pursuits because you’ll realise that actually, that scientific and analytical problem-based thinking has so many parallels in the world of startups. This is why we as a company have been very successful – is that we haven’t necessarily been a whole host of experienced entrepreneurs, but we’re a whole host of very passionate scientists and we apply a lot of the learning and skills to actually how we go about carrying out this process of building a company and scaling a business.
Julie Gould:
Thank you to Jake Schofield and Brian Mac Namee. Now, in the penultimate episode of this series, we’re going to look at another piece of software that has changed the way scientists are working. I speak to Ben Britton from Imperial College London about how Slack has changed the way he manages his team.
Ben Britton:
Monday morning, there is a bot, so just an automated system called Howdy that you can enable for your Slack team, and Howdy will work on my behalf and ask three questions to my research group: what are you planning to do this week, is anything holding up your work and would you like a face-to-face meeting. And it’s a very nice way for me to triage from my perspective of management who do I really need to see.
Julie Gould:
Thanks for listening. I’m Julie Gould.