Jessica Hedge tells Julie Gould how she learned to code as a PhD student, and about the freedom and flexibility it provides to manage large datasets.
"I never saw myself as a coder and it took me a long time to realise I had to pick up the skills myself," she tells Julie Gould in the second episode of this six-part series about technology and scientific careers. "A colleague was using Python and R and I saw the potential." What is her advice to other early career researchers who are keen to develop coding expertise?
Also, Brian MacNamee, an assistant professor in the school of computer science at University College Dublin, talks about the college's data science course and how it can benefit both humanities and science students.
Finally, Nature technology editor Jeffrey Perkel describes how coding can help with computational reproducibility.
Jess Hedge, Jeff Perkel, and Brian MacNamee join Julie Gould to discuss coding and computational reproducibility
Hello, I’m Julie Gould and this is Working Scientist. This is the second part of our series all about technology, and we’re exploring how the technologies are affecting scientists, research and universities.
In the first episode, we heard all about artificial intelligence from Mark Dodgson and Lee Cronin and how it may or may not change the way that universities operate.
In this episode, we’re going to go back to basics and look at why a skill like computer coding is so vital for any researcher in academia, ranging from the biological sciences to agricultural studies and even to Old English literature.
So, I asked Jeff Perkel, the Nature technology editor, why he thought that being able to code was a useful skill for a scientist.
So much research right now involves manipulating data, and it’s possible to open up your data files in Excel and work with them in a programme that has like a point-and-click interface and manipulate them and there’s nothing wrong with doing that.
But the trick is that if you go through a whole bunch of steps and figure out a way to do something and then you really like it and then you want to do it again because you do the same experiments a month later and you say I want to make that same graph again, that’s hard.
And one of the things that to me makes coding so powerful is that instead of using that point-and-click interface, if you figure out how to do those manipulations by basically writing programming scripts to do them for you, then all you have to do to run them again is change the name of the file that you’re working on. And we had a feature last year on the problem of computational reproducibility, which this really sort of feeds into.
This notion that you can do something once and you can make it work, but if you’re going to be doing it again and again and again, you’re just making life more difficult for yourself if you don’t figure out a way to sort of automate that process so all you have to do is say run this script and it’s going to take your data and do the same thing to it.
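Editor's note: the point about rerunning an analysis by changing only a file name can be illustrated with a minimal Python sketch. It is not taken from the episode. The file name `experiment_2024_03.csv` and the column name `measurement` are hypothetical, and the summary statistics stand in for whatever analysis a researcher would actually run.

```python
import csv
import statistics
from pathlib import Path


def summarise(csv_path, column):
    """Return the count, mean and sample standard deviation of one numeric column."""
    with open(csv_path, newline="") as handle:
        values = [float(row[column]) for row in csv.DictReader(handle)]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        # stdev needs at least two values, so fall back to 0.0 for a single row
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical file and column names. To repeat the analysis on next
    # month's data, this is the only line that has to change.
    data_file = Path("experiment_2024_03.csv")
    if data_file.exists():
        print(summarise(data_file, "measurement"))
```

Because every step lives in the script, the same manipulation is applied each time, which is the link to computational reproducibility described above.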
Brian MacNamee is a lecturer in computer science at University College Dublin and he’s also the Director of the SFI Centre for Research Training in Machine Learning. He believes that because technologies like artificial intelligence and machine learning are becoming more prominent, it’s really important for researchers from all disciplines to learn to code.
The challenge for us again is to think about well how do we redesign and change our training programmes so that we’re equipping people to do the kind of science that they’re going to do in the future. So, to give one example of that, in University College Dublin, we introduced a programme called the Minor in Data Science a couple of years ago.
The idea behind that is the way our structures work is students do a particular major degree but now alongside that they can also do this minor in data science, so they can have a qualification in data science.
Our aim is to push that out all across the university because one of the interesting things around data science, machine learning and data analytics is we can apply these techniques to basically any area ranging from the humanities and study of Old English novels right up to chemistry and astrophysics.
And by having this qualification on the side, one of the places where we’ve seen particular enthusiasm for it is in the wet sciences where the people basically running those degree programmes are recognising that their students are not spending or are not likely to end up spending their time doing, let’s say, lab bench work of the pipetting of solutions into beakers, but the more interesting work they’re doing now is on the far side of that process where they’re looking at what does the data from a big job like that tell us.
So, what those people need are essentially data science skills so how do they collect data, how do they manage and manipulate that data, how do they then analyse that data to allow them to come to well-grounded conclusions and drive their science through those well-grounded conclusions. So, one of the challenges I think for the universities is to basically look across all of our science programmes and say where is data having an impact here and therefore what are the skills that people need now to be a part of that, to maximise that impact and to make sure that we’re training them with those skills.
Why is it so important for people who are interested in working with data, why do they need to learn to code?
Yeah, so that’s a good question – it’s one that we struggle with a little bit and we kind of flip flop back and forth on. I think it’s a really good idea for people to learn to code and I think it’s a really good idea because it gives them the biggest freedom and flexibility. So, now they don’t need anyone else in order to do whatever it is that they want to do, so it puts them right in touch with their data and it puts them right in touch with an enormous range of packages that will allow them to do different kinds of data analysis, work with different kinds of datasets and produce different kinds of outputs from the analysis of those datasets.
What’s interesting is that we’re at the point now where people are saying well, what can we start to do with that data and the sort of, let’s say, easier-to-use tools than code haven’t quite caught up with that. A lot of those easy-to-use tools are commercial tools they have spent quite a bit of money on. Those tools will then point you in a particular way to analyse data, a particular way to think about data.
The big advantage of code is well, you have the most flexibility in doing whatever it is you want to do with this data. So, if you come up with a new way to look at your problem through data, you can write the code to do that, whereas you might struggle to find a nice GUI-based application that will be able to do that for you already. So I think that’s the big advantage – that you can now unlock any data and you can look at that data in any way that you want once you can start to write some code.
So, really people are now using the code to create their own tools.
Exactly, yeah, and that’s the evolution that we’ll see – that basically as people will think of new ways to look at data, they’ll write code, they’ll build tools, then other people will start to just use those tools as they go along, but we’re at that kind of exciting place now where people are figuring out okay, given this new kind of data, given these new kinds of questions that I want to ask of this data, what is the right way to do that or what’s the tool to build that, and there’s great creativity and excitement about how people come up with new ways to look at data.
And one such person who had to learn to code was Jess Hedge, and until March this year, she was a postdoc working on infectious disease evolution, which involved a lot of genome sequence analysis of bacteria and viruses. Her work meant that she needed to spend a lot of time analysing and managing huge datasets. But before she started out on this type of research, Jess had never done any coding and she was quite reluctant to do it, but needs must and she was thrown in the deep end and had to learn fast to make sure that she could do her work.
So, during my PhD, I was using software that other people were developing, and I suppose my first taste of coding was really trying to understand their code, and I never really saw myself as a coder and it took me a long time. I was very reluctant. It took me a long time to realise that I’m going to have to pick up these skills myself.
And so, because I was trying to understand this software which was written in Java, I kind of went in at the deep end to be honest, and so my first experience of coding was probably not a great one and I was put off.
And it wasn’t until a lot of my colleagues and peers were using different coding languages, things like Python and R, that I saw the potential for it to really help my work and, as you say, we’re now in an era of big data and I think no matter what you’re doing in biology and also in the sciences more generally, you’re going to have to be able to wrangle your data and clean it up and being able to code in Python and R and even Bash is a great place to start.
So, how did you teach yourself?
Well, a combination of things really. It wasn’t a particularly well-thought-out plan of this is how I’m going to learn how to code. As I say, I started off learning Java which didn’t go very well, but it did get me thinking like a computer which I think is really important for learning how to code. You really have to be explicit in the instructions that you’re giving to the computer.
Can you give an example of the level of explicitness?
You can’t really make any assumptions, so even if you’re doing something as simple as renaming 200 files, so if you’ve missed something and haven’t specified the exact folder in which you want to rename all your files, you might end up renaming everything in that whole series of folders on your computer system.
The computer won’t guess that you want to do it just for a specific directory, so you’d have to be really clear.
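Editor's note: a minimal Python sketch of the renaming example, written for illustration and not taken from the episode. The function name `rename_in_folder` and the prefixes are hypothetical. The folder is passed in explicitly, and only files directly inside it are touched, so nothing in subfolders or elsewhere on the system is renamed by accident.

```python
from pathlib import Path


def rename_in_folder(folder, old_prefix, new_prefix):
    """Rename files directly inside `folder` whose names start with `old_prefix`.

    Returns the new file names, in alphabetical order of the original names.
    """
    folder = Path(folder)
    renamed = []
    # iterdir() lists this one directory only; it does not descend into
    # subfolders. sorted() takes a snapshot before any file is renamed.
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.name.startswith(old_prefix):
            target = path.with_name(new_prefix + path.name[len(old_prefix):])
            if target.exists():
                continue  # skip rather than overwrite an existing file
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

A call such as `rename_in_folder("results/plate_07", "sample_", "run_")` (a made-up path) would change `sample_1.txt` to `run_1.txt` in that folder and leave everything else alone.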
So, when you were learning how to code did you go to online resources or did you ask friends for support, or colleagues and peers for their help?
Yeah, I suppose when I really got stuck in was just after a course I did, a kind of introductory course to Python, which was actually run in the institution that I was working in by another colleague, so a more senior postdoc or a lecturer I think.
And this was quite an intensive course – it was two weeks – but it was with peers who I could then approach after the course as well, which I think is really important that you have a kind of buddy that you can go to and ask for help, just a second pair of eyes on your code.
Although it was two weeks, it was really just picking up four or five basic things, so for loops, if loops, lists, just these dictionaries, these structures which are the basics, and once I had them under my belt I was off really, and that’s all it took really, and I think one of the reasons that that course was such a success for me was that I was able to bring along a project with me.
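Editor's note: the handful of basics mentioned here (loops, conditionals, lists and dictionaries) already go a long way. The sketch below is an illustration only, not material from the course; the sample names and sequences are invented. It computes the fraction of G and C bases in a few short DNA strings.

```python
# Invented example data: a dictionary mapping sample names to DNA sequences
sequences = {
    "sample_A": "ATGCGC",
    "sample_B": "ATATAT",
    "sample_C": "GGGCCC",
}


def gc_content(seq):
    """Return the fraction of bases in `seq` that are G or C."""
    gc = 0
    for base in seq:        # a for loop over each character
        if base in "GC":    # a conditional
            gc += 1
    return gc / len(seq)


results = {}   # a dictionary: sample name -> GC fraction
high_gc = []   # a list of samples with more than half G/C
for name, seq in sequences.items():
    results[name] = gc_content(seq)
    if results[name] > 0.5:
        high_gc.append(name)

print(high_gc)
```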
So, I was already working on a problem that I wanted to address using some kind of coding or programming language but I didn’t know how to do it. So, taking along that project and that dataset meant that I had lots of different pairs of eyes looking at it with me and I could work out how is this coding language going to be of relevance to me.
So, you’ve mentioned a few different languages already – you’ve mentioned Java, Python, R – I imagine there are lots of computer coding languages. How do you choose which one’s the best one to start with?
There are some languages which are better for software development, but for me I really wanted something that would help me clean up big datasets, even renaming things in files rather than trying to do this by hand in Excel, and so that’s why I started with Python – I do a lot of my data wrangling in that. And then I learnt R because it’s designed for doing statistical analyses but it’s also brilliant at plotting results onto graphs and figures and doing a lot of data visualisation.
And then Bash, I use that basically for stringing together command line tools, so bits of software that you can run from the command line that can basically do the analysis from start to finish without me having to intervene, which means that the analysis can run pretty much overnight or during the day while I’m getting on with something else. So, all of these tools help me to be more efficient as a researcher.
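Editor's note: Jess describes doing this in Bash. The sketch below shows the same idea in Python, purely as an illustration and to stay in one language; it is not her pipeline. The step names are hypothetical, and the two commands are stand-ins that simply print a word, where a real analysis would call actual command line tools.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run each (name, command) pair in order and stop at the first failure."""
    completed = []
    for name, command in steps:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"step {name!r} failed: {result.stderr.strip()}")
        completed.append((name, result.stdout.strip()))
    return completed


# Stand-in commands: each just prints a word using the current Python
# interpreter. Real steps would be the command line tools for the analysis.
steps = [
    ("clean", [sys.executable, "-c", "print('cleaned')"]),
    ("analyse", [sys.executable, "-c", "print('analysed')"]),
]
```

Calling `run_pipeline(steps)` runs the steps from start to finish without anyone having to intervene, and raising an error on the first failed step means an overnight run does not carry on with bad input.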
Which is ultimately the goal of all this software development and coding.
Earlier this year you left academic research and you moved to a staff development role at the Nuffield Department of Clinical Neuroscience in Oxford.
Now, you spent a lot of time teaching yourself coding and working hard to get all the tools that you needed to do your research, so you’ve got this huge skill base and this knowledge. Do you think you’ll be able to apply any of that knowledge in your new role?
Definitely, so my new role is going to be analysing a lot of data about staff in our department and I think basically, having these skills and this confidence to be able to analyse any dataset with this set of tools is a real benefit. I now feel comfortable with a dataset of any size and any type of kind of messy data because I know that I have the skills from my coding experience to be able to get that data in a format in which I can analyse it properly and also visualise the results which is arguably just as important.
Do you have any advice for anyone who is an early career researcher who might be facing some challenges within their research where they might need some coding skills, coding advice?
From my experience, one of the most important things would be to try and find a course to go on, some kind of workshop. It might not work straight away – I had to go to a couple to get to that mindset – and also to find the right language for the sort of work that you’re doing as well.
And perhaps try and find a mentor or a buddy, someone within your department or group or wider university that you can kind of check in with occasionally and check that you’re doing things right.
Don’t be worried if you spend all your time googling errors and bugs because that’s all part of the learning process I think for coding. There’s so much out there already and I can guarantee that there’s probably an answer out there to every question that you come up with at this stage of coding anyway.
Thanks to Brian MacNamee from University College Dublin and Jess Hedge from the Nuffield Department of Clinical Neuroscience in Oxford. Now, next week we’ll look a little more at coding and Jeff Perkel, the Nature technology editor, will have a chat with Harriet Alexander who has run a coding course in Antarctica.
Access to the internet is significantly slower and more limited down in Antarctica than it is at a university where I would have taught back up in the States, so the primary problem that I ran into was being able to access the course materials and trying to download the software programmes that we needed for people to be able to run these materials on their own machines, which is a primary goal of Software Carpentry.
And we’ll also hear from Simon Hettrick who is the Deputy Director of the Software Sustainability Institute and the founding chair of the Research Software Engineers Association.
Thanks for listening. I’m Julie Gould.