Cracking the Cancer Code

Reproducibility as non-negotiable

ITCR Training Network Season 1 Episode 5


This podcast episode discusses the importance of reproducibility in modern cancer research and data science. Through interviews with experts like Dr. Roger Peng, Dr. Casey Greene, and Dr. Jaclyn Taroni, it explores how making research reproducible - by sharing code, data, and detailed documentation - not only helps verify scientific findings but also makes research more efficient and builds a stronger foundation for future discoveries in cancer treatment. The experts emphasize that while implementing reproducible practices requires upfront effort, it ultimately saves time and protects research integrity.

0:07: You're listening to Cracking the Cancer Code, a podcast series about the researchers who use data to fight cancer. 
 0:13: I'm Dr Carrie Wright, senior staff scientist at the Fred Hutchinson Cancer Center. 
 0:17: I'm head of content development for the ITCR Training Network, 
 0:20: a collaborative effort of researchers around the United States, funded by the National Cancer Institute, aimed at supporting cancer informatics and data science training. 
 0:29: I'm Candice Savonen. 
 0:30: I'm a data scientist at Fred Hutchinson Cancer Center. 
 0:33: I'm a tech lead for the ITCR Training Network, and we work closely with a variety of dedicated cancer researchers who are on the forefront of cancer informatics, shaping the field's future. 
 0:44: Last episode, we discussed how everyone in biomedical science works with data in some shape or form. 
 0:50: There's a variety of different types of data, but many of today's data sets are quite large. This has really changed the shape of the field and made it ever more important that we have tools to perform data analysis on the large sets of data that we now have available. 
 1:08: Yeah, it is not only important that people know how to use computers, but that they know how to use data analysis tools. In a researcher's toolkit, it's not just test tubes anymore. 
 1:19: They also need to know how to do some programming. 
 1:24: So science is a community effort to cumulatively build knowledge, at least when it's done right. 
 1:30: When we learn something new by the scientific method, we build on that by our next pieces of work. 
 1:37: But this means that the foundational blocks of a particular field, 
 1:40: so in our case, cancer research, need to be solid. 
 1:44: That way, when we build the next blocks, everything continues to help and inform how we can treat cancer. Sometimes in life we are just lucky, and sometimes this can happen in science too. 
 1:57: But in science, we talk about this in terms of probability. 
 2:00: So sometimes you can get results and it just happened to be that all the stars aligned. 
 2:05: But if you make your experiments and your data analysis robust, it will make it extremely unlikely that that is the case. 
 2:14: And that is a little bit of what reproducibility, and the reproducibility crisis that people have been talking about in science, is built on. In order to make sure that all of the money and the time and the effort and the blood and the sweat and the tears that we're putting into science is getting us closer and closer to what is true about cancer, 
 2:33: we need to make sure that the methods that we are using are not only documented very well, but that other people can take those methods and understand what was done. Yeah, I really agree. 
 2:44: Candice, it's super important for us to be really clear about what it is that we're doing with data from the moment that we start with it, so that people can better understand how we've done our data analyses and can build upon them if we share our code and all the details of what went into them. 
 3:03: Other people can try to recreate that to see if we did everything correctly, if there's any nuances, in terms of them trying to reproduce the results themselves, so that we can better understand what's really going on with cancer. 
 3:17: So as cancer research, or indeed research in general, becomes more data intensive and coding driven, increasing complexity means more decisions need to be made about the data science process along the way to ensure rigorous research that leads to well-supported results 
 3:34: we can be confident in. 
 3:35: We need to ensure that the data science part is not only clear and transparent but publicly available as much as possible, while protecting patients. Let's talk about what rigorous and reproducible science looks like in the age of informatics. 
 3:48: To begin our discussion, we talked to Doctor Roger Peng, whom many people know as one of the experts on reproducibility in data science. 
 4:01: So I'm Roger Peng. 
 4:02: I'm a Professor of Statistics and Data Sciences at the University of Texas at Austin. 
 4:07: Roger happens to be a friend of ours and colleague. 
 4:10: We've collaborated with Roger on several projects related to reproducibility. 
 4:15: We used to work with him at the Johns Hopkins Data Science Lab. 
 4:20: A lot of people in science mostly hear about reproducibility in relation to a possible reproducibility crisis. 
 4:26: We asked Roger what people mean when they say there's a reproducibility crisis in science. 
 4:31: Well, so first I have to ask you a question right back, which is what do you mean by reproducibility? 
 4:37: There are a lot of words out there and they kind of have various meanings. 
 4:40: And so people often talk about reproducibility crisis, the replication crisis, the whatever crisis. 
 4:47: And I think they're all related in some way, right? 
 4:49: And so I think when you say reproducibility crisis, I think what you mean is kind of the lack of consistency of scientific findings across different studies. 
 5:00: So for example, one study shows that I don't know stretching can improve heart disease or something like that. 
 5:05: And then another study like a bigger study maybe tries to ask the same question and doesn't find anything, right? 
 5:10: So that's an example of what I would call a replication failure. 
 5:16: In terms of reproducibility, the way I define reproducibility is more in the computational sense. 
 5:19: So can I take the data that you analyzed, the approach that you took, and the software that you used, and kind of produce the same results, the same numerical results? Which is very different from, say, trying to independently replicate a separate study. 
 5:34: In essence, we're starting with repeatability as the very bottom, which is the foundation. 
 5:38: That means that with the same researchers, same code, same data, we can get the same results. We can't move on to reproducibility until we prove to ourselves that repeatability is evident. When we get to reproducibility, that means we're still using the same code and the same data, 
 5:55: but now we might have shared this data and code with another researcher. 
 5:59: Once reproducibility can be verified and is robust, we can move on to replicability, which is where we might have different data, but we will be using similar code, and it may or may not be a new researcher. 
 6:12: But the idea is that we've moved on to another data set that has a similar principle. 
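The ladder described above, repeatability first, then reproducibility, then replicability, can be sketched in a few lines of Python. This is an illustrative toy example with made-up data, not code from the episode; the key idea is that pinning every source of randomness, and fingerprinting the inputs, is what lets a second researcher confirm that the same code and the same data produce the same numerical results.

```python
import hashlib
import json
import random

def analyze(data, seed=42):
    """Toy analysis: a bootstrap mean with a pinned random seed.

    Pinning the seed is what makes the run repeatable: the same
    code on the same data yields the same numerical result.
    """
    rng = random.Random(seed)
    resamples = [rng.choice(data) for _ in range(1000)]
    return sum(resamples) / len(resamples)

data = [2.1, 3.5, 4.0, 5.2, 6.8]  # made-up measurements
first = analyze(data)
second = analyze(data)  # rerun: identical, because the seed is fixed
assert first == second

# Recording a hash of the input data lets a second researcher confirm
# they are reproducing the analysis on exactly the same data.
fingerprint = hashlib.sha256(json.dumps(data).encode()).hexdigest()
print(first, fingerprint[:12])
```

If the seed were not fixed, each run would resample differently and even the original analyst could not repeat their own numbers, which is exactly the failure at the bottom of the ladder.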
 6:18: So Roger has been thinking about reproducibility and advocating for reproducible science practices for quite a while. 
 6:24: Hear what he has to say. 
 6:26: I was working in the field of epidemiology and there was kind of a shift in epidemiology, especially in environmental epidemiology to moving towards kind of more computational research. 
 6:36: And this was like a big change. 
 6:38: And because we have lots of data and large data sets, you know, from administrative data sets, kind of publicly collected data and there was a lot of interest in kind of analyzing that data to answer environmental health questions. 
 6:48: But there wasn't really a lot of experience, kind of institutional experience in the field for kind of doing this on computers, you know, using large computations, complex statistical models. 
 6:58: And so there weren't any like kind of best practices or early guidance on how to do this. 
 7:03: So a lot of a lot of times people would just publish research and say here are the results and there would be very few details about how the analysis was done, what software was used, you know, none of that kind of stuff. 
 7:12: Nothing, nothing like what you might see today in a typical paper, a typical computational paper. 
 7:03: So I and two of my colleagues wrote a paper about reproducibility in epidemiology, trying to kind of lay out just a couple of basics, what we called the minimum standard for reproducing an epidemiologic study. 
 7:29: So clearly you needed the data that was used and, and the software, the code that was used to analyze the data. 
 7:36: You know, I think nowadays you might go a lot further than that. 
 7:38: But at the time, this was like in 2006, this was like, let's just get some basics out there, right? 
 7:43: And it was extremely controversial. 
 7:45: The idea that you would just make your data available to other people, to strangers, to analyze was just, you know, ludicrous. 
 7:52: The problem is that previously and actually still a lot today, the way that we advance our careers in science is we publish papers and we often don't want to get what's called scooped, meaning we don't want to publish information about our research. 
 8:07: And someone takes that information and then they publish a paper about it. 
 8:11: That's how we get promotions, that's how we get new jobs. 
 8:16: The idea of sharing data, software, and code has actually been a kind of controversial topic historically, and that's been changing a lot over time. 
 8:26: Now, I think, almost 20 years later, it's still a little bit sensitive. 
 8:30: I think people are a little tense about it, but there's generally a sense that that's the right thing to do. 
 8:34: It's just a question of whether you're going to do it, right. 
 8:36: But it took a lot of time. 
 8:37: But I think most people felt like, well, replication is the ultimate standard. 
 8:40: Why bother with this reproducibility stuff because we're just gonna redo the whole study anyway. 
 8:44: And I think ultimately that is true, maybe over a 20 year, 10 year time span. 
 8:49: But a lot of times decisions have to be made right now, especially in kind of regulatory types of environments. 
 8:54: And I think we have to just go with the evidence that we have right now. 
 8:57: And so reproducibility, we argued it was like kind of a minimum standard is like, look we can't redo the whole study, but at least we can check it and make sure that the evidence is strong or at least as strong as it is claimed to be by redoing the analysis. 
 9:09: One of the barriers to pushing systemic changes is fighting against kind of what people are used to or what has been. 
 9:18: And if the culture is not really pushing people in a positive way or incentivizing them to participate, it can be really difficult for even the most well-intentioned researchers to participate in reproducibility safeguards. 
 9:32: I've published studies where there have been software bugs, and, you know, you try your best to kind of weed them out before they hit publication, but it can be hard sometimes, especially as the analyses grow more complex. 
 9:41: We're building complex systems here. 
 9:43: And anyone who's ever built a complex system in any environment knows that errors are going to occur and problems are going to happen. 
 9:50: And I think one of the things that we need to do as a scientific community is just be a little bit more accepting of that reality. 
 9:57: The main benefit of publishing something that's reproducible or at least with the intent of being reproducible is that if there is a problem, we can check it out, right? 
 10:06: Just like with any software, you know, what's the first rule of software debugging? Create a reproducible example. 
 10:11: And so there's two things that we get out of publishing research with the intent of reproducibility. 
 10:16: One is that if something goes wrong, we can check it out. 
 10:19: And the second is that even if nothing is going wrong, we can learn from it. 
 10:22: Right? 
 10:22: So what are the tools for learning about data analysis? 
 10:25: There aren't that many, actually, because usually when we publish a data analysis, there are no details, right? 
 10:29: There's just like a paragraph in the method section. 
 10:32: So I think if we have these kinds of details, right? 
 10:34: It's a way that we can use to learn about, 
 10:36: well, how was the analysis done? 
 10:37: How could it be done differently? 
 10:39: Things like that. 
 10:40: And so I think that's another important aspect. 
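Roger's "first rule of software debugging" is to create a reproducible example: strip the failure down to the smallest self-contained piece of code, with fixed inputs, that still triggers it. A hypothetical sketch of what that looks like; the `normalize` function and its bug are invented purely for illustration:

```python
# Minimal reproducible example: the full pipeline is irrelevant;
# only the few lines that trigger the failure need to be shared.

def normalize(counts):
    """Divide each count by the total (hypothetical analysis step)."""
    total = sum(counts)
    return [c / total for c in counts]  # fails when total == 0

# Works on typical input:
print(normalize([10, 30, 60]))  # [0.1, 0.3, 0.6]

# Smallest input that reproduces the failure, so anyone can rerun
# the bug report and verify a fix:
try:
    normalize([0, 0, 0])
except ZeroDivisionError:
    print("reproduced: division by zero on an all-zero sample")
```

Publishing the analysis with its code and data gives other researchers exactly this ability: when a result looks off, they can rerun the failing piece rather than guess at it from a paragraph in the methods section.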
 10:42: Despite some of the past stigma, Roger is hopeful about the changes he's seeing in the culture around publishing code and sharing data. 
 10:49: I mean, one of the things I like to highlight is that a lot has changed over time, I think for the better, in, I think, many fields. These ideas of publishing your data and making things reproducible have spread quite a bit, and a lot of journals now require it. 
 11:00: So I think there's been a lot of progress made, at least in the time I've seen. 
 11:03: And so I find that very encouraging. The benefit, largely, is that it's kind of like a common good, right? 
 11:09: It's producing a common good. 
 11:11: So if you produce something that's reproducible and someone else produces something that's reproducible, then you both benefit both from methods being available, data being available. 
 11:20: And so there's just kind of like a multiplicative type of effect, right? 
 11:23: And if two people do something, then you get more than twice the benefit. 
 11:25: I think. 
 11:26: So, I think it's a little hard because it's not like there's an immediate reward for publishing something that's reproducible; you just have to kind of do it on faith, in some sense. Most of the time now, when people do it, it's because someone told them to or there was some requirement somewhere. The field has been changing as well in terms of how we interpret the productivity of researchers. 
 11:47: So now it's not only just about producing papers, but it can also be about releasing data sets and creating scientific software for other people to analyze their own data. 
 12:00: At the end of the day, we had to figure out as a research community how we can make doing the right thing 
 12:08: also the easy thing. By incentivizing people to share their code, and to experience promotions and benefits to their career from doing these good practices, we can help encourage and nurture a field and culture that appreciates these things. 
 12:32: So we've talked a lot about why reproducibility, repeatability, and replicability, why all these practices, matter and are important. 
 12:41: But how do we actually do these practices to be more specific? 
 12:45: We've talked to two cancer informatics researchers who have implemented a culture of reproducible analysis in their labs. 
 12:52: And we asked them, first of all, why did you do it and how did you do it? 
 12:55: Turns out that these two researchers are actually folks who have taught me a lot of what I know. Both Dr Casey Greene and Dr Jaclyn Taroni, at various points, worked for the Childhood Cancer Data Lab, which is where I learned a lot about reproducibility and the skill sets that are involved in making science reproducible. 
 13:13: Both of them kind of taught me what I know. 
 13:15: Dr Casey Greene has always had a really great philosophy on reproducible science and has taught it to his grad students and postdocs, of which Dr Taroni was one, and she's continued on that sort of teaching legacy of making sure that folks are doing really well-thought-out, transparent, publicly available science. 
 13:37: I respect them both a lot for the work that they do. 
 13:40: They both are really good at reproducible science and the skill sets that are involved and they have not only made a career of it themselves, but they've tried to help other researchers make their work more reproducible. 
 13:56: Hi, I'm Casey Greene, and I chair the Department of Biomedical Informatics at the University of Colorado School of Medicine. 
 14:03: I don't know, I am someone who has said some things on these topics that occasionally people decide they want to listen to, and usually they decide they don't. Perspective-wise, I have a research lab. 
 14:13: I'm also currently the interim director of our Center for Personalized Medicine where we use genetics to guide care for patients. 
 14:19: I'm a computational biologist. 
 14:20: So that means we use the tools of computing to study biological systems. 
 14:26: One of the systems that we've worked on quite a lot is cancer. 
 14:29: In particular, we've done work focused on high grade serous ovarian cancer as well as pediatric and rare cancers. 
 14:36: So for our work, I usually say if you combine public data, the transcriptome, and machine learning, we're probably interested. Casey doesn't just see reproducible practices as a way to verify the results and the analyses that his lab does. 
 14:49: He also sees them as a way to make cancer informatics research more efficient and faster. 
 14:53: Well, I think the process of science and the process of research is, you know, if you find yourself going in circles, that's a tough place to be. 
 15:01: And in computational biology, you can find yourself going in circles a lot because often there are patterns in data that we may not fully understand a set of them are meaningful and a set of them are not meaningful. 
 15:12: If you aren't careful about how you do your research and how you document your research, you can end up in the same place again and not realize it. 
 15:19: That's not a good place to be. 
 15:20: And so, you know, I think from when I was a student, through being a postdoc, through starting my own lab, I think part of the way to try to be efficient is to just make sure that if you get to the same place, you know you're in the same place, you know why you're in the same place, and then you can figure out how to get out of that place if you're not learning what you hope to learn. 
 15:37: So, I guess, I think it's been a long process by which it has become apparent to me that this is critically important. Reproducibility not only affects how we share our data with others; sometimes the people we share the data and the code with are us in the future. 
 15:52: There's a quote that's floated around a lot of science spaces that's been attributed to Karl Broman, but also is from Mark Holder, which is that your collaborator is actually you in the past, but you don't answer emails. 
 16:06: That is a really cheeky way of saying that it's important to document your process, 
 16:13: so that in the future, when you forget what today-you knew, future-you can look back and understand what today-you was doing. 
 16:21: Casey's lab has a policy of requiring reproducible practices which came into place after some bad timing and unfortunate luck. 
 16:29: Not unlike some unfortunate luck that many other researchers have experienced in their work. 
 16:34: There are some things in our lab that are kind of negotiable. 
 16:37: And there are some things in our lab that are non negotiable. 
 16:39: So I would say a moment at which it became clear to me that this was really a nonnegotiable issue was there was a moment where a lab member was out of town for an extended duration and did not have access to their computer. 
 16:53: And yet they had a manuscript that was out for re review and the reviews came back and there was kind of a time window on getting the revisions done. 
 17:02: And that time window overlapped with when they were out of town and their source code was not available in a system that I had access to, nor that they had access to. 
 17:11: And yet there was a minor reviewer request that asked us to check one thing. 
 17:15: So at one point, there was a period where, over about a week, I had the opportunity to go from their method section to a re-implementation of everything that they had done, so that we could do this one little experiment that a reviewer asked for and get this paper back out. 
 17:32: And so I had the opportunity to re-implement their entire software pipeline based on their method section, which, actually, to their credit, I was able to do. 
 17:40: So reproducibility? 
 17:42: Yes, very important. 
 17:43: And those artifacts are just as important as having a good process. 
 17:46: And so I think that's where I kind of started putting more attention on: 
 17:49: OK, 
 17:49: it's not just about what you write and sort of following checklists around authoring; it's really about getting the artifacts right. 
 17:57: Because there were, you know, as I'm doing this re-implementation, there were moments where I had to make decisions, like, do I use the default parameters? 
 18:04: Were these parameters changed? 
 18:05: Maybe the parameters had changed between different versions. 
 18:10: So I had to kind of make some assumptions about what the person had done. 
 18:13: And I think if we'd have the artifacts in place, we wouldn't have had to do that. 
 18:16: I have a personal example too of where some of these practices have really saved me. 
 18:21: So when I was a post doc, I had a situation where I got an email on the way to work saying that there had been a pipe leak and there was a lot of damage, but it appeared to be mostly isolated to my cubicle. 
 18:37: And then I had to do the rest of my commute knowing that there was this problem and I was, you know, horrified and concerned and indeed my computer was destroyed. 
 18:48: The pipe bursting happened right above my desk and the way the cubicle was shaped, the water ricocheted into my monitor slash computer. 
 18:58: I had one of those computers that's basically embedded into the monitor. 
 19:01: Luckily, I had been putting most of my code and other things on GitHub and our internal server. 
 19:08: But otherwise I would have been in really bad shape. 
 19:12: I would have lost my work and that would have been really devastating years of work potentially. 
 19:19: But because Carrie knew some of these reproducibility skills, she saved herself a lot of time and stress. 
 19:26: Yeah, definitely worth it. 
 19:32: My name is Jaclyn Taroni. 
 19:33: I am the director of the Childhood Cancer Data Lab, which is a program of Alex's Lemonade Stand Foundation, a pediatric cancer research funder. 
 19:42: And we do things like try to make data easily accessible for pediatric cancer researchers, building tooling and also putting on training workshops to try and make it easier for people to work with their own data, work with publicly available data and improve their own reproducible research practices. 
 19:59: Jackie is a former colleague of Casey's who experienced his lab's policy of reproducible practices firsthand. 
 20:06: She has carried them forward in her own lab and continues to be the director of the CCDL, teaching pediatric cancer researchers all around the country about reproducible practices. 
 20:16: So I am mostly at this point in my career, supervising people who touch code and data every day as a person in a leadership position. 
 20:25: One of the things that helps me sleep at night is the fact that we have code. 
 20:29: So how does this help reproducibility? Going back to that definition that we talked about before: 
 20:34: it's a different analyst running the code, who has access to the same data and the same code, and they get the same results. 
 20:40: One thing I think about a lot, and that has been very influential for me, is Hillary Parker's preprint from 2017 called Opinionated Analysis Development. 
 20:49: And one of the points Doctor Parker made in that preprint is that code review explicitly allows you to test certain facets of reproducibility. 
 20:57: And also I would say certain reasons we find reproducibility valuable. 
 21:02: It's transparency into work, it's the ability to validate work. 
 21:06: It's the ability for someone to build off of it. 
 21:08: Jackie also encourages everyone who wants to start working reproducibly to start implementing these practices. 
 21:14: Even if they're nervous about getting started. 
 21:16: The most important thing I think people can do is start now like earlier on in someone's career, a person who is touching data, touching code. 
 21:24: I think it's important to learn by doing. 
 21:26: The most important thing is to start trying to work reproducibly. If you don't use a version control system yet, start using one, get frustrated by it, delete the repo off your computer, go for a walk, re-clone the repo, do all of that, and then you will get better over time. 
 21:43: And can you work with folks in your lab to set up code review? 
 21:48: Now, I think it's going to take your supervisor being supportive of that for it to really go as well as it can, for it to be optimal. 
 21:56: But can you start working from the premise that eventually someone else is going to look at your stuff? You write code as if you are going to show it to someone else; you document your metadata as if you're going to show it to someone else, because often you are. Right? Often your funder or a journal is gonna make sure that you do that or require that you do that. 
 22:19: And then six months from now, because you've put in the work to document that metadata and to document that code, and your memory is not perfect, because almost no one's is, you are also going to reap the benefits of putting that into practice. 
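One concrete way to put Jackie's advice into practice is to write the parameters and environment of every analysis run into a small metadata file next to the results, so "future you" can see exactly what was done. This is a minimal sketch; the file name and the parameter names are placeholders, not anything from the episode:

```python
import json
import platform
import sys
from datetime import datetime, timezone

def save_run_metadata(path, params):
    """Record when an analysis ran, on what platform, and with which
    parameters, so the run can be understood and reproduced later."""
    metadata = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "parameters": params,
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

# Hypothetical analysis settings, documented alongside the results:
meta = save_run_metadata(
    "run_metadata.json",
    {"normalization": "tpm", "min_counts": 10, "seed": 42},
)
print(meta["parameters"])
```

Committing a file like this to version control along with the code means the question "which settings produced this figure?" has an answer six months later.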
 22:42: So in this episode, we talked a lot about the added complexities now: 
 22:48: the tech advances that have come about have made science more complicated than ever. 
 22:54: And that means the decisions around the data have become more complicated than ever. 
 22:58: And in order to make sure that the science we are doing is leading to some sort of truth that is going to help cancer patients, we need to be reproducible. 
 23:08: And I'm actually really hopeful about how many conversations folks are having about reproducibility. 
 23:13: It's certainly on the forefront of people's minds and there are more grants going towards it, projects, experts who are focused on it. 
 23:21: So I think that that means that the future will be bright when it comes to data and not only just data but also data that helps us learn about cancer research. 
 23:31: The other really great thing about reproducibility, that has really helped me, is that it really makes your life easier as a scientist. It seems like a lot of work upfront, and it kind of is, to learn how to use all of these tools and to figure out all these skills. 
 23:45: But now I no longer have files that are named final, final, final project. 
 23:51: And I have another one called final, the most final. 
 23:55: And I'm not sure which one is actually the final version of my code. 
 23:59: I don't have any of that anymore. 
 24:00: It's just an incredible tool to help you be super organized naturally, without really trying, and to help you collaborate with others, with tools to help you do code review and pull requests. 
 24:14: So obviously, Carrie and I are big fans of the reproducibility skill set. 
 24:18: And you might ask yourself at this point: you've said the word reproducibility a lot, 
 24:23: but, like, how do I actually get started with the skill sets 
 24:26: I need to incorporate this? If you are a researcher and you are thinking that right now, we do have our website, itcrtraining.org, that has a lot of courses on there, 
 24:36: and a lot of them are also about reproducibility. 
 24:38: So we'll point you there if that is kind of the boat you're in. 
 24:43: But if not, and you just like joining us to talk about data, then, you know, happy to have you. There are a lot of materials out there to help you get started with GitHub. 
 24:52: Some people find them really challenging. 
 24:54: We've tried to make our resources really user friendly. 
 24:57: So I recommend checking them out, and we hope that you find them useful. 
 25:02: In our next episode, we'll look more closely at the role of the data scientist in research and how this role has changed in the era of big data. 
 25:10: Thank you for listening to Cracking the Cancer Code. 
 25:13: This podcast is sponsored by the National Cancer Institute through the Informatics Technology for Cancer Research program, grant number UE5CA254170. 
 25:23: The views expressed in this podcast do not reflect those of our funders or employers. We would like to especially thank Roger Peng, Casey Greene, and Jaclyn Taroni for helping us out with this episode and talking to us about reproducible practices.