The Science in Real Time (ScienceIRT) podcast serves as a digital lab notebook: an open-access, conversational platform that brings the stories behind cutting-edge life science tools and techniques into focus. From biologics to predictive analytics and AI-powered innovation, our guests are shaping the future of therapeutic discovery in real time.
Welcome to Science in Real Time, a podcast produced by Araceli Biosciences where we share the stories behind the breakthroughs shaping the future of drug discovery. I am your host, Carli Reyes, and today I'm joined by Dr. Naina Kurup from Ginkgo Datapoints, a business unit within Ginkgo Bioworks. We'll be exploring the release of their GDPx3 dataset: what it is, how it came to life, and the possibilities it opens for the life science community. Let's dive in!
Carli:Naina, welcome to Science in Real Time. I'm very excited to have you here. I also wanted to start by thanking you for taking the time during what I imagine has been a very busy post-release moment. How has that been for you folks?
Naina:Definitely, thank you for having us, Carli. It's so great to talk to you and the rest of the Araceli team. Yeah, we released the dataset that we're talking about, GDPx3, a couple of months ago at this point. And we also just last month released the dataset on Hugging Face. So a lot of people have been reaching out to us to collaborate or get more information about the dataset.
Naina:So it's been exciting times.
Carli:That sounds wonderful. And I was wondering if we can go ahead and start talking a little bit about really the beginning of this data set. And I think that before we go and understand this data set, I wanted to see if you can tell us a little bit about your role in Ginkgo Datapoints and what originally pulled you into this kind of work, especially since it's such a cool blend of biology and systems thinking.
Naina:Of course, yeah. So I come from an academic background. Ginkgo is actually the first industry role I took on. I joined Ginkgo a little over a year and a half ago. And my experience before this, through grad school and my postdoc, was mostly in using all sorts of microscopy techniques to analyse different aspects of neuroscience, including synapse formation, axon regeneration, and all sorts of dynamics within the cell.
Naina:So I found out about the role at Ginkgo through a friend, and it seemed like a great fit for my expertise and the capability of building out a platform because at that point, Ginkgo was transitioning into a business unit called Ginkgo Datapoints, where our goal is really to build platforms that serve as a way for generating data for drug discovery.
Carli:Wonderful. That sounds fantastic. And can you talk a little bit about what the difference is between Ginkgo Datapoints and Ginkgo Bioworks? It seems like Ginkgo Datapoints is a subsection within Ginkgo Bioworks. Am I getting that correctly?
Naina:Yes, yes, I do understand. This is a question that we get quite often. Ginkgo Bioworks is a pretty well known name in the biotech industry. They've been around since 2008. Ginkgo Datapoints is a business unit that we inaugurated a little under a year ago, in August 2024, as part of Ginkgo Bioworks.
Naina:And the goal is really the same. Ginkgo Bioworks and Ginkgo Datapoints want to make biology easier to engineer for everyone. And specifically, the focus of Ginkgo Datapoints is large scale data generation for mammalian products, including drug discovery, target validation, and other assays that can span a range of customers, including academia, tech bio companies, for which we provide AI-ready datasets, as well as big pharma. And it's been a year.
Naina:It's been a very exciting ride so far. And I'm glad we get a chance to talk about it.
Carli:That's amazing. That sounds incredibly fantastic. And I'm very glad to start learning a little bit more about how Ginkgo Bioworks is expanding. And that brings me then to talking about the protagonist of the day, the GDPx3 data set. And as I'm sure that our listeners have been seeing, that name has been floating around a lot lately, especially in the systems biology and data circles.
Carli:So for people hearing about it for the first time, how would you describe or what would you say exactly is the GDPx3? And why would you say it's such a big deal right now?
Naina:Of course. So as all of us are aware, the world is definitely moving towards a more AI focused approach towards many aspects of our life and drug discovery is no exception. And one thing that all the AI models need is data. And at Ginkgo Bioworks, we really want to enable large scale data generation for such efforts. As a start for that, and just provide data for open science, we've decided to use our platform to generate data sets that are freely available.
Naina:So the name GDPx3 really translates to Ginkgo Datapoints Data Drop 3. So there's a GDPx1 and a GDPx2, and then we have GDPx3. So GDPx1 and GDPx2 are data sets that primarily involve transcriptomic data, so we've done large scale compound library perturbations in different cell types and generated transcriptomic data using a technique called DRUG-seq. And GDPx3 is a complement to those two data sets, using some of the same cell lines and some of the same compounds to generate high content imaging data using cell painting. So the hope is that GDPx3 can be used for model training or drug discovery in different ways, either on its own or in combination with the GDPx1 and x2 datasets that we've released.
Carli:That sounds wonderful. And I cannot help but want to get into the details of it, because I think that that's where it really gets very fascinating. So I was wondering, I hear a little bit about the GDPx3, GDPx2, GDPx1. What kind of data specifically would people be able to find within the GDPx3? I know we have transcriptomics, metadata, perturbations info.
Carli:How can you break it down for us?
Naina:Of course. So I think when we talk about GDPx3, it's great to talk about it in combination with GDPx2. So for both data sets, the fundamental perturbation is chemical perturbation. So we use small molecules from a publicly available chemical library called the LOPAC library. So we use a subset of those compounds and we use different cell types.
Naina:The advantage is we use a mix of primary cell types as well as commonly used cancer cell lines, and then we dose these cells at different concentrations of these small molecules and then do a readout of either transcriptomics, which comes as GDPx2, or high content imaging, which comes as GDPx3. So when we give you the GDPx3 data package, what you would get is the raw image files, you'd get a little explainer of how we've acquired the data, and you'd also get all the associated metadata for each sample. Each sample would be available in this situation, and you'd have information about what cell type it is, how long it's been treated, what concentration of the compound was used, and how many replicates we have of the same sample in our data set. And all of this hopefully will help you to use the data set well.
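The per-sample metadata described above (cell type, treatment duration, compound concentration, replicates) can be pictured with a minimal Python sketch. The field names, file paths, and values below are assumptions made up for illustration; they are not the actual GDPx3 schema, which should be taken from the explainer shipped with the data package.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    """One imaged sample. Field names are hypothetical, not the GDPx3 schema."""
    cell_type: str
    compound: str
    concentration_um: float
    treatment_hours: int
    image_path: str


def replicate_counts(records):
    """Count records sharing one condition (cell type, compound, dose, time)."""
    return Counter(
        (r.cell_type, r.compound, r.concentration_um, r.treatment_hours)
        for r in records
    )


# Toy rows with made-up values, only to exercise the function.
records = [
    SampleRecord("cell_type_A", "compound_1", 1.0, 24, "plate1/A01.tiff"),
    SampleRecord("cell_type_A", "compound_1", 1.0, 24, "plate2/A01.tiff"),
    SampleRecord("cell_type_A", "compound_1", 10.0, 48, "plate1/B01.tiff"),
]
counts = replicate_counts(records)
for condition, n in counts.items():
    print(condition, n)
```

Grouping on the full condition tuple is one simple way to check how many replicates a given condition has before using the images for training or analysis.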
Carli:I cannot help but think that this definitely doesn't sound like just an upload and go situation. It's definitely something that requires a lot of preparation. And I can imagine that there were a few hurdles, not only scientifically, but operationally with something of this scale and magnitude. I'm wondering which challenging aspects of getting this data set into the shape it's in now come to your mind when thinking about how this came to be, scientifically and operationally?
Naina:That's a great question, Carli. So this was, to start with, an effort across a team of people, including imaging scientists like me, bioinformaticians, and scientific computing folks. And the idea was that we wanted to showcase our ability to generate such data sets at scale and with rapid speed. Something about GDPx3 that we're quite proud of is that we were able to go from actually seeding the cells in the plates to getting the primary analysis of the data in one week. And the perturbations that we've done include getting cells seeded at a certain density, then doing compound perturbations at different concentrations, including positive and negative controls. And we also image these cells at a twenty four hour time point and a forty eight hour time point.
Naina:So the goal here is that you have diversity of data in both the concentrations of compounds as well as the time points that you get this data.
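The design described above (compounds at several concentrations, positive and negative controls, imaging at 24 and 48 hours) can be sketched as a simple enumeration in Python. Only the two time points come from the interview; the compound names, concentrations, and control labels are placeholder assumptions, not the actual GDPx3 layout.

```python
from itertools import product

TIME_POINTS_H = (24, 48)  # the two imaging time points mentioned in the interview


def build_conditions(compounds, concentrations_um, controls,
                     time_points_h=TIME_POINTS_H):
    """Enumerate every treatment condition in a compound x dose x time design."""
    conditions = []
    # Each compound is tested at each concentration and each time point.
    for compound, conc, hours in product(compounds, concentrations_um, time_points_h):
        conditions.append({"treatment": compound, "concentration_um": conc,
                           "hours": hours, "role": "test"})
    # Controls carry no dose in this sketch; they are repeated per time point.
    for control, hours in product(controls, time_points_h):
        conditions.append({"treatment": control, "concentration_um": None,
                           "hours": hours, "role": "control"})
    return conditions


# Placeholder inputs (hypothetical names and doses).
conditions = build_conditions(
    compounds=["compound_1", "compound_2"],
    concentrations_um=[0.1, 1.0, 10.0],
    controls=["negative_control", "positive_control"],
)
print(len(conditions))
```

With two placeholder compounds, three doses, two controls, and two time points, this enumerates 2 x 3 x 2 test conditions plus 2 x 2 control conditions.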
Carli:That sounds wonderful. And I absolutely agree that a great team definitely does make the difference. And I think this is a perfect example of how you folks are basically moving the needle forward and creating a path for what science is going to be in the future. And it is such an honor to be able to have you folks on the podcast. And this naturally then makes me think about what this dataset unlocks.
Carli:So basically, what or who do you think stands to benefit the most from a dataset of this nature? Is it mostly tailored for academic researchers or perhaps pharma pipelines, synthetic biologists? Or alternatively, is it really for anyone just working at the cell level? What do you think?
Naina:So we're definitely user agnostic when it comes to who wants to use our data set. We've had download requests from both our website and Hugging Face, and there's been a diversity of folks who've downloaded the data set and asked us questions about it. We've had people from academia, people from smaller tech bio companies, as well as larger pharma and biotech companies reach out to us for this dataset and additional collaborations that we're excited about. So definitely whoever benefits is our user base. I think especially for academic labs that probably do not have the resources to generate such a dataset, I think this is really valuable.
Naina:And for tech bio companies that are starting out and want to see what kind of data Ginkgo can generate for them, this is a great primer to what we can do for them and how we can partner with them.
Carli:That sounds marvelous. And I hope that all of our listeners are taking note 'cause this is an incredibly good resource for all of you folks to have. And this is also a great segue to a big player that I wanted to think and talk about a little bit with you, which is cell painting. It is such a powerful and elegant technique, but I know that it's not without its challenges. So I was wondering, what did cell painting in particular, that workflow, look like in this context?
Carli:And is there anything that you learned from it?
Naina:Of course, I think cell painting definitely has been modified to adapt to different cell types and different questions. So we kind of modified the cell painting protocol a little to suit the workflows that we have. And something that really helped push the workflow forward was having an Araceli Endeavor microscope, because previous efforts at Ginkgo and at other places have used other high content imaging systems where getting that cell painting data, which is imaging in four to five channels at around four to five sites per well in a 384-well plate, would take hours per plate. Whereas with the Endeavor, we've been able to generate the dataset in under ten minutes for one plate, which has definitely been something that really drove us to get such a speedy, scalable workflow. So we can do both the actual staining and imaging of all the plates in the same day, which has definitely been so much easier to work with than with previous techniques.
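The plate numbers quoted above translate into a rough image count with some back-of-the-envelope arithmetic, sketched here in Python. It assumes each site and channel pair yields one image and ignores everything else that affects real acquisition time (stage movement, autofocus, exposure), so treat it as an order-of-magnitude estimate rather than a specification of the Endeavor.

```python
WELLS_PER_PLATE = 384  # plate format mentioned in the interview


def images_per_plate(sites_per_well, channels, wells=WELLS_PER_PLATE):
    """Images per plate, assuming one image per (well, site, channel)."""
    return wells * sites_per_well * channels


def min_images_per_second(n_images, minutes):
    """Average rate needed to acquire n_images within the given time."""
    return n_images / (minutes * 60)


low = images_per_plate(sites_per_well=4, channels=4)   # lower end of "four to five"
high = images_per_plate(sites_per_well=5, channels=5)  # upper end of "four to five"
print(low, high)
print(min_images_per_second(low, 10), min_images_per_second(high, 10))
```

Under these assumptions a plate holds roughly 6,100 to 9,600 images, so finishing in under ten minutes implies an average of more than about 10 to 16 images per second.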
Carli:That sounds incredible, especially as a scientist myself, imagining a plate being read in less than ten minutes sounds glorious, honestly. So it really sounds like, if we were to take a step back and think about how this dataset was generated and the timeline in which it was generated, because I think that's the key part here, it's something that could have been generated otherwise, but never as quickly as it was with the Endeavor. So it seems like, from a technology or an infrastructure standpoint, the Endeavor is very promising for cell painting. Would you agree with that?
Naina:Oh, definitely. I think this is just one of the data sets that we've consistently delivered to customers in short timelines, especially when, you know, if you imagine a lab-in-the-loop scenario, you have things that you want to test and then we generate the data and then the partners come back to us with modifications of what they want to test. So having that sort of turnaround time of a week from designing the experiment to actually having that primary data has been really helpful, and the Endeavor has been key to getting to that speed and scale.
Carli:And I'm also assuming this particular stage of the generation of the data set is where some of Ginkgo's internal muscle really showed up. You folks are very strong when it comes to automation, the biofoundry infrastructure and even the data engineering. So I was wondering, did that play a role in pushing the boundaries of what you folks were doing with this dataset?
Naina:Definitely. So especially coming from an academic background where I didn't have this sort of infrastructure to play with, this was just definitely eye opening and a huge growth experience. So even if the cell painting assay was new to us, the existing infrastructure that we had for automating all of our assays really helped drive this through pretty quickly. And what I really have to mention is the support that we have from our in house LIMS system. As soon as we design the experiment, we can ensure that all of the metadata is tracked through all the different stages, from cell seeding to perturbation to image data collection and analysis. Our metadata is preserved and, as an experimentalist, it's something that I don't even have to think about or worry about.
Naina:So that's been a great benefit to being under that Ginkgo umbrella and having such great infrastructure to work with. We also had lots of help from the scientific computing team so that the data was just seamlessly uploaded after acquisition. So you acquire the plate and then you don't have to really think about the data until the data analysis is complete. Then you have images that have passed through QC that you can then look at for more downstream analysis. Without all of these things, I don't think we would have reached the scale we have at this point.
Naina:And for that, I'm really grateful.
Carli:It's been such a pleasure to be able to witness how you folks have been able to enable this sort of technology and really being able to witness how everything can come together to create such an amazing tool that not only scientists are going to be able to use, but even beyond. You folks are certainly pushing the boundary. So I cannot help but think about where do we go from here? Is there a phase two coming from the GDPx3 dataset, or are you folks thinking about an entirely new dataset now that the infrastructure is in place? Tell me a little bit about the future, what it holds for you folks.
Naina:Of course. Yeah, so it's a little bit of both actually, Carli. So we have plans of doing our own analysis to integrate the GDPx2 and x3 data, and that should hopefully come out in the next couple of months. So there's interest from other groups as well as ours to combine transcriptomics and phenotypic high content imaging sort of data with metabolomic data, as well as data from ADME studies. Ginkgo Datapoints recently released an ADME service, so it would be interesting to see how we can synergise with those offerings as well.
Naina:And what GDPx3 was really about was chemical perturbations. We were looking at a small molecule library. Something that we have been developing in house and have data for customers, but don't really have a public data drop for, is how genetic perturbations affect the transcriptome and the phenotypic space, and that's something that we're looking forward to sharing with the community very soon.
Carli:Of course. And for folks who are particularly excited and want to start exploring the GDPx3 dataset right now, where should they go to be able to find this dataset?
Naina:Of course, so our datasets are available either by downloading directly from our website or through Hugging Face. And while people download the data sets from either source, they reach out to us with questions that we are very happy to answer. And they also reach out to us with collaboration or partnership requests. I would say keep that coming. If you're more interested, reach out through our website.
Naina:We also have a LinkedIn page where we constantly talk about what's happening in the Ginkgo Datapoints space and how, you know, partners can collaborate with us. So that's definitely a space to look out for, and definitely just reach out to anyone on the team. The Datapoints website has information about people in the team. So that would be a great resource as well. And of course, happy to answer any questions if people want to reach out to me directly as well.
Carli:Wonderful.
Carli:That sounds incredibly amazing. And that basically concludes our interview for the day. It was amazing to have you, Naina. Thank you again so much for giving us a glimpse of what is happening behind the curtain with the generation of the GDPx3 and beyond, and also giving us a little bit of a taste of what we should be expecting from Ginkgo Datapoints in the future. It seems like there are a lot of great things on the horizon and this certainly isn't just a data set, it's a whole new level of biological context.
Carli:And I for one cannot wait to see what you folks discover and how you folks make it possible. And hopefully we will certainly cross paths in the future. I would love to have our listeners learn more about the other data sets that you folks are generating. So it was truly, truly a pleasure, Naina.
Naina:Thank you so much, Carli. I think it's just like a holdover from academia: I always love talking about the science that I'm doing. And Ginkgo Datapoints is a very open team that's really interested in collaborating and promoting open science as well. So happy to talk to anyone who would be interested and thank you for having us on your platform.
Carli:That's it for today's episode of Science in Real Time. A big thank you to Dr. Naina Kurup for sharing her insight into the GDPx3 dataset and the innovation behind it. If you enjoyed this conversation, don't forget to follow, subscribe, and hit the notification bell so you'll be the first one to hear about our latest episodes. Until next time, I'm Carli Reyes, thanks for listening.