Eric Nielsen: One of the most exciting things is when you get a chance to visit one of the LCFs in person and you see one of these systems on the floor. It's as far as the eye can see in the machine room. And when you look out over a system like that and realize that you've put together a piece of software that can effectively leverage that beast, after so many people have invested so much research and time designing that system, getting it on the floor, getting it up and running, and then we can actually bring it to bear on real science and engineering problems, that's really what it's all about.

(HOPEFUL PIANO MUSIC PLAYS)

Sarah Harman: Welcome to Direct Current. I'm your host, Sarah Harman. And that was Eric Nielsen, a senior research scientist at NASA's Langley Research Center. He's talking about working with Frontier, one of three exascale supercomputers in existence. At the time of that recording, Frontier was the fastest of the exascale supercomputers. Today, it's the second-fastest publicly reported computer in the world, topped in November 2024 by El Capitan at the Department of Energy's Lawrence Livermore National Laboratory. Technological advances are moving quickly in this space thanks to the incredible work being done at DOE's national labs.

(DIRECT CURRENT THEME PLAYS)

(PEACEFUL SYNTHESIZED MUSIC FADES IN)

Sarah Harman: After more than a decade of planning and work, these powerful computers are now enabling groundbreaking research that wouldn't be possible otherwise. With research in cancer, biofuels, space exploration, and more, exascale computing is bringing scientific discovery to new heights.

Sarah Harman: But what exactly is exascale? Here's Lori Diachin, the director of the Exascale Computing Project at Lawrence Livermore National Lab, to explain.

Lori Diachin: First, just to define exascale: that's 10 to the 18th floating point operations, so adds or multiplies, per second. And so if you can think about that, that's a billion billion operations that can be done every single second on an exascale machine.

Sarah Harman: To put that in context, a billion billion operations is as if every single person in the world completed one math problem a second for nearly five years straight. Scientists use the term exaflop to refer to this level of power.
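Sarah's comparison checks out with some quick arithmetic. Here is a back-of-the-envelope sketch in Python; the world population of roughly seven billion is an assumption for illustration, not a figure from the episode:

```python
# Back-of-the-envelope check: how long would 10^18 operations take
# if every person on Earth did one operation per second?
# (The ~7 billion population is an assumption for this sketch.)

EXAFLOP = 10**18            # operations per second on an exascale machine
population = 7_000_000_000  # people, each doing one operation per second
seconds_per_year = 60 * 60 * 24 * 365.25

seconds_needed = EXAFLOP / population
years_needed = seconds_needed / seconds_per_year
print(f"{years_needed:.1f} years")  # ~4.5 years, i.e. "nearly five years"
```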
Sarah Harman: But the Department of Energy didn't support the development of these machines to have power for the sake of power. Three exascale computers currently exist in the United States: one at the Oak Ridge Leadership Computing Facility and one at the Argonne Leadership Computing Facility, both of which are DOE Office of Science user facilities, and finally one at the Livermore Computing Complex. We developed them to help solve some of the world's greatest challenges. Lois Curfman McInnes, a senior computational scientist in the Mathematics and Computer Science Division of DOE's Argonne National Laboratory, discussed how exascale has changed her work.

Lois Curfman McInnes: Exascale is enabling numerical computing teams to tackle problems that otherwise would be impossible, because exascale enables such a dramatic increase in power and performance.

Lois Curfman McInnes: And together, all of those enable our community to address a whole host of scientific applications that are not only mission critical for the Department of Energy but also enable the broader industrial and research communities to tackle many, many issues.

(MUSIC FADES OUT)

Sarah Harman: So, scientists submit their ideas for projects to run on the computers. A team of experts then chooses the most urgent ones that will use the computers most effectively. These projects take advantage of an exascale computer's ability to run huge simulations and analyze massive amounts of data.

Sarah Harman: Exascale computers are particularly well suited for artificial intelligence and machine learning. One of the areas where artificial intelligence has been essential is research on cancer and pandemics. Heidi Hanson is the group lead of Biostatistics and Biomedical Informatics at DOE's Oak Ridge National Laboratory. She's using exascale computing on several projects. One is extracting information from medical records to improve cancer surveillance. Another is developing models to identify sensitive or vulnerable populations in future pandemics.

Heidi Hanson: What we do is we train our own models from scratch using a large number of reports. We have over 9 million pathology reports that we can use to train our own large language models that are unique to the type of language we see within medical reports. And what we find is that if you use that type of model, trained on text specific to the field, you do much better than if you were to pull a ChatGPT model and do the same task, because the underlying data that goes into that model is more general.

(SERIOUS ELECTRONIC MUSIC FADES IN)

Sarah Harman: These types of models, including ChatGPT, work by looking for patterns in text and then using those patterns to predict what words come next. While chatbots are trained on the entire internet, researchers must be much more selective. These huge numbers of reports ensure that the large language models have enough of this specialized data to provide useful, accurate information to the researchers.
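To make that description concrete, here is a deliberately tiny sketch of "find patterns in text, then predict the next word." This toy bigram counter is nothing like the exascale-trained models Heidi describes, and the two one-line corpora are invented, but it shows why a model trained on medical text completes a word like "cell" differently than one trained on general text:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    # Record which word follows which: the simplest "pattern in text."
    model = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        model[current][nxt] += 1
    return model

def predict_next(model, word):
    # Predict the follower seen most often during training.
    followers = model[word.lower()]
    return followers.most_common(1)[0][0] if followers else None

# Invented one-line stand-ins for two very different training corpora.
general_text = "the cell phone rang and the cell phone died"
pathology_text = "the cell margins are clear and the cell nuclei are enlarged"

print(predict_next(train_bigram(general_text), "cell"))    # -> phone
print(predict_next(train_bigram(pathology_text), "cell"))  # -> margins
```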
Heidi Hanson: We really are trying to make sure that what we're developing is representative of the U.S. population and will benefit all populations equally. You cannot do that without data, and you cannot process or create models from that amount of data without something like exascale computing.

Heidi Hanson: Without a lot of compute power, what you're talking about is training a model over, I don't know, 10 years, depending on the size of the model, and we really don't want to spend 10 years waiting for our model to complete its training process. So we are using exascale to speed up that training. Exascale allows us to be on the frontier of AI, develop these large language models, and train them relatively fast, so that we can provide the tools we need to make decisions about diseases spreading through the community in near real time, as opposed to years after we are able to work with the data.

Sarah Harman: These types of models allow scientists to learn how diseases are affecting different groups in real time. That real-time information can help public health experts make informed decisions and focus their efforts. Exascale also enables researchers to better protect patients' privacy.

Heidi Hanson: I want to make sure that I am absolutely protecting information about patients and individuals' health trajectories at the highest possible levels of security. Finding places where you can have the power of compute that is available here at Oak Ridge National Laboratory, with our exascale computing facilities, and the privacy needed to process this type of sensitive information is very difficult. So being in a space where I can do both is absolutely fantastic, and it really allows us to innovate in the health space in a way that nobody else can.

(MUSIC FADES OUT)

Sarah Harman: Another area where exascale computing is essential is developing and using cutting-edge biofuels for clean power, heat, and transportation. Jacqueline Chen is a senior scientist at the Combustion Research Facility at Sandia National Laboratories.

Jacqueline Chen: Exascale computing has really enabled us to probe the details, the nuances, of the fuel chemistry of these alternative fuels when you couple it with turbulent mixing and pressure effects. Many of these devices operate at high pressure, where these interactions differ a lot from what happens at the ambient pressures where fuels are often tested. It's hard to run tests at very high pressures because of the limited optical access.

Jacqueline Chen: And alternative fuels like hydrogen and ammonia have very different chemical and physical properties from the more conventional fossil fuels. Those nuances in the chemistry differentiate the fuels' behavior as you go to larger scales. You need to drill down and understand the small-scale physics and their interactions in order to develop coarse-grained predictive models that will actually work to help design and optimize devices that utilize these alternative fuels.

Sarah Harman: Jacqueline has been studying computational combustion and reacting flow to improve efficiency for more than 25 years. So she's seen how computing has evolved over time.

Jacqueline Chen: I'm really excited, because I think we're at a stage where there is this growing diversity of fuels that will replace or reduce our dependence on conventional fossil fuels. And, you know, finding a solution to enable the clean energy transition is enormously important, and we're on a very stringent, tight timeline to get there. So I think computation is an essential tool to allow us to do this.

(DRAMATIC STRING MUSIC FADES IN)

Sarah Harman: Not all research that uses exascale computing is limited to Earth. Scientists at NASA are using exascale computing to accomplish a goal that's at the heart of many science-fiction stories: bringing humans to Mars. Here's Ashley Korzun, an aerospace engineer at NASA's Langley Research Center.

Ashley Korzun: Right now, we use parachutes to land on Mars, but we want to land bigger and bigger things. We want to do more science. We eventually want to send people to the surface. You're not going to use a parachute; you're going to use rocket engines instead. So a lot of work in that area is to make sure that's going to be a technology we're ready to implement and fly.

Ashley Korzun: We have had an awesome opportunity for the last several years to utilize Department of Energy exascale computing resources, on both Summit and now Frontier, to work toward understanding what it takes to simulate vehicles that land on Mars. Why is exascale computing important for us to do this?
Ashley Korzun: These vehicles are the size of a two-story condo. They're just gigantic. Anywhere we want to go that isn't Earth, it's impossible for us to fully test those systems, either on the ground or by doing practice landings and flights here at Earth. We just can't fully understand or replicate the environments and conditions those systems will experience. So for us to have confidence that they're going to work the way we need them to work, we are increasingly reliant on simulation.

Sarah Harman: Even a simple model of such a spacecraft would be difficult, considering that it's a radically different design from previous missions. The combustion of the rocket engines causes the air to flow around the spacecraft very differently than it would with parachutes. But the aerospace engineers need much more than a static model. They need to model how the physics change as the craft moves. Exascale is allowing them to hook right into the flight mechanics software and run the mission trajectories on the supercomputer itself. That's something they've never been able to do before at this scale. So far, they've been able to run 30 seconds to a minute of real physical time in great detail.

Ashley Korzun: In theory, we could run some of these simulations on our own supercomputing resources. But it would take nine months, 12 months, or more of continuous running, whereas these are answers we can get within days to weeks utilizing exascale systems.

Sarah Harman: Eric Nielsen, who you heard at the beginning of the program, works with Ashley on this project.

Eric Nielsen: Exascale has been exciting because you know it is going to open doors, and already has opened doors, to things that even 10 years ago we never thought we might be able to tackle computationally.

Sarah Harman: The journey to build exascale computers and their software wasn't easy. About 15 years ago, exascale was just a twinkle in computer scientists' eyes. Al Geist was there at the very beginning. He was the chief technology officer of the Exascale Computing Project, a DOE initiative to develop the software and other aspects of an exascale computing ecosystem.

Al Geist: When we first started in 2008, 2009, there were a number of very important papers that came out where researchers had sat down and said, "Okay, we want to build an exascale computer. What would it take to do that?" And these papers said, "Oh, wait, there's a serious problem. It may be impossible to build an exascale computer."

(MUSIC FADES OUT)

Sarah Harman: There were four major challenges. One of the biggest was the vast amount of power that would be needed to run the computers. Here's Lori Diachin again.

Lori Diachin: If you took the power that it takes to run the supercomputers that existed at the time and projected that forward into the exascale era, it would have taken a nuclear power plant to run the supercomputer. And so that was just not going to be acceptable.

Sarah Harman: Besides the sheer amount of power needed, it would have racked up a $150 million electric bill.

Lori Diachin: As we were designing exascale computers, the goal we gave the vendors was to deliver an exaflop of performance within 20 megawatts of power. And again, the original projections were that it was going to take a gigawatt of power, so bringing that down to 20 megawatts was a huge challenge that needed to be overcome.
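The gap between those two numbers is easy to quantify. Here is a quick sketch using only the figures quoted above, roughly one gigawatt projected versus the 20-megawatt goal:

```python
# Rough arithmetic behind the exascale power goal, using the figures
# quoted in the episode (projection: ~1 gigawatt; goal: 20 megawatts).
EXAFLOP = 10**18        # operations per second
projected_watts = 1e9   # the original ~1 GW projection
goal_watts = 20e6       # the 20 MW target given to the vendors

print(f"Projected efficiency: {EXAFLOP / projected_watts / 1e9:.0f} GFLOPS/W")  # 1
print(f"Required efficiency:  {EXAFLOP / goal_watts / 1e9:.0f} GFLOPS/W")       # 50
print(f"Improvement needed:   {projected_watts / goal_watts:.0f}x")             # 50x
```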
Sarah Harman: The sheer scale of the computing system needed to reach exascale brought its own issues. Just getting enough processors working together was a challenge. Here's Al Geist again discussing this problem.

(TENSE MUSIC FADES IN)

Al Geist: So we did this calculation for exascale, and it said you would need a billion people, or a billion computers, all working together to solve the problem, to get to the scale of exascale. And we were thinking, "My goodness, we have no idea how to get a billion different tasks all working together without getting in each other's way."

Sarah Harman: Fortunately, one approach helped solve both of those problems in one fell swoop. Unfortunately, it brought another set of challenges.

Al Geist: Well, the biggest step that we made in energy was to start using something called graphical processing units, the GPUs of today. You sometimes see them in the news as what the big AI companies are using to train their new AI models, like ChatGPT. But it was really the Department of Energy that pioneered that work back in 2012, when we built the first big computer out of these graphical processing units, GPUs. And that saved about 10 times over what a regular computer chip takes, in terms of the amount of computing you can get for the same amount of energy. And the next factor of 10 we got by using multiple GPUs for each computing unit.

Al Geist: The parallelism, interestingly, we solved with the same technology: graphical processing units actually hide a lot of parallelism. Just one step in a graphical processor is like 1,000 computing units all doing the same thing at once. So if you were building a house, it would be like a line of a thousand people, each with a hammer, each one hitting a nail: "Okay, let's hit the nail again. Okay, all the nails are in, let's move to the next board." It hid a lot of the parallelism, so the user didn't have to figure out how to do billion-way parallelism. They only had to figure out how to do about hundred-thousand-way parallelism, which was still hard, but doable.
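Al's thousand-hammers picture is what programmers call data parallelism: one instruction applied to many data elements at once. Here is a minimal sketch of the idea, with NumPy's vectorized arithmetic standing in for a real GPU kernel:

```python
import numpy as np

# A thousand nails, one per "worker" in Al's analogy.
nails = np.zeros(1000)

# One hammer: hit each nail in turn (serial execution).
for i in range(nails.size):
    nails[i] += 1

# A thousand hammers: one instruction applied to every element at once.
# This is the data parallelism a GPU provides across its compute units;
# NumPy's vectorized add is only a stand-in for an actual GPU kernel.
nails += 1

print(nails[:3])  # every nail has now been struck twice -> [2. 2. 2.]
```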
Sarah Harman: Previously, graphical processing units, or GPUs, had mainly been used to display graphics for video games, not for general computing. These new GPU systems didn't work the same way as the older ones. So software developers like Lois had to take a whole new approach.

Lois Curfman McInnes: One of the key challenges in this technology transition has been to enable all of our teams to embrace the heterogeneity of the new exascale architectures, which combine CPUs and GPUs.

Lois Curfman McInnes: So that's in some ways been challenging, but also very exciting, because we as a community have found ways to move forward. We've looked very carefully at software design and built new technologies that let us layer on functionality, so that we can use these heterogeneous architectures in a performance-portable way.

(MUSIC FADES OUT)

Sarah Harman: Some of the other challenges are ones any computer user is used to, like unreliable parts, but on a much, much bigger scale.

Lois Curfman McInnes: There are always challenges that we face when standing up a system, around how the hardware is coming together and how the software stack is interacting. Often when these machines are stood up at the national laboratories, it's the first time they've been stood up at that scale by the computer vendors.

Sarah Harman: With the equivalent of a billion computers, there are inevitably going to be hardware failures. The engineers calculated that failures would happen faster than it would be possible to fix them. Not exactly practical. Here's Al again on how they solved that issue.

Al Geist: The way it was solved is that, when we were investing in getting the vendors to solve these four big challenges, we said, "We need the computer to keep running even if it's got failures inside of it." So as it's running along, if something goes bad, you simply detect that it went bad, you mark it off the radar, you're not going to use that part that's broken, and you just continue running fine with whatever's left. That part gets fixed and replaced a few days later and incorporated back in. So even though there are constant failures inside the machine, it is able to continue running, and you can continue to solve problems.
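The "mark it off the radar and keep running" strategy Al describes is, in miniature, a work queue that detects failed workers and hands their tasks to the survivors. Here is a toy sketch of that pattern; the Worker class and the 10 percent failure rate are invented for illustration:

```python
import random

class Worker:
    """Toy stand-in for a compute node that can fail mid-job."""
    def __init__(self, name):
        self.name, self.healthy = name, True
    def run(self, task):
        if random.random() < 0.1:   # simulated hardware failure
            self.healthy = False
            raise RuntimeError(f"{self.name} failed")
        return task ** 2            # stand-in for real work

workers = [Worker(f"node{i}") for i in range(8)]
tasks, results = list(range(20)), {}

while tasks:
    task = tasks.pop()
    pool = [w for w in workers if w.healthy]
    if not pool:
        raise SystemExit("no healthy nodes left")
    try:
        results[task] = random.choice(pool).run(task)
    except RuntimeError:
        tasks.append(task)  # the failed node is marked off; the task is re-queued

print(f"{len(results)} tasks done on {sum(w.healthy for w in workers)} surviving nodes")
```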
(UPLIFTING PIANO MUSIC FADES IN)

Sarah Harman: These issues, along with other challenges, required huge investments to tackle. Enter the Exascale Computing Project. Starting in 2016 and running for seven years, DOE provided $1.8 billion to make exascale possible. While the hardware itself was substantial, at about $600 million per computer, exascale also required an entire ecosystem of support around it. The Exascale Computing Project created that framework.

Sarah Harman: And in 2022, the promise of exascale was fulfilled. The first exascale computer was Frontier, at the Oak Ridge Leadership Computing Facility. Aurora followed in 2023 at the Argonne Leadership Computing Facility. In 2024, El Capitan at the Livermore Computing Complex topped the TOP500 supercomputing list, bumping Frontier and Aurora to the No. 2 and No. 3 spots. As of fall 2024, DOE's national laboratories are home to three of the top 10 most powerful supercomputers in the world.

Sarah Harman: But the advances of exascale aren't limited to DOE's national laboratories. They're changing the whole world of computing. GPUs, the graphics processing units mentioned earlier that are used in gaming systems and AI, are becoming more common in all sorts of computers. They could help save energy in everything from huge data centers to laptops. Here's Lois again:

Lois Curfman McInnes: It has enabled our community to develop a set of reusable software technologies, part of our software stack called E4S, or the Extreme-scale Scientific Software Stack, that then enables others, for example, people in industry and university settings, to leverage our software technologies so that they, too, can adapt to heterogeneity in architecture.

Sarah Harman: Contrary to the stereotypical image of a lone scientist toiling away late in the lab, none of this could happen without a huge team of dedicated people. Thanks to exascale computing, people in seemingly unrelated fields of science are collaborating. Here's Heidi:

(MUSIC FADES OUT)

Heidi Hanson: The reason working at DOE is so cool is that we're working next door to the climate modelers, right? And so we run into all these problems all the time in terms of, what should we be thinking about? How do we model this? And you can talk to somebody in an entirely different application space who may give you insight into what's happening, and you may be able to change your research because of that. So not only do we have the compute infrastructure here, we have the diversity of scientific thought. That's really, really cool.

(UPBEAT MUSIC FADES IN)

Sarah Harman: Exascale didn't just move forward the next generation of computing; it helped train the next generation of scientists as well. As director of the Exascale Computing Project, Lori said this was one of her favorite parts of the entire project.

Lori Diachin: We trained well over 2,000 scientists to be able to take advantage of these kinds of architectures, which I think is preparing us, the Department of Energy and the nation, to better leverage these kinds of systems for the next decade and beyond.

Lori Diachin: It was just a phenomenal project, and I'm so happy and proud that I was a part of it.

Sarah Harman: We're still at the beginning of exascale computing's journey. While scientists have already made fantastic progress, there's so much more that can be done with the current and upcoming exascale computers. And DOE engineers and computer scientists are already looking beyond exascale, at both the next generation of conventional computers and the untapped potential of quantum computing. We here at the Department of Energy are committed to supporting unique, world-class facilities that enable us to find the best solutions to our world's biggest challenges.

(MUSIC FADES OUT)

(OUTRO MUSIC PLAYS)

Sarah Harman: That's it for another episode of Direct Current! Thank you to our guests Al Geist, Lori Diachin, Heidi Hanson, Lois Curfman McInnes, Eric Nielsen, Ashley Korzun, and Jacqueline Chen for lending their expertise.

Sarah Harman: If you want to learn more about exascale computing, as well as see animations from the computer simulations, check out our show notes. You can find those, along with our other episodes, at energy.gov/podcast.

Sarah Harman: Direct Current and our episode artwork are produced by me, Sarah Harman. This episode was written by Shannon Shea. Music and sound editing by Michael Stewart.

Sarah Harman: This is a production of the U.S. Department of Energy, published from our nation's capital in Washington, D.C. Thanks for listening!

(MUSIC FADES OUT)