HPE news. Tech insights. World-class innovations. We take you straight to the source — interviewing tech's foremost thought leaders and change-makers who are propelling businesses and industries forward.
Aubrey Lovell (00:10):
Hello and welcome back to Technology Now, a weekly show from Hewlett Packard Enterprise where we take what's happening in the world and explore how it's changing the way organizations are using technology. We're your hosts, Aubrey Lovell.
Michael Bird (00:22):
And Michael Bird. So today, we're looking at an area of tech that doesn't necessarily get the attention it deserves, but is going to become increasingly important over time: network fabrics for artificial intelligence. Now, as AI continues to evolve and develop, so do the scale and complexity of the workloads it is expected to handle. The network fabric is the backbone holding everything together: the storage, the compute, and the users. That's a very important job. Users increasingly expect AI to be quick, responsive, and reliable. Making that possible whilst also improving efficiency and making the most of available resources is a challenge, to say the least. So what can be done to improve this backbone of AI, is AI able to help improve itself, and what does this all mean for our organizations?
Aubrey Lovell (01:19):
Very, very good questions indeed. And if you're the kind of person who needs to know why what's going on in the world matters to your organization, this podcast is for you. And if you haven't yet, subscribe to your podcast app of choice so you don't miss out. All right, Michael, I'm excited for this one. Are you ready?
Michael Bird (01:35):
Yes. Let's do it.
Aubrey Lovell (01:38):
AI is complex and expensive. A report from a16z, which we've linked to in the show notes, suggests that for large AI companies, up to 80% of their capital expenditure is on compute resources. Then there's an article by Research and Markets, which we've also linked to in the show notes. They suggest that the global market for data center accelerators, essentially tools and equipment for making data centers more efficient and effective, sat at $45.1 billion in 2023 and is expected to grow to $351.5 billion by 2030. This suggests that there's a real need to make the infrastructure and networks on which AI runs more efficient: in energy usage, in how they manage workloads, and in how they move data around.
(02:24):
So how do we do it? Puneet Sharma is director of the Networking and Distributed Systems Lab at Hewlett Packard Labs and has been a part of extensive research into the many possibilities and capabilities of these technological advancements. So Puneet, to start, when we talk about networking in AI and ML, we're not just talking about patch cables, right? What do we mean by network fabrics?
Puneet Sharma (02:48):
It is very much like a patch cable, but you would need a superhero with superpowers to be able to patch these cables billions of times every second, taking the outputs from millions of GPUs and sending them across. There are memory limitations, compute limitations, and so on, so how do you do that? You literally do a programmable patchwork, creating these fabrics and connecting the GPUs together.
(03:14):
The other thing is, when you orchestrate these workloads, they have to work in nice, synchronized steps, so any stagger will slow down the whole process. You have to have not only low latency, you have to have very low tail latencies. And then we are talking about models which are taking years to train, so now you have to have a system which is resilient, because GPUs are going to fail and you don't want to lose days' worth of your training. You have to have your networks resilient so that the model training continues.
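The resiliency Puneet describes usually comes down to checkpointing: periodically persisting the training state so that a failed GPU or link costs minutes of redone work rather than days. Here is a minimal sketch, assuming PyTorch and a single writer; the path, interval, and helper names are illustrative, not anything from the episode.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"   # illustrative location
CKPT_EVERY = 500              # illustrative interval, in training steps

def save_checkpoint(step, model, optimizer):
    """Write the training state atomically so a crash mid-write can never
    corrupt the last good checkpoint."""
    tmp = CKPT_PATH + ".tmp"
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, tmp)
    os.replace(tmp, CKPT_PATH)

def maybe_resume(model, optimizer):
    """Return the step to resume from: 0 on a fresh start, otherwise the
    step after the one recorded in the most recent checkpoint."""
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

# Sketch of the surrounding loop:
#   start = maybe_resume(model, optimizer)
#   for step in range(start, total_steps):
#       ...forward, backward, optimizer.step()...
#       if step % CKPT_EVERY == 0:
#           save_checkpoint(step, model, optimizer)
```

Large-scale systems layer distributed and asynchronous checkpointing on top of this, but the recovery contract is the same: resume from the last good step rather than from scratch.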
Aubrey Lovell (03:50):
This isn't just about getting an AI system to transfer more data between nodes, though.
Puneet Sharma (03:55):
You are absolutely right, because we are basically talking about an AI system which has the compute, right, these GPUs across which the model training is being distributed. You have to exchange data periodically. So now you want them to work in tandem. You have to balance the compute and communication so that you can find the optimized way of making them run.
(04:22):
So in some ways, the answer to your question about whether it is just data between nodes is yes, because the sheer volume of data is huge. We're talking about thousands of GPUs working in tandem with all these different kinds of parallelization strategies. But given the fact that there is also structure to the communication, there is an opportunity to overlap the compute with the communication.
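As a rough illustration of overlapping compute with communication, here is a sketch using asynchronous collectives, assuming PyTorch's torch.distributed with a process group already initialised; the helper names are made up for this example. Production frameworks go further, hooking bucketed all-reduces into the backward pass itself, but the idea is the same: issue the collective, keep computing, and only wait when you actually need the result.

```python
import torch.distributed as dist

def allreduce_gradients_async(model):
    """Issue one asynchronous all-reduce per gradient tensor and return the
    work handles. Because async_op=True does not block, the caller can keep
    doing useful compute while these collectives are in flight."""
    handles = []
    for p in model.parameters():
        if p.grad is not None:
            handles.append(dist.all_reduce(p.grad, async_op=True))
    return handles

def wait_and_average(model, handles):
    """Block on the outstanding collectives, then turn the summed gradients
    into an average across the data-parallel ranks."""
    world_size = dist.get_world_size()
    for h in handles:
        h.wait()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(world_size)
```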
Michael Bird (04:49):
So Puneet, traditional HPC solutions use networking. Why haven't AI-specific workloads been treated in the same way so far?
Puneet Sharma (04:58):
The collective message sizes in the HPC workloads were smaller. We're talking about, let's say, 1MB messages and so on. But as you look at the AI workloads and how they're distributed, given that volume, the number of parameters and the size of the matrices that these GPUs are computing, every time you do that collective communication between the set of GPUs which need to exchange state, these message sizes are actually much larger. So instead of 1MB per second, we are talking about 64 to 56MB. So that's issue one: the message sizes are larger.
(05:37):
The frequency of this collective communication has changed as well. And similarly, because of the synchrony which is needed between compute and communication, these [inaudible 00:05:50] and the tail latencies have become important. So now, yes, you can use a lot of the learnings we had from HPC and building networks for HPC, but these networks have to be really resilient. Otherwise, if you lose your network and you lose your GPUs, then you basically have to go back multiple days and lose all that valuable work that was done.
(06:17):
And then there is a flip side to all this also, right? These AI workloads, and this is the best part for somebody who looks at structures and patterns in networks, are actually very structured flows. So what that means is that now that you have these structures for the communication between different nodes and GPUs, there is also the opportunity to optimize these runtimes and how you will run your communication.
(06:48):
One other big change from HPC to AI has been that we are moving from this era of CPU-based computation to GPU-based computation. And there is also the connectivity, direct GPU-to-GPU connectivity, both within a node, between all the GPUs in a single server, and out to external nodes. So there are all these other challenges which have come up that, at least in the research lab, we enjoy a lot.
Aubrey Lovell (07:17):
There are different sides to AI. It's not just question in, answer out. There's obviously training as well as inference. What difference does that make?
Puneet Sharma (07:25):
So inference is actually one of the really evolutionary terms when it comes to AI workloads, right? Everybody has been talking about generative AI. We go to these large language models, we ask them questions, and they give us interesting answers. But we have moved on to taking all these large language models that we were training, having these discussions or interactive communication with them, treating them as humans, or expecting human kinds of answers from them, so it's a very different kind of workload.
(08:02):
It's also multimodal. Right now, you have seen generative AI services where I'm doing text-to-image. So now, not only do you have to look at the LLMs trying to interpret the query, you also have to build your image models and so on. And again, as the parameter space for these models has grown, it's not just the training which is taking a much longer time, the inference can take a much longer time too. And not just that, because there is interaction with users, with human beings, you now have to do it in real time as well. Because once I ask the query, I'm not going to wait for a month or two to get my responses back.
(08:43):
And then I guess one of the other things that you definitely would've heard of is what is called hallucination, right? These models are interpreting things and responding in a different manner. As you increase the context size or the number of tokens in these inferences, which is how generative AI services are pricing these inference services for you, the expectation is that the hallucinations will reduce, right? So now again, larger models, more memory, more data communication, and that of course has an impact on the network fabrics themselves. How fast can you actually do that?
(09:21):
The other big difference between training and inference is that now you will have multiple users or tenants interacting with this service with multiple requests. So unlike the training infrastructure, with the compute and communication and the nice AI network fabrics that we're trying to build, here you have to take care of the fact that all these different requests might be coming onto this shared infrastructure. So now you have, in some ways, another challenge: what happens to the contention that might occur if all these requests and users come at the same time? Or how do I make sure that the serving of this inference is happening without a significant amount of delay, so that users don't leave?
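A toy sketch of that shared-infrastructure problem, using only the Python standard library: requests from many users land in one queue and are served in small batches with a bounded wait, which is one common way to keep tail latency in check when everyone shows up at once. The batch size, timeout, and function names below are illustrative, not anything HPE-specific.

```python
import queue
import threading
import time

MAX_BATCH = 8        # illustrative: requests served by one forward pass
MAX_WAIT_S = 0.02    # illustrative: cap on how long early arrivals wait

requests = queue.Queue()   # shared by every user and tenant

def serve_forever(run_model_batch):
    """Pull requests off the shared queue, group them into small batches,
    run the model once per batch, and hand each caller its own result."""
    while True:
        batch = [requests.get()]                    # block for the first one
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = run_model_batch([prompt for prompt, _ in batch])
        for (_, result_box), out in zip(batch, outputs):
            result_box.append(out)                  # crude per-request "future"

def submit(prompt):
    """Called by each user; blocks until its own answer is ready."""
    result_box = []
    requests.put((prompt, result_box))
    while not result_box:
        time.sleep(0.001)
    return result_box[0]

# Usage sketch: run the server in a background thread with a stub model.
# threading.Thread(target=serve_forever,
#                  args=(lambda prompts: [p.upper() for p in prompts],),
#                  daemon=True).start()
```

Real inference servers add admission control, priorities per tenant, and proper futures, but the core trade-off is the same: batching improves GPU utilisation while the wait bound keeps individual requests from stalling.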
Michael Bird (10:08):
What exactly is Hewlett Packard Labs examining to solve these challenges?
Puneet Sharma (10:13):
We're basically doing explorations along roughly three dimensions. One indeed is how do you make the network fabric for AI more optimal? Looking at the collectives. So what are the different network technologies? What kind of topologies will have better behaviors? And then also for inference, as I was describing, what do you do if the infrastructure is shared?
(10:38):
Similarly, we are also looking at the different parallelization strategies. Not looking at it just from the communication perspective, but, given an infrastructure, what kind of parallelization can I bring in? Should it be data parallel? How much data parallelism should there be? How many shards should I use, how many GPUs, and so on? And depending on the infrastructure that is there, sometimes it's very difficult to figure out what the strategy or the parameters to use there are. So one of the big pushes we have is to take these extreme-scale AI workloads and simulate them. Because even before we run these actual workloads on real infrastructure, we need to be able to optimize these algorithms and figure out what our parallelization strategy will be.
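To give a flavour of the kind of quantity such a simulation would weigh up before touching real hardware, here is a back-of-the-envelope estimator for pure data parallelism, assuming the textbook ring all-reduce cost of roughly 2·(N−1)/N times the model size per step; all the figures below are illustrative, not from the episode.

```python
def data_parallel_bytes_per_step(num_params, bytes_per_param=2, num_gpus=1024):
    """Rough bytes each GPU sends per training step for a ring all-reduce of
    the gradients under pure data parallelism: ~2 * (N-1)/N * model size."""
    model_bytes = num_params * bytes_per_param
    return 2 * (num_gpus - 1) / num_gpus * model_bytes

def steps_per_second_upper_bound(num_params, bytes_per_param=2,
                                 num_gpus=1024, link_gbps=400):
    """Upper bound on steps/second if each step were purely bounded by one
    NIC per GPU (ignores compute, overlap, and topology entirely)."""
    link_bytes_per_s = link_gbps * 1e9 / 8
    return link_bytes_per_s / data_parallel_bytes_per_step(
        num_params, bytes_per_param, num_gpus)

# Illustrative example: a 7-billion-parameter model in fp16 on 1,024 GPUs
# with 400 Gb/s links.
if __name__ == "__main__":
    gb = data_parallel_bytes_per_step(7e9) / 1e9
    print(f"~{gb:.0f} GB per GPU per step")                 # ~28 GB
    print(f"~{steps_per_second_upper_bound(7e9):.1f} steps/s, comm-bound")  # ~1.8
```

Even this crude bound makes the trade-off visible: adding data-parallel replicas barely changes the bytes each GPU must move, so past a point you need tensor or pipeline parallelism instead, which is exactly the kind of decision a simulator can explore cheaply before any real GPUs are committed.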
(11:37):
The last dimension is what I call matchmaking, right? Orchestration meets infrastructure. So you have your workloads that you have to deploy on your given infrastructure, so how do you optimize it for various things, whether it is time to train or energy consumed? These are extremely energy-consuming things that we're talking about, both in terms of training and inference.
Aubrey Lovell (12:02):
Thanks, Puneet. This is such important work and we can't wait to hear more about it. So don't go anywhere because we'll be right back with you in just a moment.
(12:12):
Okay, now it's time for Today I Learned, the part of the show where we take a look at something happening in the world we think you should know about. Michael, what do you have for us today?
Michael Bird (12:21):
Well, today we are in Boston where a team are looking into ways to terraform Mars using specially bioengineered microbes. Wow. Now, making Mars a more Earth-like environment has been a dream of various scientists, engineers, and of course, sci-fi fanatics for decades. In the 1920s, ideas were put forward to develop giant space mirrors to warm the planet's surface. More recently, a certain space-loving billionaire casually mentioned detonating nuclear bombs on the planet. Fortunately, or maybe unfortunately, neither option is viable in the near future. Instead, a team has begun looking into finding the most extreme forms of life on earth and combining their genomic code in a way which would make Mars a pleasant environment for them to thrive, produce oxygen, and eventually form an atmosphere.
(13:13):
Life exists pretty much everywhere on Earth, from environments with extreme radiation to extreme cold and heat to very dry climates. The problem is, Mars has all of those problems and more. To that end, the team of researchers are beginning to look at how biologically engineered lifeforms could combine all of these incredible traits. They've created a Mars-like lab environment for their tests and are in the early stages of their engineering project. There are blockers, however, not just scientific, but legal. Biological contamination is banned in space under the Outer Space Treaty, as well as by NASA policies. And of course, we want to make sure Mars absolutely doesn't have life before we start dropping in our own. So whilst it's an interesting experiment, Mars is likely to remain barren at least for the near future. Still, pretty cool though.
Aubrey Lovell (14:07):
Indeed it is, and sounds like a lot of work, too. Very extra.
Michael Bird (14:10):
Very much so.
Aubrey Lovell (14:13):
It is. Thanks, Michael.
(14:17):
All right, so now it's time to return to our guest, Puneet Sharma, to talk about how we can overcome the challenges of AI networking. All right, Puneet, so what are the tangible benefits of making a really good, effective AI system that can compute and move data around in sync and even in parallel? How much energy are we talking about saving here?
Puneet Sharma (14:36):
The energy efficiency is of course big, right? So anything we can do to reduce the time that these GPUs are turned on can significantly impact the performance. Even if we can shave off 10% of the total time taken, we are doing better. There have been estimates on how much energy, let's say, a 1 trillion-parameter model actually takes: anywhere from 5,000 megawatt hours to 7,000 megawatt hours. That's roughly how much energy 10,000 households together spend in a day. So even if we can shave off 10% by making these things go faster, that's absolutely wonderful. And again, the savings come from going faster, or also from using some of the other techniques like reducing the power consumed by the networks themselves. So instead of using coax cables, why not use optics, which have better power consumption?
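For anyone doing the arithmetic along with Puneet, the back-of-the-envelope version looks like this; the figures are just the rough estimates quoted in the conversation, not measurements.

```python
# Rough saving from shaving 10% off a training run, using the quoted range.
TRAINING_ENERGY_MWH = (5_000, 7_000)   # quoted range for a ~1T-parameter model
SAVING_FRACTION = 0.10                 # "shave off 10%"

for total_mwh in TRAINING_ENERGY_MWH:
    saved = total_mwh * SAVING_FRACTION
    print(f"{total_mwh} MWh training run -> {saved:.0f} MWh saved")
# 5000 MWh training run -> 500 MWh saved
# 7000 MWh training run -> 700 MWh saved
```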
Michael Bird (15:38):
So a whole load of inference is now fed via the cloud, so that adds a whole different tech stack. Is that a bit of a challenge?
Puneet Sharma (15:46):
Yes, I think that is indeed a challenge, and different networking technologies from different vendors have to be in there. We also have legacy data centers that have to be able to serve these workloads, just to be price-conscious. These are interesting challenges and our lab is looking at some of this. So we are using things like reinforcement learning to be able to come up with better control decisions. We are working with our partners on building digital twins for HPC compute workloads so that you can do facilities management, cooling, heating, and not just the workload orchestration. It's not just about managing the technical infrastructure of compute and communication, but all that surrounds these systems. Where is the power coming from? It's not just the amount of energy you will consume. You can even optimize and orchestrate these workloads to the places where the energy is being sourced from cleaner resources.
Aubrey Lovell (16:52):
Thank you so much, Puneet. It's been a great conversation, and you can find more on the topics discussed in today's episode in the show notes. Right, we're getting towards the end of the show, which means it's time for This Week in History, a look at monumental events in the world of business and technology which have changed our lives. And the clue last week was: it's 1956, let me type that up for you. And I have to brag, Michael, I actually got this one.
Michael Bird (17:18):
Did you?
Aubrey Lovell (17:18):
But please, let's take our audience through it.
Michael Bird (17:21):
Well, I didn't get it, but it was of course the first time a computer and a keyboard were connected. The Whirlwind computer was a vacuum-tube computer built for the US Navy in 1951. Now, it was pretty revolutionary for the time, being among the first digital electronic computers to give real-time outputs, the first to compute and use its bits in parallel, the first to feature a graphical interface, and the first to use magnetic memory. Wow. The system had 5,000 vacuum tubes, that's a lot of vacuum tubes, 1KB of memory, and was built to run a flight simulator for training bomber crews. Wow. That meant it needed to be able to adapt its parameters on the fly, not just run one program from end to end.
(18:07):
Now, over time, its mission evolved to be a test bed for high-speed computing, which meant that in 1956, it was plugged into the first ever computer keyboard, oh my goodness, which allowed data entry far more quickly than through dials or punch cards. It'll never catch on, eh?
Aubrey Lovell (18:27):
And next week, the clue is since 2014, this day is sure to put a smile on your face. Do you know what that is? I don't know what that is. That's interesting.
Michael Bird (18:36):
I haven't got a clue. My goodness.
Aubrey Lovell (18:39):
Well, if you do know what it is to our listeners, keep it to yourself and we'll talk about it next episode.
Michael Bird (18:45):
Right. Well, that brings us to the end of Technology Now for this week. Thank you so much to our guest, Puneet Sharma, Director of the Networking and Distributed Systems Lab at Hewlett Packard Labs. And to you, of course. Thank you so much for joining us. Technology Now is hosted by Aubrey Lovell and myself, Michael Bird, and this episode was produced by Sam Datta-Paulin and Sydney Michelle Perry with production support from Harry Morton, Zoe Anderson, Alicia Kempson, Alison Paisley, Alyssa Mitri, Camilla Patel, and Chloe Sewell.
Aubrey Lovell (19:15):
Our social editorial team is Rebecca Wissinger, Judy Ann Goldman, Katie Guarino, and our social media designers are Alejandra Garcia, Carlos Alberto Suarez, and Ambar Maldonado.
Michael Bird (19:26):
Technology Now is a Lower Street production for Hewlett Packard Enterprise, and we'll see you the same time, same place next week. Cheers.