Technology Now

How are our networks designed to cope with the increasing demands of AI? This week, Technology Now dives into the topic of networking for AI, exploring how our networks have adapted and evolved to meet the ever-growing demands of modern-day AI infrastructure. Praful Lalchandani, VP of Networking Product Management, tells us more.

This is Technology Now, a weekly show from Hewlett Packard Enterprise. Every week, hosts Michael Bird and Sam Jarrell look at a story that's been making headlines, take a look at the technology behind it, and explain why it matters to organizations.

About Praful:
https://www.linkedin.com/in/prafullalchandani/

Sources:
https://www.networkworld.com/article/972044/ethernet-at-50-bob-metcalfe-pulls-down-the-turing-award.html
https://www.networkworld.com/article/970970/what-is-ethernet.html
https://computer.howstuffworks.com/ethernet5.htm

Creators and Guests

Michael Bird (Host)
Sam Jarrell (Host)

What is Technology Now?

HPE news. Tech insights. World-class innovations. We take you straight to the source — interviewing tech's foremost thought leaders and change-makers who are propelling businesses and industries forward.

PRAFUL LALCHANDANI
I think I wanted to be an astronaut. I am not sure how I got into that, but somehow maybe it was watching the Apollo landing on the moon or something that got me there.

But, uh, I just remember thinking that's what I wanted to be. And then I went from there to aeronautical engineer, and from aeronautical engineer I ended up doing electrical engineering.

MICHAEL BIRD
Impressive stuff. So Sam, that was Praful, our guest for this week on Technology Now. Tell me all about what he wanted to be when he grew up.

SAM JARRELL
I love that he wanted to be an astronaut. I was far less ambitious. I just wanted to be a veterinarian.

MICHAEL BIRD
Ooh, that's not less ambitious, Sam. That's very impressive.
What's your favorite animal?

SAM JARRELL
Well, growing up as a kid, it was tigers. I remember watching an animal documentary, and that's why I wanted to be a veterinarian.

MICHAEL BIRD
Nice. Well, I wanted to be a train driver. I'm still holding onto that ambition. Maybe, uh, maybe I should have a side hustle as a train driver.
I think I'm still sort of into trains, but anyway, that's an episode for another time. Maybe another podcast. Anyway, Praful is not an astronaut, but he does something equally as cool, 'cause he's now the VP of Product, Data Center Platforms and AI Solutions at HPE Networking. Networking, especially in the context of AI, is what we are talking about today on Technology Now.

So let's start the show. I'm Michael Bird

SAM JARRELL
I'm Sam Jarrell

MICHAEL BIRD
And welcome to Technology Now from HPE.

MICHAEL BIRD
So Sam, I want to very quickly draw a line between two quite similar-sounding topics that have certainly had me confused a couple of times: networking for AI and AI for networks.

SAM JARRELL
Okay. That sounds like a riddle. What is the difference?

MICHAEL BIRD
It's not a riddle. Okay, so AI for networks is where we use AI to help manage our existing networks, whereas networking for AI, which is what we are covering on today's episode, is all about how we can build networks which can cope with modern AI workloads.
Does that make sense?

SAM JARRELL
Yeah, I think so. And I assume that would be because networks running AI need to be able to handle bigger workloads and that sort of thing, right?

MICHAEL BIRD
Yeah, exactly. And that's why I spoke to Praful, who you heard at the top of the episode. Now, Praful's background in networking and AI makes him the perfect person to give us a bit more insight into how networks have had to change to keep up with the modern demand for AI.

SAM JARRELL
Okay. That does sound fascinating, but I wanna take a moment before we chat to Praful about modern networks to look back at a much older network standard that has somehow withstood the test of time.

Which means it's time for Technology Then.

SAM JARRELL
Michael, if I say the names Bob Metcalfe and David Boggs to you, maybe throw in the year 1973 as a hint, do you think you can work out what I'm talking about today?

MICHAEL BIRD
I mean something to do with networking, but I, I dunno their names…

SAM JARRELL
Well, Metcalfe and Boggs were the inventors of Ethernet. And I'm sure back in 1973 they had no idea that 50-plus years down the line, this protocol would still be the dominant wired networking technology across the world.

So, how did Ethernet work?

Well, back in the day, Ethernet used a coaxial copper cable, the sort of cable that can still be used to plug an aerial into your TV, to transmit signals. And the signals were transmitted using rules governed by the Ethernet protocol, which both the sender and receiver had to understand.

The message would be split into frames, each of which had to include the destination the frame was being sent to and the source's address, and each frame had to be neither too big nor too small. Think of it as trying to send a book in the mail, but you have to send it a chapter at a time, with each chapter posted in a letter addressed to the recipient and carrying your return address on the back. The reason the delivery address is so important is that the letter would be delivered to every single computer attached to the Ethernet, all of them, and each computer would have to check whether the message was actually intended for it or for someone else.
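To make that book-in-the-mail analogy concrete, here is a minimal Python sketch of a classic Ethernet II frame, using the standard 64-to-1518-byte size limits. The class and field names are purely illustrative, not a real networking library:

```python
from dataclasses import dataclass

# Classic Ethernet II framing limits: frames must be at least 64 and at
# most 1518 bytes (excluding the preamble), counting header and checksum.
MIN_FRAME = 64
MAX_FRAME = 1518
OVERHEAD = 6 + 6 + 2 + 4  # dest MAC + source MAC + EtherType + checksum

@dataclass
class EthernetFrame:
    dest_mac: bytes   # the delivery address on the envelope
    src_mac: bytes    # the return address on the back
    ethertype: int    # what kind of payload this "letter" carries
    payload: bytes    # one "chapter" of the message

    def validate(self) -> None:
        size = OVERHEAD + len(self.payload)
        if size < MIN_FRAME:
            raise ValueError("frame too small: pad the payload")
        if size > MAX_FRAME:
            raise ValueError("frame too big: split into more chapters")

frame = EthernetFrame(b"\xff" * 6, b"\x02" * 6, 0x0800, b"chapter one" * 10)
frame.validate()  # raises if the frame is too big or too small
```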

Now, a problem can arise when lots of computers are connected to the same network, because if they all want to send messages, those messages could start to interfere with each other, like everyone talking at the same time.

But Ethernet has a simple solution for that too, although it is a little bit wordy. It's called carrier sense multiple access with collision detection, or CSMA/CD. Say that five times fast.

MICHAEL BIRD
Really rolls off the tongue, doesn't it…

SAM JARRELL
Yeah, exactly. And it's a long name for a very simple concept. If you want to send a message, the computer will check to make sure no one else is sending messages before it transmits.

It's like, you know when you have a big conversation with a lot of people and one person gets the magic talking stick? It's the computer equivalent of waiting for someone else to finish before you start talking. And if two computers try to transmit at the same time, they recognize there's been a collision of messages, and then they each wait a randomized length of time before trying again, to avoid the same thing happening twice.
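Here is a toy Python simulation of that collision-and-backoff behaviour, a sketch assuming a simple slotted model in which each transmission lasts one time slot. The randomized wait borrows real Ethernet's truncated binary exponential backoff; everything else is simplified for illustration:

```python
import random

# Toy model of CSMA/CD: stations that transmit in the same slot detect
# the collision and retry after a random, exponentially growing backoff.

def backoff_slots(collisions: int) -> int:
    # Binary exponential backoff: wait 0..2^n - 1 slots (capped at 2^10).
    return random.randint(0, 2 ** min(collisions, 10) - 1)

def simulate(stations=("A", "B"), max_slots=50):
    ready_at = {s: 0 for s in stations}    # slot when each station may try
    collisions = {s: 0 for s in stations}
    for t in range(max_slots):
        talkers = [s for s in stations if ready_at[s] == t]
        if len(talkers) == 1:
            print(f"slot {t}: {talkers[0]} transmits successfully")
        elif len(talkers) > 1:             # every talker "hears" the collision
            for s in talkers:
                collisions[s] += 1
                ready_at[s] = t + 1 + backoff_slots(collisions[s])
                print(f"slot {t}: collision, {s} retries at slot {ready_at[s]}")

simulate()
```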

Pretty cool, right? There is of course more to Ethernet and how it's transformed to meet the needs of the modern world, which I have no doubt we'll come back to in a future episode. But for now, I think it's pretty cool to think about this technology, still fully backwards compatible all the way back to where it started over 50 years ago, still being one of the most vital protocols in modern society.

MICHAEL BIRD
Yeah, absolutely. And actually, Ethernet is something which I did briefly chat about with our guest this week, Praful Lalchandani. He's VP of Product Management at HPE Networking, and he started by giving me an overview of how he got to where he is today.

PRAFUL LALCHANDANI
So I started with Juniper in 2015. But, you know, along the way I started to think that networking was getting a little bit boring.
Everything was moving to the cloud, so I had an itch to do some startups. So I went and joined one startup; it turned out that startup actually got acquired back into Juniper. But then my startup itch was not really over, so I joined another one, which was in the edge compute space.
Um, you know, again, I wanted to do something small, to do something differently, and that company actually went bankrupt. So I was out there in the market looking for a job, and my boss from my first startup, who was still at Juniper, reached out to me and said, you know, networking is becoming sexy again.
There's all this AI infrastructure build-out that is happening, and, you know, things are moving fast, technology is innovating fast. There's a lot of business to be made over here.
So really excited to be here.
MICHAEL BIRD
What is it about networking in particular that excites you?

PRAFUL LALCHANDANI
It's the fact that it enables AI the way it does. You obviously think of GPUs.
GPUs are the enablers for AI, but, you know, if you're looking at AI job training, which requires thousands and thousands of GPUs, essentially it's a distributed computing problem, and that distributed computing problem needs the network.
Because if you are a cloud provider or an enterprise spending hundreds of millions of dollars on your AI infrastructure, which is mostly GPUs, you don't want the network to slow it down. So the thing that got me excited about networking was that networking became important for enabling this AI infrastructure.
And, frankly, I had seen speeds go from 25 gig to a hundred gig very, very slowly. And suddenly the last two, three years it has gone from 400 gig to 800 gig and now to 1.6 terabits. So technology's evolving fast, the needs are evolving fast, and that kind of keeps it exciting.
MICHAEL BIRD
So, before we dive into the questions, can we just start by clarifying a couple of terms that I've started hearing more of: scale out, scale across, and scale up. What exactly do we mean by those?
PRAFUL LALCHANDANI
Yeah, so good question. So, if you think of an AI cluster, there are racks of GPUs.
All the networking that happens within the rack is called scale-up networking: essentially all the interconnect within the rack, between the GPUs, where the GPUs treat each other as a single, unified memory address space over which they can do load-store operations, right?
Now, one rack is not gonna be enough. You generally have to do training across thousands of GPUs, maybe hundreds of racks, and you need a network that connects all of those racks together; that's what's called scale-out networking. And Michael, one of the things that everybody who's building AI clusters soon realizes is: what's the biggest constraint in your data center?
It doesn't tend to be space. It ends up being power. Power is the constraining factor on how many GPUs you can deploy in a single location. So what ends up happening is that these cloud providers are building multiple data centers that are 10, 20 kilometers away from each other, and that's where scale-across networking comes into the picture.
Because you want to be able to deploy that training job across multiple GPUs that don't fit into a single location. But now you need networking across those locations, and that is called scale-across networking.
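As a quick summary of those three tiers, here is a purely illustrative sketch; the descriptions are rough paraphrases of Praful's taxonomy, not formal definitions:

```python
def network_tier(same_rack: bool, same_site: bool) -> str:
    """Which networking tier connects two GPUs, per the discussion above."""
    if same_rack:
        return "scale-up: in-rack interconnect, unified memory, load-store"
    if same_site:
        return "scale-out: fabric connecting racks within one data center"
    return "scale-across: links between data centers 10-20 km apart"

print(network_tier(same_rack=True, same_site=True))
print(network_tier(same_rack=False, same_site=True))
print(network_tier(same_rack=False, same_site=False))
```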

MICHAEL BIRD
And I guess with each of those levels there are different challenges, and different ways of solving those challenges.
PRAFUL LALCHANDANI
Absolutely. So with scale-up networking, it's more bandwidth per GPU, but it's all within the same rack, so it can all be copper.
You know, copper is gonna be cheaper than fiber any day of the week. But if you're using copper, you need to be really, really sure that the signal integrity between your GPU and the switch is really solid.
So there are those kinds of challenges out there. When you scale out, you have a leaf-spine fabric, and there are two technologies at play over here. But obviously, coming from the Juniper side of the world, we are looking more at open-standard Ethernet technologies to enable that scale-out fabric.
And the challenge over there is mostly around handling congestion, right? In a single rack, everything is one hop away; you're not really gonna see as much congestion as you would in a scale-out network, where you have multiple flows competing for the same bandwidth across that scale-out infrastructure.
And again, with scale-across, it's distance, right? Once you have distance, you're going outside the data center, so you need encryption. You need to make sure that every flow going between data centers, over links that are not in your control, is encrypted. And you also need long-distance technologies as well.
Standard data center speeds and feeds, going maybe up to two kilometers, more likely 500 meters, are not enough to carry you across your data center sites.
MICHAEL BIRD
Fascinating. And I guess when we're talking scale-across and scale-out, we are talking fiber rather than copper?

PRAFUL LALCHANDANI
Scale-across is always gonna be fiber. You cannot go copper across the kilometers that you're looking at. In the case of scale-out, you kind of have choices. The decision between copper and fiber is mostly gonna be a toss-up between how much distance you need to go and the cost, right?
And the speeds, by the way: we are talking about 400 gig, 800 gig. Copper is not gonna be able to go more than two-, three-meter distances.
So maybe from the server to your leaf switch, copper can do the trick. But if you're going beyond two or three meters, you're definitely gonna need fiber to carry you forward, right?
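As a rough rule of thumb, the copper-versus-fiber decision Praful describes could be sketched like this; the cutoff is the two-to-three-meter figure he quotes, and the labels are illustrative, not vendor guidance:

```python
def choose_medium(distance_m: float, inter_site: bool = False) -> str:
    """Illustrative media choice for 400/800-gig AI fabric links."""
    if inter_site:
        # Scale-across: kilometers between sites, encrypted in flight.
        return "fiber with long-haul optics (and encryption)"
    if distance_m <= 3:
        # Copper only survives very short runs at these speeds.
        return "copper, e.g. from server to leaf switch"
    return "fiber with optical transceivers"

print(choose_medium(2))                        # copper ...
print(choose_medium(30))                       # fiber ...
print(choose_medium(20_000, inter_site=True))  # long-haul fiber ...
```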
MICHAEL BIRD
Okay. We've talked about AI for networks, and I think what we wanna talk about here is networks for AI.

PRAFUL LALCHANDANI
That’s right

MICHAEL BIRD
What makes networking for AI workloads different from how we just sort of do networking in a normal data center?

PRAFUL LALCHANDANI
Yeah, good question. Look, a normal data center, what I call your traditional general-purpose enterprise data center running enterprise applications, even today is mostly done on 25 gig per server, right?
And I've seen it be that way for the last seven, eight years. You know, nobody needs more bandwidth for enterprise applications. Now, fast forward to the world of AI: even back in, let's say, 2023, 2024, with the Hopper class of GPUs, you were already looking at 400 gigabits per second of scale-out networking per GPU, and there are eight of those GPUs.
So you do the math: it goes into terabits per second per node. Fast forward to now, and that's driving 800 gig per GPU.
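The back-of-the-envelope math is easy to make explicit; a quick illustrative calculation using the figures Praful quotes:

```python
# Scale-out bandwidth per node, per the figures in the interview.
gpus_per_node = 8

hopper_node = 400e9 * gpus_per_node   # 400 Gb/s per GPU, Hopper era
current_node = 800e9 * gpus_per_node  # 800 Gb/s per GPU today

print(f"Hopper-class node: {hopper_node / 1e12:.1f} Tb/s")   # 3.2 Tb/s
print(f"800-gig node:      {current_node / 1e12:.1f} Tb/s")  # 6.4 Tb/s

# Versus ~25 Gb/s per server in a traditional enterprise data center:
print(f"vs. 25 gig enterprise server: {hopper_node / 25e9:.0f}x")  # 128x
```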
So first of all, Michael, the bandwidth needs are insanely higher, right? And number two is that standard Ethernet was never designed to be lossless or congestion-free. It was always designed to be best effort: I'll get your packet from here to there, and if I lose it, you're gonna retransmit, right?
That's the expectation, and enterprise applications don't feel the pain. Okay, the network retransmitted it, no big deal. But if you start retransmitting packets, losing packets, or even dealing with a little bit of congestion in AI networks, you reach a situation where, if you're, say, training across a thousand GPUs, any congestion could mean a thousand GPUs are sitting idle.
Because until all of them finish their piece of work, they cannot move on to the next set of work. Right? And that's not something you want from your expensive infrastructure.

MICHAEL BIRD
Some of these GPUs, we're talking hundreds of thousands of dollars, right?

PRAFUL LALCHANDANI
An eight-way system can be roughly $400,000 for just one node. It's easily millions and tens of millions of dollars of GPUs.
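To put a rough number on what a congestion stall costs in idle hardware, here is some illustrative arithmetic using Praful's $400,000-per-node figure; the thousand-GPU job size is the example from earlier in the conversation:

```python
# Capital sitting idle during a congestion stall, per the rough figures
# quoted in the interview.
node_cost = 400_000      # dollars per eight-GPU node
gpus_per_node = 8
training_gpus = 1_000    # GPUs in the example training job

idle_capital = (training_gpus / gpus_per_node) * node_cost
print(f"${idle_capital:,.0f} of GPUs waiting on the network")  # $50,000,000
```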

MICHAEL BIRD
So you don't want your network to be a bottleneck basically.

PRAFUL LALCHANDANI
You don't want a network to slow that down. Yeah.

MICHAEL BIRD
It's interesting you talked about Ethernet, 'cause Ethernet's been around for decades at this point.

PRAFUL LALCHANDANI
Yeah, of course, of course. Before I was born maybe. Yeah.
MICHAEL BIRD
So it sort of feels like there's this strange juxtaposition between this brand-new thing, AI, and little old Ethernet.
PRAFUL LALCHANDANI
But there is a clamoring for more open-standard Ethernet solutions, because people know how to manage and operate those. So Ethernet has been picking up in a dramatic way over the last two, three years for that AI interconnect as well. And it's not standard Ethernet: Ethernet was not built to be congestion-free.
Ethernet was built to be best effort. But over the last two, three years, vendors, including HPE Juniper Networking, have put in a lot of investment to uplevel Ethernet so it can meet the performance and congestion-free needs of AI workloads.
MICHAEL BIRD
Right. And I guess organizations have already got skills in Ethernet.

PRAFUL LALCHANDANI
Yeah, exactly. They're using Ethernet everywhere: in your campus, your branch, your WAN, everywhere else.

MICHAEL BIRD
So how do you ensure that latency's low enough?

PRAFUL LALCHANDANI
Yeah, so making it low latency is all about making it congestion-free. So think of it this way: if you have a four-lane highway and the four-lane highway suddenly becomes a two-lane highway, that creates a backup, and that's latency, that's congestion, whatever you wanna call it, right?
So in the way we build Ethernet networks, first and foremost, you never have that four-lane highway going into a two-lane highway. It's called a non-blocking architecture: if it's 400 gig coming in, it has to be 400 gig going out the other side. That's a start, right? But Ethernet can still see congestion if two 400-gig flows come in and they're trying to map to the same 400-gig link, right?
And that was a problem with standard Ethernet, or best-effort Ethernet. But some of the technologies that we have introduced over the last couple of years mean these switches are now able to look at congestion in real time, monitoring the congestion of the lanes at microsecond granularity. And if a new car comes in,
I can see that this lane has five cars and this lane has one car, so let me send the next car down the lane that has fewer cars, less congestion. Packets instead of cars, from that perspective, right?
So eliminating congestion is the same as eliminating latency, or at least reducing it significantly.
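The lane-picking idea can be sketched in a few lines; the queue-depth numbers and names here are invented for illustration, not any real switch API:

```python
from dataclasses import dataclass

# Toy model of congestion-aware load balancing: each candidate uplink
# ("lane") reports how many packets ("cars") are queued, and the switch
# steers new traffic down the least-congested path.

@dataclass
class Uplink:
    name: str
    queue_depth: int  # stand-in for telemetry sampled at microsecond scale

def pick_uplink(uplinks: list[Uplink]) -> Uplink:
    return min(uplinks, key=lambda u: u.queue_depth)

lanes = [Uplink("spine-1", 5), Uplink("spine-2", 1), Uplink("spine-3", 3)]
print(f"steer next flow via {pick_uplink(lanes).name}")  # spine-2
```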
MICHAEL BIRD
When we talk about AI data centers or AI compute, liquid cooling comes up. Liquid cooling is always part of the conversation. Are we starting to liquid cool networking infrastructure now?

PRAFUL LALCHANDANI
Oh yeah, absolutely. The next question you might ask is: is it needed? Look, technology-wise, it is not needed.
The question is whether you need it or whether you want it. You can still air cool it, but if you're building a data center, a lot of the time you want your switching infrastructure to be liquid cooled.
The biggest constraint in a data center is power, right? So even shaving off 25% of the power on a switch, when you have thousands of those switches in your data center, could be a few kilowatts' worth of power. Maybe you can use that, and more, to put more GPUs on your infrastructure and monetize that.
And frankly, most of the new data centers that are being built out specifically for AI, specifically for GPUs, already have a liquid cooling infrastructure for the GPU racks. So then it becomes a matter of: is air cooling actually gonna be harder, because now I need a rear-door heat exchanger, or a more expensive cooling infrastructure running at higher velocity?
So what you're finding is that, as people build out new AI data centers that are already liquid-cooling enabled, it's actually easier to have your switching rack also be liquid cooled. Then I don't have to have two different systems, one for my GPU rack and one for my networking rack.
MICHAEL BIRD
So, looking forward, do you see networks continuing to adapt as AI workloads, uh, increase?

PRAFUL LALCHANDANI
As long as GPUs are evolving and requiring more and more speed, I think the technology is going to adapt very fast. You know, we just talked a lot about scale-out networking today, but scale-up networking is just getting started.
So this is a constantly moving space, in my opinion, for the next few years, because...
MICHAEL BIRD
Is there a limit to how much data we can transfer?
Or actually, you know, is the sky the limit at the moment?
PRAFUL LALCHANDANI
Maybe like seven, eight, nine years ago, we were at 12-nanometer fabrication technology, and everybody started to say Moore's law has ended, we need a completely new technology paradigm before we can scale this up. Guess what: the last generation was 5 nanometer, and now for our 1.6-terabit chips we are gonna be looking at 3-nanometer technology. Whether that goes to 2 and 1 and sub-1-nanometer technology, it doesn't seem to end, but I'm not the expert on that topic.
You know, on the chip fabrication side. But as long as chips can be fabricated at more minute dimensions, it seems like the sky is the limit for now, at least to me. Yeah.
MICHAEL BIRD
Praful, thank you so much.

PRAFUL LALCHANDANI
Yeah, thank you Michael. Thank you for having me here. Yeah.

SAM JARRELL
Wow, that was a pretty interesting interview. One of the things that really helped me better understand everything was the car analogy that Praful offered up about congestion and highways, treating packets like cars.
I wonder how much overlap there is between the way network engineers think about networks and the way actual road engineers think about road design.

MICHAEL BIRD
That's such a good point. And what really helped me understand was when he talked about why these networks need to be so fast, and why it can't just be handled with standard data center switches. What he said was, congestion means that GPUs sit idle, and you're essentially wasting a very, very expensive resource: $400,000 for a single node. So you understand why, from a networking perspective, you don't want congestion. You wanna make sure your GPUs are being used.
What did you think about scale up, scale out, and scale across? Had you heard those terms before?

SAM JARRELL
I feel like I had heard scale up, but I had not heard of scale out and scale across. The scale-across thing was kind of interesting to me, seeing how far they're trying to spread the GPUs across multiple locations. I especially appreciated learning why copper can't always be used,
'cause I had obviously heard of copper, and I've heard of fiber optic cables for things like cable television, I guess. But I didn't realize how limited copper is: it's like two to three meters. That's very short.

MICHAEL BIRD
It's all in perspective, right? I mean, I'm currently talking to you over copper that's more than two or three meters long, because I live in the deep, dark countryside.
So you can use copper, but the sheer speeds and the sheer amount of throughput that these networks require mean that, yeah, you have to use fiber for anything more than two or three meters, which is slightly insane. What I found really interesting was the conversation about scale-across, and how,
as Praful said, and I think we've heard this in other shows, power is one of the main constraints when it comes to AI data centers. And actually, if you can spread AI workloads across multiple data centers, those power constraints can be moved around a little bit.
And again, liquid cooling: we have talked about liquid cooling quite a lot.

SAM JARRELL
Yeah. Because it comes up so much, I'm starting to get more and more of a picture of just how much liquid cooling is going to be used in the future. I mean, it's already used quite a bit, but going back to some of our previous conversations about sustainability, liquid cooling can
be part of how you absorb some of this heat and then use it to help power the places where these data centers are housed. It's just crazy to me. And as these scale-out networks keep building up, since he said this is kind of a newer area, I think we'll see a lot more liquid cooling in the data centers that need these networking architectures.
So this is kind of exciting. It feels like we're watching something come to life.

MICHAEL BIRD
Yeah, I, I would agree with you.

MICHAEL BIRD
Now, Sam, obviously we've talked a lot about data centers in this episode, but it's not just organizations that use them. Pretty much all of us interact with networks in our daily lives in some manner. So something I really wanted to ask Praful about was whether organizations, and us as consumers, would see these AI networking advances trickle down to our normal data center networks, and maybe even to our home networks.

PRAFUL LALCHANDANI
Will Michael get 1.6 terabits at home?

MICHAEL BIRD
Will I get 1.6? This is what I wanna know. Will I get 1.6?

PRAFUL LALCHANDANI
That's the whole reason for this show, isn't it? No, I don't think so. I don't think that's happening; I don't think there's a need for it. And like I said, it's the same for general-purpose data centers: still 25 gig, going to 50 gig, maybe going to a hundred gig.
So I don't see the need for putting those high-speed technologies into general-purpose data centers. But look at consumers: our lives are already being impacted.
As consumers, the GPUs are getting more powerful and the networking alongside them is getting more powerful, so I think we are seeing the benefits of it. But we are not getting 3.2 or 1.6 terabits to the home, for sure.

SAM JARRELL
Okay, that brings us to the end of Technology Now for this week.

Thank you to our guest, Praful Lalchandani,

And of course, to our listeners.

Thank you so much for joining us.

If you want to hear more about networking, we’ve also linked in the show notes to our episode with Rami Rahim, the leader of HPE Networking.

MICHAEL BIRD
If you’ve enjoyed this episode, please do let us know – rate and review us wherever you listen to episodes and if you want to get in contact with us, send us an email to technology now AT hpe.com.

Sam, subject?

SAM JARRELL
Scale up, out, and across.

MICHAEL BIRD
and don’t forget to subscribe so you can listen first every week.

Technology Now is hosted by Sam Jarrell and myself, Michael Bird.
This episode was produced by Harry Lampert and Izzie Clarke with production support from Alysha Kempson-Taylor, Beckie Bird, Alissa Mitry and Renee Edwards.

SAM JARRELL
Our social editorial team is Rebecca Wissinger, Judy-Anne Goldman and Jacqueline Green and our social media designers are Alejandra Garcia, and Ambar Maldonado.

MICHAEL BIRD
Technology Now is a Fresh Air Production for Hewlett Packard Enterprise.

(and) we’ll see you next week. Cheers!