The Circuit

With Special guests, Austin Lyons and Paul Karazuba! 

The conversation covers NPUs (Neural Processing Units): their architecture, performance, and relevance in the semiconductor industry. The discussion addresses the use of TOPS per watt as a metric for NPU performance and the design origins of NPUs, including the use of licensed IP and the evolution from DSPs to NPUs. The speakers also discuss the integration of NPUs into various devices, their design philosophy, and the potential impact on consumer devices, along with the role of NPUs in edge devices, AI PCs, and the future of consumer use cases for AI. The conversation concludes with predictions about the widespread adoption of NPUs and their potential impact on the market.


What is The Circuit?

A podcast about the business and market of semiconductors

Ben Bajarin:

Hello, everybody. Welcome to another episode of The Circuit. I am Ben Bajarin.

Jay Goldberg:

Hello, world. I am Jay Goldberg, coming to you from an undisclosed location, a TSA holding cell, effectively.

Ben Bajarin:

Yeah. We think Jay has been detained by some legal authority, and we're not sure who, based on his background, if you're watching this. We would like to welcome Austin Lyons back to the podcast. Austin, how are you?

Austin Lyons:

Cowabunga. I'm good. I'm coming at you. I'm chip strat on the interwebs.

Ben Bajarin:

That was tremendous. I was gonna work that in if somebody didn't at the end. So so very good. That was a request from a listener to work in a word. In fact, do this every week because we will be happy to say something random in the middle of a conversation.

Ben Bajarin:

We'd also like to welcome Paul Karazuba. Did I say that right?

Paul Karazuba:

You got it. Thanks for having me.

Ben Bajarin:

Yes. Awesome. And, Paul, real quick, can you give us a quick who you are, where you're from, what you do?

Jay Goldberg:

Tell them. Introduce yourself to the people.

Paul Karazuba:

Absolutely. My name is Paul Karazuba. I'm the vice president of marketing at Expedera. We are an NPU IP startup here in Silicon Valley, and I've spent, it will be 26 years on Saturday, in the semiconductor industry. And certainly not that I'm counting the days, but it does certainly make you feel old when you say that.

Ben Bajarin:

Tremendous. Tremendous. Okay. So as you might have maybe hinted at, right, what Paul does for a living is relevant to today's podcast, which will show up in your podcast player as capital NPU, exclamation point, exclamation point, exclamation point, exclamation point, exclamation point, exclamation point, exclamation point. Because the topic of today is NPUs, which, if you're new, stands for neural processing unit.

Ben Bajarin:

This is not a newish term, although people are starting to use it more regularly, especially people who did not use it before, which was actually a relevant conversation a couple years ago. For example, Intel called this their VPU, and many of us were like, stop doing that. Everybody's gonna call it an NPU, so you might as well do so. But based on the fact that these things have vector cores in them, Apple almost never referred to it as this until the iPad launch. This was always Apple Neural Engine.

Ben Bajarin:

And many of us who knew what that was knew it was a neural processing unit, but they didn't wanna talk about it. And now out into the world, Qualcomm and Intel and AMD, everybody's saying, we've got NPUs. And it's basically the newest chip on the block, even though it's not the newest. It's the newest because there was a CPU and then there was a GPU, and now there's an NPU, which makes it relevant.

Ben Bajarin:

But that's the topic for today's conversation. So I'm gonna lob out to our committee the thesis for why this product exists. We recently at Creative Strategies did a research report on the NPU, and a lot of that work went into describing what this is architecturally: the way that it uses a host of cores, everything from vector, matrix, and scalar cores, in a symphony of collaboration tied to memory that may or may not be on that block. But the thing that makes the NPU unique, and this is sort of the point that we wanted to make, is that it can process at a much lower wattage than the CPU and the GPU cores. That doesn't mean that it's always going to be the best place to run an AI workload.

Ben Bajarin:

Just that if you have something that you wanna run, and let's just say that's gonna run for 5 hours in the background, right, give or take, the NPU is a great part for that because it can run tops per watt at milliwatts of power. So that's good. So that's the basis. Now the thesis I wanna throw out, or the counterargument to what we're saying, is, well, just run that on the CPU and just run it on the GPU. But the point that landed with me in this exercise was CPUs and GPUs have other jobs than to run AI and do dense matrix multiplication math.
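
For intuition on the battery-life point, here is a tiny back-of-envelope sketch; every wattage and battery figure in it is an assumed, illustrative number, not something measured or cited in the conversation:

```python
# Rough battery math for a long-running background AI task.
# All figures below are illustrative assumptions, not measured values.
battery_wh = 50.0        # assumed thin-and-light laptop battery capacity (Wh)
task_hours = 5.0         # the 5-hour background workload from the example

cpu_gpu_watts = 3.0      # assumed sustained draw if a general-purpose core runs it
npu_watts = 0.3          # assumed sustained draw on a dedicated NPU

for name, watts in [("CPU/GPU", cpu_gpu_watts), ("NPU", npu_watts)]:
    used_wh = watts * task_hours
    print(f"{name}: {used_wh:.1f} Wh used ({used_wh / battery_wh:.0%} of the battery)")
```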

Ben Bajarin:

Right? So not that they can't do that, but that they might also be playing a game. They might also be visually crunching and encoding video in real time. Right? The CPU is running system tasks.

Ben Bajarin:

It's got a whole slew of things. So so do we agree with the thesis that having something that's dedicated to this process so that the other cores can do their jobs, they'll do their jobs really well and maybe not be distracted and whatnot, is a a sound thesis to approach the role of the NPU? We'll start with Paul, then we'll go to Austin.

Paul Karazuba:

So, do we agree? Entirely. While a CPU and GPU, as you have said, are fully capable of running an AI network, there's a difference between can and should. And if you're talking about a battery powered device, for instance, you know, the difference in battery life can be considerable when you talk about running an AI network in a dedicated, purpose built AI core rather than a more general purpose CPU or GPU. So, yes, absolutely, NPUs should exist, do exist for a reason, and are more and more becoming a part of chip designs everywhere, in every market, from what we're seeing.

Austin Lyons:

Yeah. I will keep going with Paul's answer. So if we step back and ask again, like, what is the purpose of an NPU? Why did we ever create one in the first place? Why not just use the CPU or GPU?

Austin Lyons:

I think if you kind of pull back the cover and you look: Ben, you mentioned scalar math, which is, like, operating on single data, you know, like 42 + 2, and vector math, which would be operating on arrays of data or lists of data. And then there's tensor math, so matrix multiplication. That would be a two dimensional tensor, and then there are even higher dimensions. And with neural networks, the type of math that they do is often vector math and tensor math. Now you can definitely do these types of operations on a CPU.
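
As a tiny illustration of the distinction Austin is drawing here, a minimal sketch with made-up numbers (NumPy is used purely for convenience):

```python
import numpy as np

# Scalar math: one value at a time, e.g. 42 + 2.
scalar_result = 42 + 2

# Vector math: the same operation applied across a whole array at once.
v = np.array([1.0, 2.0, 3.0, 4.0])
vector_result = v * 2.0 + 1.0            # elementwise multiply-add over the list

# Tensor math: matrix multiplication, the core of neural-network layers.
activations = np.random.rand(4, 8)       # e.g. a batch of 4 inputs, 8 features each
weights = np.random.rand(8, 16)          # a layer's weight matrix
tensor_result = activations @ weights    # (4, 16): many multiply-accumulates at once
```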

Austin Lyons:

CPUs can handle vector math. They've got SIMD extensions, but there are limitations, and it's just not as fast as having the native hardware to do it. Mhmm. So sure, you can do it on a CPU. It's going to be slower.

Austin Lyons:

You can do it on a GPU. GPUs can support vector and matrix operations. But kind of like Ben said, you know, especially on an edge device like, like a cell phone, the GPUs are being utilized for they're they're graphics oriented. They're being utilized to drive the display. Right?

Austin Lyons:

And so, if you can instead have an NPU, which is geared for AI compute, lots of parallel computations, tensor and vector multiply accumulate units, designed for low latency and low power, then you can both reduce GPU utilization and reduce CPU utilization, like you were talking about, Ben and Paul. But, of course, you can also do it in the most efficient way. You can do it on a GPU; it's just gonna take extra power.

Austin Lyons:

So that's the way I kind of frame it: where is the best place to do it? It would be the circuitry that was designed to do it most efficiently.

Paul Karazuba:

And one other thing perhaps to add to that is when we think of networks that are running on devices, regardless of cloud or device or whatever it may be, we tend to think today in terms of, which is, you know, funny to say in AI, maybe the more traditional video or audio networks that one might run. If you think of the comparative size of an LLM that you might wanna run, and if you're trying to optimize for something like that, where you have 100 times the number of operations that you might have on a traditional network, that level of optimization or efficiency that an NPU is gonna give you is going to be dramatic compared to what the same thing would take on a GPU or a CPU. So looking forward, the advantages of having an NPU inside of your SoC are just going to be more and more with every release of every new network.

Jay Goldberg:

I find it kind of humorous that we even have to have this debate, or not debate, that we even have to have this conversation to validate an NPU, because, like, the whole history of semiconductors is just one long succession of multiple chips being merged into one chip. Right? Like ALUs, or arithmetic logic units, used to be a discrete chip, what, 30 years ago, 40 years ago, and now we don't even talk about them. Right?

Jay Goldberg:

But it was a workload, and we needed something to do it. And over time, we merged it into CPUs and kinda forgot about it. I think NPUs fall into that category. It's new functionality. We need something to do it better than the existing solution, so let's just do it.

Jay Goldberg:

But but for some reason, this one is contentious.

Austin Lyons:

Yeah. I want to add on to what what Jay's saying. So, you know, I was looking back at the history of domain specific accelerators. And if you go back to the Intel 8086, it was integer based math. So if you wanted to do floating point math, you had to do that in software.

Austin Lyons:

It was very inefficient. And then the Intel 8087 in, like, 1980 was a math coprocessor designed for native floating point arithmetic. And so that was, like, a very early example of a domain specific accelerator offloading floating point math from the CPU. And NPUs are sort of like the AI equivalent of that. So this is not new.

Ben Bajarin:

Yeah. But at the same time, where this got a little bit, or I guess where the lack of clarity came from, was it takes up transistor budget. And so everybody was essentially saying, like, is that worth it? Especially if that block was to get bigger, as you see it get bigger in a couple of different, you know, implementations, I would say Qualcomm's, for example, and then you'll see a few others come out of Computex where you're like, wow, that's way bigger.

Ben Bajarin:

You put way more cores in that than I thought you were going to. So that's where I think it was just misunderstood. Right? I think the way that we're digging into why it was designed this way is the important part. But just from where I heard people critique this, even in the early days some of the big names that we're talking about were like, I just don't know how much of the die area I should give to it.

Ben Bajarin:

I got other things I need to do for core compute performance.

Paul Karazuba:

I understand that argument, Ben, and I'll come at this from the business perspective. Yes. You know, Silicon real estate is the most expensive real estate in the world. But at the same point, how are you selling your you know, how are OEMs selling their products today? They're selling them as AI enabled.

Paul Karazuba:

They're selling the AI functionality. They're differentiating their products with AI. Chipmakers are differentiating their products with AI. So if you look at it from a value versus cost perspective, I find it hard to justify why you wouldn't want to put an NPU in your system, and why you wouldn't wanna have a very high capacity NPU in your system, to talk about how great your self driving chip is going to be, or to talk about how great the AI is on your smartphone, or how good your data center is. There are just a million business advantages for doing it, especially when you talk about the cost of silicon.

Paul Karazuba:

It's it's not really a debate for most folks. Right.

Ben Bajarin:

Yes. Agreed. Okay. So in turn, I'd like to go a little bit more technical, and I say technical loosely. Like, let's go as deep as we can, but make this digestible.

Ben Bajarin:

Because what I keep hearing, well, let's do two things. I keep hearing tops per watt. I don't love TOPS. Nobody loves TOPS. We put it in our paper, and I just need to shout this out because I came up with this title.

Ben Bajarin:

I was very happy. It's called Tops of the Morning to You, which is how I describe this section. But all to go on to say, like, look, this is why it's being used, but this is not the best metric. So let's start with that.

Paul Karazuba:

Yeah.

Ben Bajarin:

Let's start with that, and then I wanna understand this wattage element of this math.

Paul Karazuba:

Sure. So I'll start with this. I am an NPU supplier. I will freely admit that I use tops per watt in all of my marketing language that I use, and I will also freely admit and have done so on my website that it is a completely ineffective, meaningless measurement of the effectiveness of an NPU. And we all laugh.

Paul Karazuba:

The problem is it's the most commonly understood.

Jay Goldberg:

So let me, let me get it.

Paul Karazuba:

So TOPS is, TOPS is a measure of MACs times frequency times 2 inside of your device. It's not a real measurement of the actual performance. When you're looking at tops per watt, and I've published a blog about this, and I'll repost it on our social feed so folks who are hearing this can see it, but when you're really looking at the effectiveness of the system, you need to understand all the underlying test conditions of where tops per watt came from: what process node are you on, what frequency are you at, are you assuming integer, are you assuming floating point, you know, what actual network are you running?

Paul Karazuba:

All of that will highly skew the tops per watt argument. I mean, I've seen people post something wild like 300 tops per watt, which is absolutely unachievable in 99 and five nines percent of cases, except for the one little corner case where I have something that is so optimized for this particular workload that it works really, really well. So, yes, tops per watt is a meaningless number, but the problem is it's the number that most people understand, or at least can somewhat relate to. So that's why it's used.
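
To make Paul's formula concrete, here is a minimal back-of-envelope sketch; the MAC count, clock, and power below are made-up illustrative values, not any vendor's spec:

```python
# Peak TOPS = MAC units x clock frequency x 2 ops per MAC (multiply + accumulate).
# Every number here is an assumption for illustration only.
mac_units = 16_384          # assumed number of multiply-accumulate units
frequency_hz = 1.5e9        # assumed peak clock (1.5 GHz)
ops_per_mac = 2             # one multiply and one add per MAC per cycle

peak_tops = mac_units * frequency_hz * ops_per_mac / 1e12
print(f"Peak: {peak_tops:.1f} TOPS")                 # ~49.2 TOPS

# "TOPS per watt" then divides by a power figure; which figure gets used is the catch.
assumed_power_w = 5.0       # a best-case marketing power number (assumed)
print(f"{peak_tops / assumed_power_w:.1f} TOPS/W")   # ~9.8 TOPS/W
```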

Austin Lyons:

Yep. Let me take it a level deeper. Paul did a good job, and I'll explain really quick for maybe people who don't know: TOPS is trillions of operations per second. And when he says MACs, those are multiply accumulate operations.

Austin Lyons:

So kinda at the core of this neural network math, this matrix math, it tends to be multiply and accumulate. And the 2 comes from the fact that these are two operations. And so why people can massage it, or how people can massage it, is: if it's MAC operations times frequency, the question is, what frequency? Is it a peak frequency? Okay.

Austin Lyons:

So that's your theoretical max TOPS. But what is the actual frequency when you're running a particular workload? And even then, that also assumes that all of your compute units, all of your MACs, are fully utilized. So you might ask, well, what about batch size? If I'm running inference with batch size 1, I'm probably not going to actually exercise every MAC. So you could say, well, what's the achievable TOPS at this particular frequency at a particular batch size?

Austin Lyons:

And then, of course, precision: is it int8 versus int4? Maybe your multiply add unit, let's just say it takes up 8 bits of width. And so if you have int4, you could do two int4 operations side by side in that particular unit, so you could basically double that number. And so I think that's why people say it's a very massageable number, because there's all these things you can do.

Austin Lyons:

And even if you go further with tops per watt, the question is, okay, first of all, how'd you get that TOPS? And then secondly, where'd your watt come from? Is that your minimum power, or is that the actual power that you measured the TOPS at? So yeah.
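
Extending the earlier sketch, here is a hedged illustration of the knobs Austin lists; again, all values are assumptions chosen only to show how much the headline number can move:

```python
# How the quoted TOPS figure moves with frequency, utilization, and precision.
# Illustrative assumptions only; none of these are measured or vendor numbers.
mac_units = 16_384
ops_per_mac = 2

def effective_tops(freq_hz, utilization, narrow_ops_per_mac=1):
    """TOPS given a sustained clock, the fraction of MACs actually busy,
    and how many narrow (e.g. int4) ops each MAC can issue per cycle."""
    return mac_units * freq_hz * ops_per_mac * utilization * narrow_ops_per_mac / 1e12

print(effective_tops(1.5e9, 1.0))       # headline: peak clock, 100% utilized, int8   -> ~49.2
print(effective_tops(1.0e9, 0.4))       # batch-size-1 reality: lower clock, 40% busy -> ~13.1
print(effective_tops(1.5e9, 1.0, 2))    # same hardware quoted at int4 (2 ops/MAC)    -> ~98.3
```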

Paul Karazuba:

And throw on top of that sparsity compression and pruning. You know? Totally.

Paul Karazuba:

If you're doing 30% sparsity, you're going to get a 30% improvement, artificially perhaps, in your tops per watt number. So anyone who is looking to evaluate the effectiveness of any NPU needs to look under the hood at where all of the assumptions of tops per watt came from, for everything that Austin has mentioned and for all the reasons that Ben has said.

Jay Goldberg:

So should we stop using it as a metric? And if so, what should we use instead?

Paul Karazuba:

Austin, I'll let you answer that one first.

Austin Lyons:

Yeah. Yeah. That's a tricky question.

Austin Lyons:

You know, as a product person, the question is, what matters for the end consumer? And TOPS maybe indirectly matters, but at the end of the day, if they're trying to run, like, a small LLM, and let's just say, think, you know, ChatGPT locally or something, it's probably time to first token latency and tokens per second throughput. Now it gets hard to compare, but at the end of the day, like, if you gave me an AI PC, I'd want to fire it up and just start asking questions and see how quickly it responds. You could call this almost like a vibes check, and I know that's not quantifiable, but I just wanna see, like, does it feel snappy enough?

Paul Karazuba:

Agreed. From an LLM perspective, time to first token as well as tokens per second are gonna be the key performance metrics. For more traditional networks, I encourage the folks that we work with not to rely on a single number, but to instead, you know, ask suppliers like myself to produce performance estimates on all of the different networks they wanna run. Don't rely on a single monolithic tops per watt number, but tell me the 10 networks you wanna run, tell me the conditions you're gonna run them in, and then have me and my competitors and whomever it may be give you those performance numbers so you see what the real world is actually going to look like, and you're not basing it on some artificial number that, for the points made here, probably is not at all accurate for what you're actually gonna use.

Ben Bajarin:

Yeah. I mean, I think there's 2 points to this. There's the no consumer's gonna go run out and buy, hey. I got a I got a computer and it runs at, you know, x number of tops per watt. Right?

Ben Bajarin:

Yep. What matters, what we're getting in the weeds of, is really just that, like, it's gonna help your battery life. Like, a more efficient process or something. Right? This is where you're gonna get another couple hours out of your battery life if you're gonna run really dense applications at the edge.

Paul Karazuba:

We all remember when computers were sold on, you know, 386, 486, Pentium, and then, you know, the frequency, and then it was the amount of RAM. Very few people buy computers on that anymore, if you can even find it. We're looking at how long the battery is gonna last. You know? Yeah.

Paul Karazuba:

We're looking at, you know, in AI, it's, how fast can I run Stable Diffusion? How quickly is my local LLM gonna respond to my inquiries? That's what's important to the consumer, not tops per watt. Exactly. It doesn't even need to be defined.

Ben Bajarin:

Yeah. I totally agree. I think, like I said, it's useful for us to just say we've proved efficiency with a metric, and great. But, yeah, nobody cares at the end of the day. I wanna go back to something that you guys have touched on, around the max or theoretical wattage of NPUs.

Ben Bajarin:

Because this was actually interesting to me, because we've been benchmarking a bunch of these, Apple's M series, you know, Qualcomm's, etcetera, Intel's, and you see a range of wattages. Like, one might be a 10 watt max. One might be, you know, an 8 watt max. And so I'm kinda curious how that works. So one, maybe at a technical level, either Austin or Paul or both of you guys: how much wattage can I peak this at?

Ben Bajarin:

Like, is that a set number, or could I be like, you know what, I want a 30 watt max NPU, or is it not designed to handle that? So I'm just curious, like, how variable is that number? Who sets that number, and are there theoretical limits where you're like, okay, don't do that?

Ben Bajarin:

Then just run it on the CPU, for example.

Paul Karazuba:

Oh, I would say that there's no real set number; well, it depends on what the application for your device is. If you are building something for a non data center device, you do not wanna do any sort of liquid cooling, which is gonna put your max power consumption of your chip at, let's say, 65 or 70 watts before you're gonna need to do some sort of active cooling. But as far as the watt consumption of the AI portion, it really just comes down to the chipmaker and what they might be wanting to handle, or what their power budget might be in their system, based on what their customers expect or what their users expect.

Ben Bajarin:

Okay. So it can go higher, basically.

Paul Karazuba:

It can. I mean, the silicon can handle it. It's how are you gonna package it, where are you gonna put it in the system, and how is it gonna be run. That's really the question of how hot it could really get.

Ben Bajarin:

Gotcha.

Paul Karazuba:

I mean, there is, you know, the H100, which is, you know, the darling of the training industry for obvious and justified reasons. You know, that's a 70 to 100 watt, at least, chip. Granted, it's in a data center, so they have the ability to cool it, but that's not stopping it from being super successful. That same chip stuck in a mobile phone is probably not gonna be particularly successful.

Ben Bajarin:

Yeah. Okay.

Austin Lyons:

Yep. Totally. I was just gonna add. I I would assume that the power is a design constraint that they're designing around. So, you know, they're kinda choosing frequency and power, for a given thermal characteristic that they care about.

Ben Bajarin:

Gotcha. Okay. Well, I wanted to use that as a segue, because Paul mentioned data center, so I wanna get there. But before I get there, let's perhaps bust open a myth that all of these NPUs are proprietary and homegrown IP, which is kind of what everybody wants you to believe. You know, everybody out there, they won't say, like, we've built this on a, quote, unquote, standard microarchitecture, and so, therefore, we did all of this.

Ben Bajarin:

Like, we invented our NPU. Now, you know, when you start looking at that, you're like, okay, well, where'd you get your tensor cores from? Where'd you get your matrix cores? Where'd you get your MACs?

Ben Bajarin:

And so, obviously, maybe they acquired somebody back in the day, and, you know, that's how they got it. I think in AMD's case, some stuff came with Xilinx. So there's parts of it. That's great. But let's just sorta dispel the myth that this is, like, some brand new evergreen creation of IP.

Ben Bajarin:

These companies are using IP that they got from somewhere, whether licensed or acquired. Paul, you start. Then jump to Austin.

Paul Karazuba:

So there absolutely are companies that have organically created their own NPU. They are very likely big companies. Designing an NPU is not a trivial task. Designing one that functions well is incredibly difficult. But, yes, you know, one of the secrets of the chip world, an open secret, I should say, is not everything on chips has been designed in house by the manufacturer that you see on the outside of that chip.

Paul Karazuba:

It would not be uncommon for parts or all of NPUs, for instance, to be licensed from external suppliers, wrapped into a larger SoC, and marketed as internally created. NPUs, and this is a weird word to use for any semiconductor, but NPUs are sexy. They're interesting. They're considered to be absolutely state of the art. And to say that we didn't design that ourselves, at minimum, creates a little bit of a public relations issue of, what are you doing?

Paul Karazuba:

And at maximum, it says, this chip company, really, what exactly do they do? So, yes, there are absolutely NPUs that are licensed from other people and relabeled, and that's fine. That's the way this industry works. And there is stuff that is created organically. I wanna be very clear about that too.

Paul Karazuba:

Not everyone is licensing them from someone else.

Austin Lyons:

Yeah. Two examples. When I was kinda researching where do NPUs come from and what is their history, you know, I think 2 big company examples. It seems that several people started with DSPs, digital signal processors, as an example of a domain specific accelerator on a chip. So Qualcomm Hexagon, for example.

Austin Lyons:

You know, they basically said, hey, this DSP can do vector math and it has, you know, high parallelism. It was designed for low latency. Let's just add tensor support to it, and now it will be able to run AI workloads. And so it seems that Qualcomm took that approach, and Intel took the same approach.

Austin Lyons:

They acquired a company, Movidius. And they have this, yeah, SHAVE DSP. And so it looks like Intel did the same thing. They took this DSP, added a bunch of matrix multiply and accumulate units, and, you know, probably massaged the memory layout and everything, and they call it an NPU.

Austin Lyons:

And so I think these are just two examples of companies who started with something. One, I think Qualcomm says that they created theirs, and then Intel bought theirs, and then they sort of, like, morphed it or iterated it into an NPU.

Jay Goldberg:

Has Intel said that explicitly? That their NPU is from Movidius, or is that something you sort of pieced together?

Austin Lyons:

I mean, well, if you like Chips and Cheese, a really good website, if you look, they've done some work talking about it. So I can't say off the top of my head if Intel explicitly said that or if it was pieced together.

Ben Bajarin:

Well, it's because when they called it a VPU, they were pretty clear that was Movidius's IP, and now that's just become an NPU. That's just a renaming of it. So I think without saying it, because, again, like, the point I'm bringing up is nobody wants to really say where this came from. I would imagine that Apple's was very similar. Their homegrown DSP evolved into bits of this, and now it is, you know, the ANE is what it is.

Ben Bajarin:

There's other people where I that I wanna talk about next where I don't I don't know where they're getting it from, and that's totally fine. We we we may not. But but my point was everybody wants you all to everyone to believe that this is like, yeah. We we came up with this thing. You know?

Ben Bajarin:

The the thing I wanna say this, though. The thing I think is super interesting about this because, Paul, you mentioned sexy, and and what intrigues me me personally about this is everybody's NPU is different. It's like a snowflake. And and from a design standpoint, this shows a lot of creativity, but also philosophy for the company. Right?

Ben Bajarin:

Apple's gonna approach this with a philosophy, as does anybody doing these as an independent accelerator that's not on the block. So they're all gonna be different, which makes it interesting to me, because that will be very telling of architectural design decisions that we just don't normally see in a CPU, GPU type of thing. Right? You just throw more cores at it. Right?

Ben Bajarin:

Or frequency scale. Great. This is a totally different thing where we actually get to see designers design, and that's what I find interesting to analyze. So that's my pitch for why I think it's interesting.

Jay Goldberg:

I see. It's been really enjoyable for me to watch this, because, and I'll try to put this diplomatically, I think Qualcomm should get more credit, because they took their DSP, and I think they've actually said this publicly, I think there's a blog post about it, of how they rethought their DSP for use as an NPU. And, like, DSPs were once core to Qualcomm.

Jay Goldberg:

Like, that's a big part of building a modem, that DSP. It goes way back, deep into Qualcomm's roots, and now they've sort of modernized it. And I think that's a really good story to tell. Movidius, I mean, Intel acquired Movidius in 2016, and that's kinda older stuff. And I'm just wondering, like, how many of these companies are just repurposing old stuff, like maybe Movidius for Intel.

Jay Goldberg:

How many people are designing it new, or doing serious upgrades to older things like Qualcomm did? And, like, is that the right approach? I mean, do we need to go back to first principles and rethink how we're doing NPU cores? I don't know. I think that approach needs to be explored.

Jay Goldberg:

What do you think, Paul?

Paul Karazuba:

I have a biased answer, because my company went back to day one and rethought what we believe an NPU should look like. There are a lot of other folks in the IP industry who, I say somewhat sarcastically, just basically took a warmed over GPU, CPU, or DSP and created an NPU out of it. Yes. You know, if there was one correct approach to doing this, there would be one architecture in the market, and there's not.

Paul Karazuba:

There's multiple architectures in the market. We're all doing our best to make sure that we handle as many different kinds of networks as we possibly can. I think the answer to your question, though, Jay, is really time will tell. I feel strongly that you need to build your engine to you know, if you're gonna build an NPU, make it as good as possible at processing neural networks and don't take any of the baggage of perhaps past usages of that core technology that then became an NPU. We built ours from the ground up.

Paul Karazuba:

We built ours for the sole purpose of processing neural networks. That is what Expedera believes to be the best approach. I believe we're right. Perhaps we'll be proven wrong, but I believe that's the right way to do it.

Ben Bajarin:

Jay, we needed a bit more of a comment, by the way, though. I just Yes. Please. Your optimism is just it's hurting me now at this point.

Jay Goldberg:

Well, I don't wanna name names. But, like, for the last two years, you and I have been going to all these conferences, and we talk to everybody. And we ask them, like, oh, you got that fancy looking NPU there. Tell us about the design. And we're always met with this awkward silence.

Jay Goldberg:

It's just the weirdest thing.

Ben Bajarin:

Blank stares. Blank stares.

Jay Goldberg:

It's yeah. I mean, like, I I almost wish they didn't know. I'm I'm, like, I'm, like, do they not know, or do they not wanna tell us? It was no.

Ben Bajarin:

It was like the forbidden question. You're like, oh, shoot.

Paul Karazuba:

Well, I mean, let's be fair, though. Anyone who's standing in a trade show booth or anyone who's standing at a technical conference has been media trained. They've been trained to not answer questions from folks like you, specifically for that exact reason. So if you got an answer out of them, I would be quite impressed by your questioning skills.

Jay Goldberg:

Yeah. Well, but you're right. I mean, a lot of these are analyst specific events, where you have a whole bunch of very nerdy people asking very technical questions. And so it has been glaring to me that it's not even a question of media training.

Jay Goldberg:

It's just, like, silence. It it You know? It would get back to you.

Paul Karazuba:

It could be just lack of knowledge in the subject. And, you know, if you if you don't want a secret to get out, don't let the secret get out. Don't tell anyone.

Ben Bajarin:

Right.

Paul Karazuba:

So yeah.

Austin Lyons:

I was just gonna throw out there that if we go to first principles, in theory, you know, designing it from scratch, you can make the right trade offs that you need for neural networks. Now the question is, when these companies took existing, say, DSPs, what design trade offs were made, and were those a limiting factor for their NPU at all, or were those decisions okay and didn't prevent the NPU from sort of reaching its fullest potential? And that's just kinda what we don't know.

Jay Goldberg:

Right. But that that's a great question. Like, that's a good question. And I'm I'm really I personally, I want I'm asking these people because I wanna hear their their thoughts on this, and I wanna hear, like, oh, you know? Sure.

Jay Goldberg:

Right? But

Ben Bajarin:

But I think it goes back to, everyone wants everyone to believe this is secret sauce, and a lot of times it is. You know, it is, because that is an approach that they're taking that is their unique design. So I get that. We ask this because I would just love more understanding of what they're doing, and you will see this. So maybe in a few months, we should revisit this, because after Computex, a lot of people are gonna come out and give, you know, chips and block diagrams of what's doing what.

Ben Bajarin:

And I think that'll be interesting, to talk about who's doing what. But it's starting to come out, at least at the SoC layer. There's another part of this discussion, where I don't think anybody's gonna tell us, that I wanna get to, that's at least vaguely interesting. So let's jump there. So I had a recent conversation with...

Jay Goldberg:

Because you said, you said it's secret sauce. Right? You described it as secret sauce. My contention is that it's not secret sauce. It's three raccoons in a trench coat.

Ben Bajarin:

There's my comment. There's our curmudgeon comment. Thank you, Jay. I was desperately needing this. Three raccoons in a trench coat will be our code word when we're sitting in a room and nobody gives us a straight answer.

Ben Bajarin:

We'll say, it's just 3 raccoons right there, man. Okay. So I I was having a conversation with a hyperscaler who makes a custom ASIC for AI acceleration. And I basically said, give me some details, and they said no. So I said, okay.

Ben Bajarin:

If I was to think that this thing had some vector cores and some tensor cores, would I be in the right direction? And they said yes. So I was like, so theoretically, this big square block that's, you know, maybe two and a half inches by two and a half inches is a giant discrete NPU. I think that's interesting. I want you guys to sort of lob out whether that's kind of the thesis we would think about.

Ben Bajarin:

Because I'm sort of saying, one, none of them are gonna give us, you know, Amazon's not gonna tell us what the microarchitecture for Trainium and Inferentia is gonna be. Microsoft's not gonna tell us for Maia. Google's been pretty clear, it's tensor cores, great, but they're probably not gonna go much beyond that. So is that a good way?

Ben Bajarin:

Is that even the right way to think about it, that these are giant kinda NPUs? And I asked the wattage question because, again, in a data center, maybe it can scale higher, but they want these to function as that block that's purpose built for AI workloads. So that's essentially the train of logic that got me to where I'm at. So I'm throwing out whether that's maybe the right way to think about, you know, Maia, Trainium, Inferentia, and Google's TPUs.

Austin Lyons:

Yeah. I mean, my initial reaction is that is a fine way to think about it, as a discrete NPU. I mean, people tend to call those AI accelerators when it's in a server, and then they call it an NPU when it's on the edge. Right. But I think, yes, you are, like, totally correct to say it's just compute built with the correct data types and math, vector, matrix, scalar, for neural network workloads.

Austin Lyons:

So isn't that an NPU? Totally.

Ben Bajarin:

Theoretically. Yeah.

Jay Goldberg:

Paul. Can can we can we have this debate real quick, though? Is is an NPU a discrete chip, or is it a block in an SoC?

Ben Bajarin:

Oh. See, we would have needed to define this in our definitions part, which we didn't fully define. Because in this case, right, we're leaving it open. But this is actually a great question. It's the first time, I think, I have not heard it asked before.

Ben Bajarin:

So yeah. I don't think it's ever been defined in such a way. Anyway, I'll let you guys chime in first.

Paul Karazuba:

If I'm gonna freestyle the answer to Jay's question: an NPU is a block in an SoC, and an AI accelerator is a dedicated AI processor or coprocessor, a dedicated piece of silicon, let's say. But that's just my Funkmaster Flex freestyle.

Austin Lyons:

Yeah. I guess the counterargument might be, like, well, that chip, Maia or Inferentia, does it have a CPU on it, like a little Arm CPU? And so if it's got a CPU and a bunch of MACs, is that really different than your edge NPU?

Ben Bajarin:

I don't know. I mean, a GPU is a GPU, whether it's discrete or integrated. The CPU is generally just its own thing anyway, because that's where this whole world started. So I sort of just default to, like, you could take it off, you could put it on. I could put a DSP on, I could take a DSP off.

Ben Bajarin:

I could put memory on, I could put memory off. Like, if the function is the same... but that's why I 100% don't think anybody making a custom ASIC is gonna call this an NPU. My annoyance was that it was so ambiguous that no one had any idea what it was doing, so I was just like, we at least need to think about this somehow. And this custom built thing, even if it's a coprocessor that's just handling AI workloads, makes sense to think about philosophically like we think about NPUs. That was the train of logic.

Jay Goldberg:

Totally. Yeah. I mean, my sense is, increasingly, NPU is the block inside the SoC.

Ben Bajarin:

That's how people refer to it. Agreed. Yeah.

Jay Goldberg:

I think that's I think that's where we're we're landing. But

Ben Bajarin:

Yeah. Only because, again, nobody's gonna call it that. And unless, the wild card to our world is, in two weeks, Apple goes, we're making hyperscaler chips, and, oh, by the way, we've turned ANE into its own engine, and it's an NPU. Like, they could screw everybody up.

Ben Bajarin:

But regardless, that's a different conversation if and when we learn about Apple's data center efforts. But I would tend to agree, only because nobody's gonna call their giant chip that. But I did ask the wattage question in an attempt to perhaps make a distinction. If you told me that, theoretically, you can only throttle these to a certain amount, then I would say, cool, then it probably is always gonna be on.

Ben Bajarin:

But if somebody could make an NPU and it throttles to 150 watts, and that's not theoretically impossible, then it scales. It scales to data center coprocessors.

Austin Lyons:

I have a question for the group. Going back to, like, are everyone's NPUs different, or are they similar? Do you think NPUs, let's let's talk edge or AI PCs, will those be a point of differentiation, like the the hardware itself?

Paul Karazuba:

Will it be a... let me make sure I'm answering your question.

Austin Lyons:

Yeah. I buy these laptops because they have the ANE in it, or whatever; they have so and so's NPU in it, and it's always snappier, you know, kind of like an Intel versus AMD thing. Like, I'm buying it because of the NPU.

Paul Karazuba:

I think consumers, for the most part, are more accustomed to buying on what a device is going to do for them rather than a code name of something that's inside of the device, let's just say whether it's an XYZ processor or whatever it may be. Marketing is skewed toward usages. You know, it's skewed toward how is this device going to make my life better, easier, quicker, faster, whatever the words you want to use. I'm not sure that I would see, from the consumer point of view, people buying because it contains a particular chip. Containing a particular function that's different, containing a particular use case that is unique to a specific manufacturer or OEM? Absolutely, I could see that as a buying decision. But as far as, you know, just simply containing a chip from someone, I don't know about that.

Austin Lyons:

So if I buy a Copilot Plus PC with the Qualcomm, like, you know, Snapdragon in it, and it has more interesting or better AI functionality than a different OEM, would that be differentiation then? Even if the consumer doesn't know why.

Ben Bajarin:

So I'd say it this way, though. Let's just say, hypothetically, vendor B, because I don't wanna name anybody, comes out with a Copilot Plus PC, and it gets horrible battery life because it just wasn't designed well. Now the consumer is not gonna be like, oh, vendor B is really bad. They're just gonna be like, those things don't get 25 hours of battery life. This one does, and that's a pain point to me.

Ben Bajarin:

They're gonna say, okay, I'm gonna go toward the one that got better battery life. Right? Because that perception will come out through media, through whatever. But to Paul's point, that's the end result of the implementation that leads to the use case or the value proposition that you care about.

Ben Bajarin:

So that's what they care about: the end. The secret sauce in the middle just gets you to that. And if a vendor doesn't do a good job of implementing this, then that's definitely gonna impact real world performance, which will then skew people toward their buying decisions. That's how I would look at it.

Jay Goldberg:

I think it's even worse than that. Consumers will not care about any of these features. It is not differentiated to them because there are no good consumer use cases for AI right now. Right? At least none that consumers are aware of. Right? Maybe some of these Microsoft features take off. I don't wanna knock them. But, like, unless there's some surprise hit there, consumers aren't gonna care.

Jay Goldberg:

So they're not gonna care about the AI features in the phone. Conversely, if you're running an AI PC, and to your point, Ben, if it's really bad performance because of the chip, the chipmaker is not gonna get blamed. It's going to be 50/50. It's either gonna be the OEM who made the laptop, or it's gonna be Microsoft, because consumers know both of those. Right?

Jay Goldberg:

And if if consumers thought, oh, this this was a recall a recall feature No. It's it's it's, like, really if if it's, like, stutters, people are gonna blame Microsoft. If the battery life is terrible, they're gonna blame HP or Dell. Right?

Ben Bajarin:

Yeah. I would I would I would agree with that.

Jay Goldberg:

But the flip side of that is, if we can actually get a really good consumer use for AI PCs, and one of the chip vendors has a real advantage there, then that's their opportunity to build the Intel Inside brand, or for Intel to reinforce the Intel Inside brand. Right? And I think that's Qualcomm's big opportunity here. Right? If somewhere down the road, some great feature shows up and they have a big advantage, then they go spend a billion dollars on a consumer marketing campaign.

Jay Goldberg:

It'd be huge for them.

Ben Bajarin:

Right.

Jay Goldberg:

But I'd you know, the the problem is by the time the time we get that feature, there's a decent chance Intel and AMD will catch up. And so

Ben Bajarin:

Right. Right. With the exception that, and I would agree that there'll be more parity than not, however, this one block is probably the one where, for at least the next couple years, we'll see the greatest year over year performance gains, as they throw more transistors at it, as they throw more matrix or tensor cores at it. And I don't know what that's gonna yield yet, to be honest with you.

Ben Bajarin:

Like, we could be building something that never gets used and transistors are left on the table. Maybe. I don't know. I'm just saying, like, that's the one we always scrutinize. I wanna see IPC gains.

Ben Bajarin:

Gen on gen, like, it's just really hard. However, AI compute gains, we're gonna see that at the edge. And now it's just, like, to the whole conversation, developers need to absorb that in software. Right? In the same way that you throw more GPU compute in the cloud and all of a sudden the model gets bigger and they chew up every single one of those TOPS and FLOPS.

Ben Bajarin:

We need the same thing at the client edge that give them more compute. In this case, NPU compute. Let developers go wild. Let's see what happens. We don't know.

Ben Bajarin:

But that one block is gonna see drastic performance gains year over year because everybody's investing in it. And it is the one place you can get a whole lot more compute at the edge. So we'll see where that goes. But, yes, we are still all fishing for these wonderful use cases that really most normal consumers don't even care about at the moment. So, anyway, that's the nitty gritty of that bit.

Ben Bajarin:

Okay. Let's just throw some parting thoughts out on the subject, since we're almost at our longest podcast ever, but this was a rich discussion, so I'm sure everybody will enjoy it. Last sort of thoughts, if you will, on the NPU as a whole. And let's just say specifically, can we make any kind of prediction at this point about where this goes, other than what I just said, which is it's gonna get more compute?

Ben Bajarin:

Is there anything else we can say, like, watch for this as SoC vendors evolve this strategy?

Paul Karazuba:

I'm gonna make my prediction that you are going to see NPUs on 90 plus percent of chips other than really small discrete chips within the next 3 to 4 years. Mhmm. You're gonna see them absolutely everywhere, of varying sizes, of varying capabilities in almost every market. You're gonna see them in refrigerators. You're gonna I mean, you've seen them in smartphones.

Paul Karazuba:

You're gonna see them in cars. They're going to be absolutely everywhere.

Ben Bajarin:

Great prediction. Alright.

Jay Goldberg:

Oh, I'll go I'll go a step further on that. I think you're even gonna start seeing them in some pretty small chips too. Maybe not full blown crazy ones, but, like, decent sized decent sized ones and microcontrollers. I think we're that that's coming sooner rather than later.

Austin Lyons:

Mhmm. My prediction, and for the record, if I'm wrong, this is Austin from Chip Strat, is, and we can check on this in a year, I think multimodal LLMs plus NPUs will create a very interesting proliferation of awesome use cases for hands free edge devices. So think, you know, like, smart glasses, like what we had hoped Google Glass would be.

Ben Bajarin:

Right.

Austin Lyons:

Right? Because now they can understand visual input. It can understand your natural language way better than Siri ever could. And therefore, I think, and and by the way, NPUs bring the cost to a developer of inference down maybe to 0. They don't have to rent or buy NVIDIA GPUs.

Austin Lyons:

They've just got it right there. And so maybe we'll see tons of little tiny consumer edge apps being built over the next couple years. And maybe, maybe a year from now, I'll be wearing a pair of those for our file.

Ben Bajarin:

I hope so too, dude. I, like, want my Meta glasses to do all that stuff already, and I'm annoyed they don't, because theoretically, they should. Maybe just not on the device, but out to the cloud. Like, do these things. So yeah.

Ben Bajarin:

Alright. So my wild one will be, and this might ruffle the feathers of some people who I know are listening to this, NVIDIA will make an NPU and either put it on board or integrate it onto perhaps a client SoC. A company that has no NPU at the moment and is the AI king will make one in the not too distant future. So there we go. Well, gents, Austin, Paul, thanks for joining.

Ben Bajarin:

Austin, everyone can find you at Chip Strat, or, yes, the Chip Strat Substack.

Jay Goldberg:

Yep.

Ben Bajarin:

Yep. And, again, I said this before, the best Twitter handle, the Austin Lyons, L-Y-O-N-S. And, Paul, maybe quickly, you know, we mentioned your company. Where can people find you, read your stuff, your company, etcetera?

Paul Karazuba:

Sure. You can find my company at expedera.com. You can find me on LinkedIn, and that's the best way to get in touch with us or see what we've got going on.

Ben Bajarin:

Awesome. Thanks for listening, everybody. Appreciate your comments, And, until next time, cowabunga. I got it done twice.

Jay Goldberg:

Thanks, everybody. Tell your friends.

Paul Karazuba:

Thank you. Bye.