The AI landscape often pulls us between the allure of cutting-edge models and the quiet necessity of foundational work—yet how do these extremes actually connect to deliver value?
Join Conor Bronsdon as he welcomes Denny Lee, a self-proclaimed "data nerd" and Product Management Director, Developer Relations at Databricks, to unpack this very spectrum, from AI's core infrastructure to its most advanced applications. Denny explains why robust logging, tracing, and data lineage are indispensable for credible AI evaluation and feedback, ultimately making AI systems more affordable, accessible, and impactful.
The discussion ventures into strategies for democratizing AI, exploring the "GenAI ladder" from efficient inference and retrieval-augmented generation to deciding when to fine-tune or pre-train models. Denny also tackles the industry's pressing hardware bottlenecks, the critical role of open standards, and the imperative of navigating data privacy in an increasingly AI-driven world. Listen for grounded advice on moving beyond the hype and making practical, value-driven decisions in your AI journey.
Chapters
00:00 Introduction and Guest Welcome
01:31 Diving into AI Foundations
02:25 Importance of Logging and Tracing
08:40 Challenges in Data Quality and Lineage
14:49 Strategies for Cost-Effective AI
19:52 Partnerships and Collaborative Opportunities
22:10 Hardware Bottlenecks in AI
24:56 China's Power and Networking Advantage
25:26 Nvidia's Super Chip and Network Fabrics
26:39 The Growing Demand for Power in AI
29:26 Practical Advice for Data Governance
35:47 Understanding Privacy in AI
36:25 Differential Privacy and Its Challenges
41:57 Conclusion
Follow the hosts
Follow Atin
Follow Conor
Follow Vikram
Follow Yash
Follow Today's Guest(s)
Website: Databricks.com
Podcast: Data Brew by Databricks (available on major podcast platforms)
YouTube: @Databricks
LinkedIn: Denny Lee
Read
SemiAnalysis Blog: https://semianalysis.com/
Check out Galileo
AI is reshaping infrastructure, strategy, and entire industries. Host Conor Bronsdon talks to the engineers, founders, and researchers building breakthrough AI systems about what it actually takes to ship AI in production, where the opportunities lie, and how leaders should think about the strategic bets ahead.
Chain of Thought translates technical depth into actionable insights for builders and decision-makers. New episodes weekly.
Conor Bronsdon is an angel investor in AI and dev tools, Technical Ecosystem Lead at Modular, and previously led growth at AI startups Galileo and LinearB.
Disclaimer: All views, opinions and statements expressed on this account are solely my own and are made in my personal capacity. They do not reflect, and should not be construed as reflecting, the views, positions, or policies of Modular. This account is not affiliated with, authorized by, or endorsed by Modular in any way.
[0:00] Databricks’ Denny Lee:
Unequivocally, it is important for you to take into account like, what are you doing? Are you running the risk? Well, first things first, record everything. Log everything. So then you know or at least even if you made the mistake, you can backtrack and figure out what the heck you did.
[0:23] Conor Bronsdon:
Alrighty. Welcome back to Chain of Thought, everyone. I am your host, Conor Bronsdon. And today, joining me is my favorite data nerd, Denny Lee, Product Management Director, Developer Relations at Databricks. Denny joined Databricks all the way back in 2015 originally, and he's also a longtime contributor to Apache Spark and MLflow, cornerstone open source platforms for the machine learning life cycle. He brings years of experience in the trenches of data and AI as well as a curious spirit that has seen him write about everything from coffee to data labeling. Denny, great to see you. Welcome to the show.
[0:54] Databricks’ Denny Lee:
Conor, really appreciate it. Thanks very much. Super excited. What do we wanna dive into, whether it's on coffee or actually, you know, God forbid, we talk about GenAI and machine learning?
[1:09] Conor Bronsdon:
Well, I know if I get you started on Taiwan's coffee scene or, you know, tea importation, we'll never get back to another topic, so let's definitely not start on that topic. Otherwise, we'll end on that. Yeah. Yeah. We'll end on that. Yeah. If we start with that, this entire episode is probably gonna go a little sideways. I have to admit that. I don't know. If folks in the comments wanna get in here and tell us that that's what we should have recorded, we can always come back and do another follow-up. But you know what? In our pre-call, Denny, we were chatting and you laid out this fascinating perspective
[1:37] Conor Bronsdon:
describing kind of the two extremes of the AI conversation today. So let's start there. On one hand, there's this critical importance of foundational work: tracing, logging, standards, like your work with MLflow. Then on the other hand, you have the push for frontier tech, for cutting-edge advancements and efficiency, like what Databricks is focusing on with reinforcement learning and making high-quality models more accessible.
[2:03] Conor Bronsdon:
And you noted to me that both extremes ultimately aim to make AI cheaper, easier, and more usable. I'd love to unpack that a bit and explore those perspectives and how they connect, starting with that essential foundation of the fundamentals. You emphasize that tracing and logging are crucial to how we approach AI, whether you're an individual developer who's building
[2:27] Conor Bronsdon:
an agent for the first time or a large enterprise. In the rush to build with GenAI, why is establishing this robust logging and tracing from the very beginning so critically important for evaluation and feedback?
[2:39] Databricks’ Denny Lee:
Oh, no. I love that you bring that up because for a lot of folks, they consider this almost, like, boring. Right? They're like, okay. You're just logging some files and you're just tracing. And so this goes back to, for that matter, good old-fashioned MLOps and why we even created MLflow in the first place, which was you and I are guilty of building machine learning models. And in the process of building those machine learning models, we keep the hyperparameters in some, like, Excel spreadsheet or CSV.
[3:07] Databricks’ Denny Lee:
And then you have, like, you know, the v1 of the hyperparameters, the v2, then golden, and then golden v1, golden v2, and then golden underscore golden underscore v1, and so forth and so forth. And then you forget which set of hyperparameters was associated with which model. Right? So then you try to remember, oh, wait,
[3:26] Databricks’ Denny Lee:
what were the hyperparameters that I was using in order to run this? Oh, oops. Okay. And then even worse still, what dataset was I actually using? What version of that dataset? So think about it from the perspective that it was already complicated enough just when I'm doing classic machine learning. Okay? What happens when I'm trying to do this with GenAI, when
[3:51] Databricks’ Denny Lee:
I'm looking at the problem where I don't remember what data I was using in the first place? What were the assumptions that I made when I did such a thing? What were the hyperparameters that I used? What was the lineage of the different sources of data? What was the architecture I used? And so forth and so forth and so forth. And so this entire idea of saying, if you're just playing
[4:19] Databricks’ Denny Lee:
and you're goofing off and you've just got, like, a couple experiments. Sure. Sure. Sure. Yeah. I mean, you probably don't need to really do it, in all seriousness. This could be literally, like, I've got a notepad somewhere where I wrote down the numbers and, like, this is how I ran it. It's comments in my Python script, I'm good to go. Right? But the reality is you're not doing it one time. You're doing iterations
[4:42] Databricks’ Denny Lee:
in the hundreds, thousands, millions, especially if you're automating the process. Right. And so then the turnaround statement was like, well, then how do you go ahead and make sense and evaluate which one of these is actually effective? Don't forget, like, if you look at even just benchmarks, and we all know benchmarks have their problems. So we're not pretending
[5:02] Databricks’ Denny Lee:
like, oh, they're the be-all and end-all. We know that when folks are building foundation models, they're obviously designing them for the benchmarks that are in play. But then, for example, when you look at even the Mosaic Research set of benchmarks, that's actually 35 different benchmarks that we're putting together. Let alone LMSYS, let alone the arenas, right? Let alone all these different ones. Also some good ones out of
[5:29] Databricks’ Denny Lee:
AI2, right? The reality is, how do you track all that? Well, you need to record all this information so you can actually evaluate it. Well, how can you do that? You actually need to record all this stuff, and in a format that works with whatever tool you're using. It doesn't matter which. Like, there's always gonna be folks that are gonna come up with a newer idea, better idea, better product.
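Editor's sketch: the "record everything" idea above can be illustrated with a tool-agnostic run log. This is a minimal sketch using only the Python standard library, not MLflow's or OpenTelemetry's actual API; the function names (`log_run`, `load_runs`, `best_run`) and the record fields are assumptions made for illustration.

```python
import json
import time
import uuid
from pathlib import Path


def log_run(log_path, params, dataset, metrics):
    """Append one experiment record as a JSON line and return it."""
    record = {
        "run_id": uuid.uuid4().hex,
        "logged_at": time.time(),
        "params": params,    # hyperparameters used for this run
        "dataset": dataset,  # name, version, and upstream sources (lineage)
        "metrics": metrics,  # benchmark name -> score
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(log_path):
    """Read every logged run back, one dict per line."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_run(runs, metric):
    """Return the run with the highest score on `metric`, or None if no run has it."""
    scored = [r for r in runs if metric in r["metrics"]]
    return max(scored, key=lambda r: r["metrics"][metric]) if scored else None
```

Because each run is a plain JSON line, any downstream tool can read the file back and answer "which hyperparameters and which dataset version produced this score?" without relying on memory or spreadsheet names.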
[5:55] Databricks’ Denny Lee:
And so we actually need a standard around how we record all these things, what information is actually important. So that way, whatever tooling you use, doesn't matter which tooling, like you should have the right to choose whatever tooling you want, you can go ahead and actually figure out, oh, okay. Well, based on the tooling that I'm most comfortable with, I can evaluate
[6:15] Databricks’ Denny Lee:
how well these hyperparameters, how well this model, basically, evaluate how well this model's working. Right? And then whether you need to fine-tune it, RAG-ify it, build an agent, whatever it is you need to do, be able to evaluate also how well that agent ends up working, how well that RAG ends up working. And so in the end, it's about evaluation and feedback.
[6:39] Databricks’ Denny Lee:
Right? And so the most crucial or fundamental component of doing that is being able to log it. So, you know, and again, this is a standard machine learning engineering, AI engineering, data engineering problem. Like, for that matter, even if you want to just go back to good old-fashioned running-a-service software engineering. Right? You need to record it so you can actually make sense of it. Right? And so you can claim it's just, oh, it's an A/B test. And that's fine. You can do that if you'd like. But the reality is it's an ABCDEF
[7:14] Databricks’ Denny Lee:
umpteen test. That's actually what it really is. And so, you know, again, a bit of a diatribe, and so I'll slow down and stop a little bit. Yeah. It does for us to breathe here. But in the end, we need a standard to make sense of this stuff. And we're hoping, you know, whether it's OTel, whether it's MLflow, whether it's some other standard. Honest to God, we don't really care which one it is. We're hoping people, like, will definitely check out MLflow to be that standard. But if it's OTel, if it's something else, that's great too. We just want one.
[7:48] Databricks’ Denny Lee:
Or heck, for that matter, a few. Like, we're not even that too I need to know the surrealist though. Yeah. But we just don't want 15. Like, yes. Yes. You know the Angel Labs, like, okay. We have 15 standards. Well, let's just create a new one. Now we have 16 standards. Like, no. We would like to reduce the number so that there's only a few. And the key thing: standard.
[8:10] Databricks’ Denny Lee:
There's a standard for us to record so that way, whatever application you build, whatever models you work with, whatever tooling you use, will be able to read this information so you can properly evaluate. That's what it boils down to.
[8:29] Conor Bronsdon:
Absolutely. This standardization of recording and accessing data is crucial for the health and progress of the AI ecosystem, and obviously, it's a huge reason you've been so involved with MLflow for years now. And then you also have this challenge around data quality and data lineage as well, things you kind of touched on here. How does the quality of the underlying data, encompassing everything from ETL and analytics to feature engineering, fundamentally dictate the ceiling for success even with powerful models like LLMs?
[9:02] Databricks’ Denny Lee:
Oh, yeah. Yeah. I mean, it goes back to the usual. I mean, the reality is this is the classic ML problem all over again, which is you're only as good as your data. Now, there are mechanisms by which we can label or need to label it, because we can actually have algorithms come into place to actually pre-label for us. But the reality is, even if you have that, right? Even if you tell me that I'm going to generate synthetic data instead of actually using real data, cool.
[9:28] Databricks’ Denny Lee:
Problem is, how do I generate synthetic data? I have to generate it based off of real data in the first place. Because people often think of synthetic data as fake data. I had a really good conversation, for example, with Michele Catasta, president of Replit. And we were discussing, and for that matter, also the same conversation with Jeff Meyer. Gretel's a great synthetic data company. Right? But the thing is, like, the belief that synthetic data is fake data. I'm like, no. That's the wrong statement. Synthetic data actually allows us to ensure privacy while actually having the key patterns
[10:03] Databricks’ Denny Lee:
that are required for you to generate data that's actually useful for your model. That's actually the whole premise of it, right? And so it's not fake, because if it's fake, then you have a completely different pattern. You have a completely different set of math that goes with it. Right? And so ultimately, the problem still comes down to what is the quality of the data that you have. If you've got high-quality data, you've got something to work with. If you don't have high-quality data, you don't have something to work with. Right? But then how do you extend from there?
[10:32] Databricks’ Denny Lee:
Right? It's also not just the data that's generated. It's also the feedback you provide. Right? So when you talk about, like, hey. Okay. I'm evaluating something. That's cool. Evaluating against what? It's like, okay, well, I'm just gonna use another foundation model to assess it. Don't get me wrong. There are automations like that that are actually extremely helpful. And I'm not saying don't do it. Right? The old LLM-as-a-judge concept. Right? I love the fact that I'm saying old and it's like, what? Not even a year old or whatever it is. Something like that. Yeah. Yeah. Whatever it is. Actually, by the time this podcast is out, the people recording, maybe this will be longer, but you get my drift. It's not that in
[11:09] Databricks’ Denny Lee:
in the frequency of things, it's relatively short. But to us, it's like, it's been around for a while, so that's fine. And so that was the context. Right? Well, sure. But again, it's based off of how good your feedback is. Well, how can another model be the judge unless it itself actually had good data in the first place to properly give you feedback? Otherwise, you're getting a lot of really interesting feedback that basically is incorrect. And we've seen that happen time and time again. Right? You can literally ask models, like, the old school, like, you're asking models for math and it's wrong. Now, most of those models have that fixed, so we're good to go. But then you could ask the same question over and over again, and you'll literally get different answers. And then some people are going, like, well, that means it's hallucinating.
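Editor's sketch: the point about asking the same question repeatedly and getting different answers can be turned into a simple, loggable measurement. The sketch below is an illustration only; `ask` is a hypothetical callable standing in for whatever model client is in use, and exact-match agreement after lowercasing is an assumed, deliberately crude comparison.

```python
from collections import Counter


def answer_agreement(ask, question, n=5):
    """Ask the same question n times; return the most common answer and its share."""
    # `ask` is a hypothetical callable: question (str) -> answer (str).
    answers = [ask(question).strip().lower() for _ in range(n)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n
```

A share well below 1.0 on a question with one exact answer is a signal worth recording alongside the run, whether the responder is the model under test or a judge model.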
[11:54] Databricks’ Denny Lee:
And then my response, more often than not, is, well, that's why you have an LLM in the first place. It has to hallucinate a little bit to have that form of creativity. If you have no hallucination whatsoever, you've got a search engine. All right? So you have to have that balance of how much hallucination you want versus how much you don't. Well, then it goes back to, okay, well, how much do you want? Well, that means you need to evaluate it. Well, how do you evaluate? You need to provide feedback. And how do I do that? I need to record everything in a log, coming back full circle. Right? The reality is we can only do better
[12:26] Databricks’ Denny Lee:
in these systems when we actually can understand the efficacy, the effectiveness of these models. And the reality is for any company, any organization that's out there, it's not like a one size fits all. It's very much in the vein of, okay, maybe company x department a has a very different problem than company x department b. Ironically, x department a is actually much more similar to company y department f than
[13:00] Databricks’ Denny Lee:
company x department b. Right? There's all these types of scenarios. So this is also the reason why open source is so interesting and important, because just like we do in open source, in data engineering, which is obviously my background, where we will often, even though we're competing against each other, work together because we recognize we actually have very similar problems. So there's a bit of coopetition, where we're definitely gonna compete, but we're also definitely gonna cooperate.
[13:26] Databricks’ Denny Lee:
And the exact same thing's happening with models. Exact same thing as when you're building on top of these models. Right? We actually have very similar problems. If we go ahead and try to do them solo, guess what? It's not gonna be as effective as when we all work together. But again, after saying everything I said, how do we work together? Gotta record it. Simple as that.
[13:44] Conor Bronsdon:
And that speaks to the importance of data lineage and of these open standards. I'll say it also speaks to why we're so excited to be partners with Databricks and to have Databricks both invested in Galileo and also a native integration. So huge shout out to the Databricks team and to Denny for coming on the show to talk about that. Let's shift focus a little bit and dive into
[14:08] Conor Bronsdon:
this idea of how do we make AI cheaper and easier to use, finding value faster without prohibitive costs. What are some of the key strategies or technological shifts enabling this democratization of AI development?
[14:24] Databricks’ Denny Lee:
Well, I mean, I think the first thing is to recognize that, especially if you were just even asking me the question a year and a half ago. Right? A lot of people were saying, well, let's pre-train our way out of the problem. And my response is, cool. You have a few million dollars to go ahead and run your own on A100s or H100s at the time. Sure. I'm sure you have all that money handy. Right? And then reality, of course, people are like, oh, I didn't realize it was that expensive. That is correct. First things first. Like, if you think about this as sort of a GenAI ladder, right? Start with the standard inference,
[14:54] Databricks’ Denny Lee:
which is, okay, let me go ahead and take a good solid model and make sense out of it. All right. Then you start saying, okay, maybe I can RAG-ify it or build agents around it, where you basically surround the model with other systems. Right? So for example, the scenario I often try to explain is, like, I know RAG is no longer the term in vogue, but because it is a RAG scenario, I would still use it. Still extremely useful. Yeah. Fine, the marketing term is agents now. Okay. Fine. Whatever. But the context is, okay. Like, for example, I wanna ask my LLM,
[15:29] Databricks’ Denny Lee:
my company's LLM, what is the salary of my CEO? Okay. Alright. So I asked that question. Right. So for starters, if I want that LLM to actually have that information most up to speed, that means I'm constantly retraining the sucker to have the latest information. And the thing is, it's a point piece of information that doesn't need to be hallucinated. You can't hallucinate that number. That
[15:54] Databricks’ Denny Lee:
compensation value is exact. So instead of actually going ahead and trying to have the LLM figure it out, I'm going, like, maybe, just maybe, you can refer to some other database or a vector store or, hell, at this point, a lakehouse, somewhere that solidly says what the actual value is. Perfect. So then that's the combination of saying, okay, I'm having my LLM actually refer
[16:21] Databricks’ Denny Lee:
to an external source. That's the old tooling for your LLM, or whatever else. Okay, cool. Perfect. That solves one part of my problem. But then the other most obvious part of the problem is this one. Are you even allowed to ask that question? So should I even let the LLM receive that question in the first place? Or do I have an authorization that kicks in, like, no, you're not even allowed to ask that question. So that question can't even get to the system. So that's why it's a natural evolution or natural step to go ahead and say, okay, just build an agent, build something around it first before you go ahead. So that's going up the GenAI ladder. Okay, perfect. Well, now you're saying, okay, but no, no. Now forget about the CEO compensation. No, it's about policies, procedures for the company. Okay, cool, cool. Actually, I'll use an even better example.
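Editor's sketch: the CEO-compensation example describes two steps that sit around the model: an authorization check that runs before the question reaches it, and a lookup against a system of record for exact values. The following is a minimal sketch of that flow. Every name here (`FACTS`, `ALLOWED_ROLES`, `classify_topic`, `answer`) and the keyword-based topic detection are hypothetical, and `llm` is a stand-in callable rather than any particular vendor's client.

```python
from dataclasses import dataclass

# Hypothetical system of record; in practice a database, vector store, or lakehouse table.
FACTS = {"ceo_compensation": "<exact value from the system of record>"}

# Hypothetical policy: which roles may ask about which topics.
ALLOWED_ROLES = {"ceo_compensation": {"hr_admin", "board"}}


@dataclass
class Answer:
    allowed: bool
    text: str


def classify_topic(question):
    """Toy topic detector; a real system would need something far more robust."""
    q = question.lower()
    if "ceo" in q and ("salary" in q or "compensation" in q):
        return "ceo_compensation"
    return None


def answer(question, role, llm):
    """Route a question: authorize first, look up exact facts, else call the model."""
    topic = classify_topic(question)
    # 1. Authorization runs first, so a disallowed question never reaches the model.
    if topic is not None and role not in ALLOWED_ROLES.get(topic, set()):
        return Answer(False, "You are not authorized to ask that question.")
    # 2. Exact values are looked up from the source of truth, not generated.
    if topic in FACTS:
        return Answer(True, FACTS[topic])
    # 3. Everything else goes to the model (`llm` is a hypothetical callable).
    return Answer(True, llm(question))
```

The ordering is the design choice being illustrated: the gate and the lookup are ordinary code around the model, so neither requires retraining when the value or the policy changes.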
[17:08] Databricks’ Denny Lee:
It's the coding specific for your company, okay? So you're using, like, Databricks, for example, just because I'm from there, and you're connected to Unity Catalog, you can actually go ahead and extract the metadata directly from there. And so you're generating code, that's a really good look, whether it's Claude or Code Llama or whatever model you're using to generate the code, don't care. All right, fine, perfect. So you're going up the GenAI ladder to say, okay, I agentified it or RAG-ified it, cool. But now I'm saying, I want to build just for me. So let's fine-tune this a little bit. Okay, do I go ahead and now just pre-train the sucker? Of course not. No, you start with a thousand shot, just
[17:48] Databricks’ Denny Lee:
toss in a bunch of questions and a bunch of answers right from the get-go, see if you can improve it just on its own. And then you slowly go up the ladder to basically say, now do I need to fine-tune it? So that's the context I'm trying to give. I'm saying, go up the ladder step by step before you think you need to pre-train. So that already is giving everybody this understanding that I don't need to have the largest model to do this. In fact, I can probably even take a Phi-3 at this point, or Phi-4, to do it, because all you're trying to do is improve the model step by step. That means I don't need as many GPUs in order to perform that trick.
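Editor's sketch: the "thousand shot" rung of the ladder means putting many worked question-and-answer pairs into the prompt before considering fine-tuning. A minimal sketch of assembling such a prompt follows; the `Q:`/`A:` layout and the function name `build_many_shot_prompt` are assumptions for illustration, not a format any particular model requires.

```python
def build_many_shot_prompt(examples, question, max_examples=1000):
    """Place up to max_examples (question, answer) pairs ahead of the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples[:max_examples]]
    # The final block leaves the answer open for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

In practice `max_examples` would be bounded by the model's context window, and the resulting prompt, the examples used, and the scores obtained are exactly the kind of thing to log so this rung can be compared against fine-tuning later.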
[18:28] Databricks’ Denny Lee:
Just as long as I have the tooling and I have the data, then I can actually go ahead and improve the model. And then that becomes your IP. And so that's actually how you reduce the cost, because from where we're sitting, and also from Galileo's perspective, right, the reality is we don't want you to go ahead and pre-train unless you need to. Don't get me wrong.
[18:50] Databricks’ Denny Lee:
If you need to pre-train, yeah, we're here to help you with that. Like, don't get me wrong, but let's not start there. Let's start at a basis of saying, what is the most cost-effective way for you to actually get the most out of these systems? And ultimately, that's the key thing. Right? And so, again, you brought this up right before we started, you know, this wave of conversation. Why did we invest in Galileo? That's precisely the point. Right? We see
[19:19] Databricks’ Denny Lee:
this type of mechanism being the most important for all of our customers. And so provided that we can make it cheaper for you, then boom, that's the value add. And that's why we're talking to you guys. That's why we invest in the other companies we've also invested in. Why? It's all with that 100% belief that we can make this cheaper so it's actually valuable to you right away. As opposed to, like, let's wait eight months,
[19:46] Databricks’ Denny Lee:
pre-train our way out of the problem. Oh, good. So glad it didn't work.
[19:54] Conor Bronsdon:
Yeah. And I'll say Databricks has had a really smart partnership strategy here, you know, partnering with Anthropic, for example, around approaches to adaptive optimization using reinforcement learning to get high-quality models with less data. What's the intuition here of how to unlock the potential for AI, and how is Databricks leveraging these partnerships and these collaborative opportunities that you mentioned, even when there's
[20:18] Conor Bronsdon:
maybe some conflict? It's like, hey. Let's build this bigger ecosystem.
[20:22] Databricks’ Denny Lee:
Oh, yeah. Yeah. I mean, I think the key intuition basically is number one, you need high quality data, and you need the mechanism by which to ensure that you have that data and ultimately lineage for that data. You know where it came from, you know what it's going into. That's the first intuition. The second intuition unequivocally is that no, you're not building one model.
[20:47] Databricks’ Denny Lee:
There isn't one uber model for your company. You're building many smaller models. So the real value isn't to try to take, you know, Llama 4, 450, or whatever the size of that sucker is now. Okay? Right? Or 440. Whatever the number is. I don't remember. It might have changed by the time this comes out. Let's be honest. That's true. That's true. Exactly. That's fair, actually. By the time this podcast comes out, it'll probably be a different number anyways. But you get my drift, like, it's not about that. Like, this is the whole reason why DeepSeek was so exciting for everybody. And don't get me wrong. I think everybody thinking all of a sudden we no longer need GPUs,
[21:25] Databricks’ Denny Lee:
that's also the wrong assumption about DeepSeek by every stretch of the imagination. It's like, no, they talked about v2 of reinforcement learning, which actually a bunch of us were already talking about in the first place. They just showcased it very well with V3. Hats off to them. They did an amazing job. So no knock on them at all. This is really, really, really good, what they do. I use DeepSeek myself. So yeah, yeah, yes. By the same token,
[21:48] Databricks’ Denny Lee:
the whole point is like, yeah, you could take a smaller model. You can agentify it. You can ragify it. You can then go up the ladder to fine tune it before you ever need to do all these other things. And that's really that second intuition, is saying, no, you're not having one model, you're having many models. That's actually how you solve the problem.
[22:10] Conor Bronsdon:
Speaking of GPUs, what about the hardware side? What are the current bottlenecks that you're seeing, and how does solving for them push us forward? Or what's the perspective?
[22:21] Databricks’ Denny Lee:
Unequivocally, the reality is we really only have two GPU makers right now. Okay? We have NVIDIA, which is by far the biggest one. Right? And then we have AMD. And, like, I gotta give AMD a lot of credit. Like, back in December, the SemiAnalysis folks called them out on all the hardware problems and the software issues that they had. Lisa Su actually literally jumped on with Dylan for, like, an hour and a half to talk about all the different problems. So give AMD credit where credit is due. Like, they really are trying their best to catch up and they've made some solid
[22:54] Databricks’ Denny Lee:
strides in their software. Like, really, they actually have. I've got to give them a lot of credit. But let's not also pretend that NVIDIA is not way farther ahead. Okay, we know that. Everybody's talking about NVIDIA in terms of chips that they're slowing, like, can we produce them fast enough? Right. Okay. So without going into a diatribe on TSMC and the fact that they're already producing five-nanometer wafers out of Phoenix. If you wanna start talking TSMC
[23:18] Databricks’ Denny Lee:
and digging into that, I am very on board. So right there, interested in Taiwan, let's go. Yeah. Yeah. Let's do it. Like, so the problem unequivocally is, like, okay. You've got all the H100s, A100s, all the different variations of these chips that are being made. Right? And then you've got the weird ones that are going to China that are
[23:39] Databricks’ Denny Lee:
which I won't get into. Okay? But, like, the fact is, like, the Arizona plant already is producing five-nanometer wafers, which is amazing. I wasn't expecting them to spin up five nanometer already. Meanwhile, Taiwan's already on what? Three. Look. Sorry. They were already doing the three nanometer. Now they're doing 1.5. I personally think it's a marketing spiel to call it 1.5. For that matter, three was already a marketing scheme, but whatever. But the point is that it's smaller. Okay? The real problem is that when it comes to the
[24:12] Databricks’ Denny Lee:
North American and European markets, we don't have enough power. We just don't have enough power. That's why Microsoft invested so much and then stopped, and then Oracle took over and they're basically breaking the bank specifically to get Stargate and the Abilene data centers up and running. You know any great nuclear experts we should have on the show? I would love to talk to someone about it. Oh my god. No. No. Right now, they've, god. Was it Microsoft, when they had, and actually are gonna reactivate
[24:44] Databricks’ Denny Lee:
one of the towers from Three Mile Island? Yes. They are. Specifically, they go ahead and support gonna be three years until it's actually Yeah. Yeah. Yeah. Three years. Again, by you know, depending on when you are listening to this. But Exactly. China doesn't have that problem. China basically produces a shit ton of power, so By the way, they announced, like, 10 new nuclear plants last week too. Yeah. Yeah. Exactly. So they literally produce, in ten years or so, they produce the same amount of power all of America produces. Like, it's insane the amount. So
[25:11] Databricks’ Denny Lee:
they don't actually need to go down to three nanometer or 1.5 nanometer because China's good at power and good at networking. Okay? That's what they're good at. And the reason I bring up networking, I love fabric. Okay? Literally is because what is the NVIDIA superchip? The NVIDIA superchip, like, for example, the Grace Blackwell and their upcoming Vera Rubins.
[25:32] Databricks’ Denny Lee:
It's basically a CPU, Vera; Rubin, their GPU; and then basically a fabric that's like 10x, 20x, whatever magic number it is right now, faster, that allows basically the data to stream from the CPU to the GPU fast enough. And so this is exactly what the Chinese are good at in terms of building these types of network fabrics. So they can just basically say forget power consumption, we have no problems, we'll just add more chips. That's it. Maybe they're not as fast individually, but in parallel, we have that many more chips all connected by fabric, so boom. And yes, the power consumption is, like, efficiency is horrible, but we got enough power, so who the heck cares? Versus in the US, like
[26:18] Databricks’ Denny Lee:
and for that matter, North America and Europe in general, the West, what's happening is we have to be more and more efficient because we have power capacity problems. We can't build the data centers fast enough. We've underinvested in actual We've underinvested massively. So ultimately, what it comes down to is like, when you look at where we're going, everybody talks about the software.
[26:39] Databricks’ Denny Lee:
I'm not saying that's a wrong thing, but why is OpenAI dumping so much into Stargate in the first place? It's because we're running out of power. We're running out of capacity. And so it's not like I don't oh, I would love to see AMD, like, up the game just so that way we can it's not even about competition, honest to God. It's literally about we need more chips. We need more power supply. We need more of everything in order to be able to support all this. Right?
[27:11] Databricks’ Denny Lee:
Yeah. And people are talking about the idea of AI winter, and I think the really limiting factor is gonna be energy for the next couple years until we can start getting reactors online and really have this up-leveling effect. Like, that to me is where I'm more concerned than I am about, like, inference. Like, inference will get cheaper. We can keep throwing things at that. Energy is harder to solve. It takes more time. Yeah. Yeah. Yeah. Absolutely. Even like from inference, like the reality is, especially when you start going to multimodal models. Right? Even inference, like people are saying like, oh, inference, I only need a single GPU. I'm like, well, yeah, if you're doing text, you could probably shove everything. You can shove an entire book in memory and you can do the inference off that for all I care. But I'm like, okay. When I start needing to do a bunch of windowing functions off of video,
[27:53] Databricks’ Denny Lee:
off of sound, all of multimodal AI, a single GPU is not gonna handle that anymore. Maybe now it can, but over time, no, it won't. So I'm going, okay, cool. So that means I need distributed inference all of a sudden or something close to that. Okay, great. Right? So how am I supposed to support that? Oh, I need more power. Except I'm using all the GPUs to train the thing in the first place. I cannot use the GPUs for inference. So exactly
[28:22] Databricks’ Denny Lee:
to your point, like, there is a serious issue around power capacity. Before I even talk about chips, before I even talk about software, it's good old fashioned. Like, I don't have racks. I don't have power. That's it. It's really interesting. By the way, if you learn nothing from the diatribe, log in and take a look at the SemiAnalysis blog. They are excellent about providing
[28:51] Databricks’ Denny Lee:
really good assessments on power consumption. Like, yeah, if you wanna geek out on power consumption, hardware capacity, these oh my goodness. These guys are good at this stuff. Yeah. So unequivocally, if you're into that at all, SemiAnalysis is the blog that I definitely would check out. We will definitely link that in the show notes here. They are Please do. Yeah. I want Dylan to like me. So I'm joking. I'll keep to you. Yeah. Dylan, if you wanna come on the show Yeah. Yeah. There you Yeah. You wanna come on the show? Yeah. Yeah. Yeah.
[29:19] Conor Bronsdon:
We'll have Denny back. We can have a great conversation, talk coffee and tea first, and then we can dive into everything else power. So let's zoom out a bit here. We've gotten really into the details, which I love. But I think there are folks listening to the show who are saying, okay. This Denny guy, he's really smart. He's given me some insights here, but I'm not sure how to apply that to my team.
[29:42] Conor Bronsdon:
How should I be taking actions to evolve our approach to data governance, to our privacy,
[29:50] Databricks’ Denny Lee:
to making practical, grounded decisions about our approach to data and AI instead of just chasing the latest trend, which we're all guilty of sometimes. What would you tell them? Okay. First of all, thank you for calling me smart, but let's not pretend that I am. Okay? So I just I'm Well read. Well read. I'm a well read person. That's all I am. Okay? And if I was the sales guy, then I would just say, hey. Check out Galileo. Okay. Check out Databricks, we're done. No. But by the same token, while that's completely true, the real answer is
[30:18] Databricks’ Denny Lee:
the reality is stop for a second. Figure out what your actual business problem is first. I've had time and time again people come to me, ask me, like, I wanna build this blah blah blah blah, like some gigantic thing. And I turn around saying, okay. Can you tell me the value prop of what you're doing? And their response is like, if I build it, they will come. I'm like, look.
[30:44] Databricks’ Denny Lee:
Unless you're OpenAI, unless you're somebody huge, you're not building it and they will come. You need to actually tell me what the business value of this thing is. And this is anybody from a department to someone running a startup. It's the same difference, honest to God, which is like, tell me I'm a geek, as you can sort of tell. But literally, that's my first question, which is like, it's not the, oh, I wanna play with this tool. Don't get me wrong. I do. Okay?
[31:13] Databricks’ Denny Lee:
But it literally is the first question is, what is your business value? What are you trying to solve? And so don't start big, start small. Start figuring out, okay, I have a specific issue. Then, yes, lineage is gonna come next. And so then you're gonna plug in something like a Unity Catalog to do it. It doesn't help that if you are watching this, I literally wear a Unity Catalog shirt too. So just for fun Yes. Check us out on YouTube for the record. Yeah. Exactly. Just for funsies. Okay? But the reality is, yeah, you do need to track. You need to know what datasets are being utilized for this. You need to know what data you're processing. You need to know what models you're using. You need to know what the lineage of the models is, what version of the models. Right? That's what all this UC stuff is for in the first place. Okay? That's great. And then you say, okay. But now I wanna care about privacy. Okay. What do you mean by privacy? Do you mean security or do you mean privacy?
[32:02] Databricks’ Denny Lee:
If you mean security, it's authorization. Do you mean privacy? Are you dealing with differential privacy? That's a whole other thing not to get into now. Right? Not saying you can't get into it, but now you're actually basically like, with differential privacy, long story short, you're adding noise to the response of every single answer that you're getting. Cool. How does that translate to text? Okay. That's literally another diatribe that will be another podcast for us. Oh, no. I'm gonna get you into privacy immediately after this, Denny. I'm so sorry. Perfect. Well, I get it, but see, that's almost the context. I was like, each step is its own gigantic
[32:37] Databricks’ Denny Lee:
rat hole that you're gonna get into. Yes. And so as opposed to trying to rat hole and cover everything all at once, unequivocally start with your actual business problems. So then you can understand. What is the actual MVP? What's your goal? What are you trying to solve? Now, I don't think the advice that we just talked about here is specific to GenAI at all. This is basically any software engineering endeavor, which is stop looking at the whole thing,
[33:05] Databricks’ Denny Lee:
focus on one small thing, do this a bunch of times, get your other teams to learn, then you can start looking at the whole thing when everybody's actually talking the same language. Right? And so this is the exact same play we have to do with GenAI. We have to, because everybody's going like, oh, I'm witnessing this really cool research that I'm gonna go run into. That's great.
[33:30] Databricks’ Denny Lee:
Like and we're guilty of it too. I agree sometimes. Yeah. If you listen to Jonathan Frankle, he's the chief scientist for Databricks. That's smart. From Mosaic. Right? He constantly talks about the fact that when we work with these models, like when he looks at the research, he'll tag it, but he'll wait three months. Maybe three months is a little long, but you get my drift. He'll wait. And the whole reason he says he waits is that he's going like, I don't know what's sticking.
[33:57] Databricks’ Denny Lee:
There's too much stuff. Like, as as a company, in our case, we're Databricks, where we have enterprise customers, we can't just simply jump to what the hottest thing is. We actually need to go ahead and know what's valuable to our customers in the first place. So we wanna know what's valuable for our customer in the first place. We have to see what's sticky and then do the research, have our researchers actually analyze,
[34:22] Conor Bronsdon:
see, is it actually effective? So now it's sticky. Now people are hearing about it. Great. Is it even effective or not? A good example of this is Model Context Protocol recently. Anthropic announced it at what? End of November. Then Correct. It got really hot
[34:36] Databricks’ Denny Lee:
early March, end of March kind of thing, and suddenly it's exploding. It's everywhere. But it took a few months to really set in for people to get used to it. And then now you can't escape it on LinkedIn. Right? No. No. No. No. I agree. Now the biggest problem with MCP, and don't get me wrong. I am a backer of MCP as well. But I think the biggest problem is now everybody's like, MCP. I'm like, do you know what you're using it for? Yeah. And so it goes back to: hey, this actually makes sense for this use case that I have. Exactly.
[35:02] Conor Bronsdon:
Yeah. But it's a lot of fun to say, hey, I have an agent and it's using MCP because those are the terms people are using right now. Exactly. Exactly. Buzzword bingo all over again. Exactly. Yes. Yeah. And I think this is I mean, honestly, Denny, you've given us so many great topics of conversation here. I feel like I could spend an hour with you on each one of those. And sadly, we are time limited today. But No. It's okay. The great thing about
[35:25] Conor Bronsdon:
what you're sharing here probably is any business user, any builder, they can come in and say, okay, this is my specific use case. These are the key points I need to apply from this conversation, and then dive deeper into their next piece of research. And you mentioned something that I said I wanted to touch on, which is the privacy element of this. We've kind of briefly alluded to it, but there's this impulse
[35:51] Conor Bronsdon:
often for, and I'm guilty to be clear, shove everything into an LLM and treat it as omnipotent and have it, you know, do the thing for me. And hell, I do this with, you know, first drafts on a blog sometimes. But this approach often comes with limitations and risks around things like access control, potential IP risks. What's your perspective on how we should be thinking about privacy in this current era?
[36:15] Databricks’ Denny Lee:
Okay. So I'll give you a specific set of examples to provide context and we can grow from there. Okay? So I was fortunate where I actually, back in 2007, I think, I was at Microsoft and actually worked on differential privacy back then with privacy preserving algorithms. And basically long story short, Dr. Latanya Sweeney, she developed a paper called k-anonymity.
[36:44] Databricks’ Denny Lee:
And the whole premise was that she took the public voting records from Massachusetts and what was considered at the time, and may still even be considered currently, as PII medical records. Okay? And she joined them together. She had to pay $25 because she had to get the CD of the voter records. Okay? That's what she had to do. Okay?
[37:08] Databricks’ Denny Lee:
She combined the data together and she asked three questions. That's it. She asked three questions and then discovered then governor William Weld's medical records. That's all it took. Okay? And so what she observed was that, and this is back in like the late nineties, early two thousands. Something like, I'm pretty sure the number's not quite right, because I'm trying to do it from memory here right now. But something like 83%
[37:39] Databricks’ Denny Lee:
of the US population was uniqueified. In other words, I don't wanna say identified, but uniqueified. You can uniquely determine that that is an individual, was uniqueified by simply birth date, five digit ZIP code, and gender. That's all it took. So think about when you have an LLM and you're throwing the entire Internet in there. All it took, like I said, five digit ZIP code, gender, and birth date, and I'm actually able to uniqueify that individual.
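
To make the uniqueness point concrete, here is a minimal sketch that counts how many records are the only ones with their particular combination of birth date, five digit ZIP code, and gender. The records are made up for illustration and the function names are hypothetical; this is not Dr. Sweeney's data or method, only the same kind of check run on a toy table.

```python
from collections import Counter

# Hypothetical toy records: (birth_date, zip5, gender). Entirely made-up data.
RECORDS = [
    ("1961-03-14", "02138", "M"),
    ("1961-03-14", "02138", "M"),
    ("1975-07-02", "02139", "F"),
    ("1982-11-30", "02138", "F"),
    ("1990-01-09", "02140", "M"),
]


def group_sizes(records):
    """Count how many records share each quasi-identifier combination."""
    return Counter(records)


def fraction_unique(records):
    """Fraction of records whose combination appears exactly once."""
    sizes = group_sizes(records)
    unique = sum(1 for record in records if sizes[record] == 1)
    return unique / len(records)


def k_anonymity(records):
    """Smallest group size; a table is k-anonymous if every group has at least k rows."""
    return min(group_sizes(records).values())


if __name__ == "__main__":
    print(f"uniquely determined: {fraction_unique(RECORDS):.0%}")
    print(f"k = {k_anonymity(RECORDS)}")
```

A table where `k_anonymity` returns 1 contains at least one person who can be singled out by those three fields alone, which is the situation described above.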
[38:15] Databricks’ Denny Lee:
That is ultimately the reason why when you look at large language models, when you look at people are there's all sorts of people who we have I'm not claiming we're successful at all, by the way. I'd be very careful about this. Okay? Like, I hope by the time, you know, people are listening to this at a later time, we will have been successful. But the reality is because we shove so much data in there, we actually can't guarantee
[38:37] Databricks’ Denny Lee:
the privacy of an individual. So we actually have to do things like proper data lineage, be able to go ahead and track what hyperparameters we're using, the whole kit and caboodle so that way we can act and this is also part of the reason why synthetic data is actually so important. So that way we can actually try to have some form of control to minimize the likelihood that we're exposing individuals.
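
As a rough illustration of the tracking described here (which dataset, which model version, which hyperparameters), the sketch below writes one run record as a JSON line using only the Python standard library. The `RunRecord` fields and the `to_log_line` helper are hypothetical names chosen for this example; they are not the Unity Catalog API or any other product's API.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class RunRecord:
    """One lineage entry: what data and what model produced a result (hypothetical schema)."""
    dataset_name: str
    dataset_version: str
    model_name: str
    model_version: str
    hyperparameters: dict
    logged_at: str = field(default_factory=_utc_now)


def to_log_line(record):
    """Serialize a record as a single JSON line suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    record = RunRecord(
        dataset_name="support_tickets",
        dataset_version="v3",
        model_name="ticket_summarizer",
        model_version="1.2.0",
        hyperparameters={"learning_rate": 0.001, "epochs": 3},
    )
    print(to_log_line(record))
```

Appending one such line per training or inference run is enough to answer, after the fact, which data and which model version were involved.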
[38:58] Databricks’ Denny Lee:
Basically, that's what it boils down to. And differential privacy in its most simplistic form in terms of privacy preserving histograms is more like when you in aggregate are asking a query, say, hey, okay, how many people are diabetic in the state of Oregon in this county? Okay? The answer is some number, let's just say 23. Okay? Sure. So we add some
[39:27] Databricks’ Denny Lee:
noise to it. So now the number becomes 26. Okay? And so this is the epsilon noise, we just add a little tiny bit of noise. And so the idea is that as you ask questions, more noise is added to each one. And if you add even more questions, in fact, more noise is actually added. By doing this, there are mathematical guarantees to basically
[39:51] Databricks’ Denny Lee:
protect the individual underneath that data. That's when you're looking at the query in aggregate. Now I wanna emphasize that point. I said in aggregate because guess what? When we're asking LLMs things, we're not getting an aggregate answer. We're often getting very specific answers. So now how do you guarantee that? Which means the only way to protect the individual underneath the data is that we actually do things either one, which is to go ahead and generate data
[40:20] Databricks’ Denny Lee:
that actually already protects the individual underneath that data, or two, develop various techniques and algorithms which actually remove information from the LLM in the first place. There's various levels of success with doing that. But typically, I prefer the other one, which is basically the data you generate. In other words, if you know that there are three Seahawk fans
[40:44] Databricks’ Denny Lee:
in these eight people. So you still attribute three Seahawk fans, but not necessarily to the same three people, but still across the eight. Right? But the Cartesian on that or permutation on that becomes insane. So that's why there's only so much matrix gouge about you can do before this entire thing, the entire statement I just said, falls apart, right? But that's more or less the context. The context basically is
[41:07] Databricks’ Denny Lee:
understanding there is a problem. There are some techniques that are actually very helpful. In fact, the Gretel folks and I have been having some very good conversations exactly about differential privacy for LLMs. So I definitely would check out the Gretel stuff in terms of what they've been able to go do. But that's more or less the context. I'm saying like, we're still in infancy.
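
The privacy preserving count described a moment ago (a true answer of 23 reported as something like 26) can be sketched with the standard Laplace mechanism. This is a minimal illustration, not the system discussed in the episode: the function names are hypothetical, the sensitivity of 1 assumes a simple counting query, and splitting one total epsilon evenly across queries is an assumed budgeting policy that shows why asking more questions means more noise per answer.

```python
import random


def laplace_scale(epsilon, sensitivity=1.0):
    """Noise scale b = sensitivity / epsilon; smaller epsilon means more noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Return the count plus Laplace noise calibrated for a sensitivity-1 counting query."""
    return true_count + laplace_noise(laplace_scale(epsilon), rng)


def per_query_epsilon(total_epsilon, num_queries):
    """Assumed policy: split one privacy budget evenly across all queries."""
    return total_epsilon / num_queries


if __name__ == "__main__":
    rng = random.Random(0)
    true_count = 23  # the example count from the conversation
    for num_queries in (1, 10):
        eps = per_query_epsilon(1.0, num_queries)
        answer = noisy_count(true_count, eps, rng)
        print(f"{num_queries} queries, epsilon per query {eps}: noisy answer {answer:.1f}")
```

With a fixed total budget, ten queries each get a tenth of the epsilon and therefore ten times the noise scale. That guarantee covers aggregate answers like a count, which is why it does not carry over directly to the specific, free-text answers an LLM returns.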
[41:30] Databricks’ Denny Lee:
But you as a person who is building these applications or building these models, unequivocally, it is important for you to take into account like, what are you doing? Are you running the risk? Well, first things first, just to wrap right back to the beginning of our entire conversation. Record everything. Log everything. So then you know, or at least even if you made the mistake, you can backtrack and figure out what the heck you
[41:56] Conor Bronsdon:
Fantastic. Thank you so much, Denny. This has been a wide ranging conversation, but not only have you managed to wrap it up with a nice little bow, but you've given us a lot of different areas to dive in deeper. We'll have to have you back for another deeper conversation on a couple of these topics, and our listeners should have a lot of follow-up material to dive into. So thank you so much for sharing your experience and perspective. It was a pleasure having you on the show. Where can listeners go to find out more about you and about your work? Where can they follow you?
[42:21] Databricks’ Denny Lee:
Oh, honestly. First of all, thank you very much for inviting me. Really appreciate it. Our pleasure. Honest to God, you could just find me at databricks.com. Like the YouTube. The one little shout-out I would do is the Data Brew by Databricks podcast. That's where a lot of our GenAI stuff comes from in the first place. Great show. Definitely listen to it. Yeah. Please join us there too. So Fantastic. Well, we will make sure to link all of this in the show notes. Denny, great to see you as always. Looking forward to seeing you in a few weeks at Data and AI Summit,
[42:47] Conor Bronsdon:
in San Francisco with Databricks. And thank you everyone for tuning in to this episode of Chain of Thought. If you enjoyed Denny, definitely check out his podcast with Databricks. And, of course, you can always check out Galileo's podcast, Chain of Thought, on YouTube or your favorite podcasting app of choice. So make sure you have subscribed. Make sure you've liked, you know, maybe given us a review if you love us on Spotify or Apple Podcasts. We always love those, and it helps other folks find our show as well.
[43:11] Conor Bronsdon:
So thank you so much for tuning in. We'll be back next week with more insights in the world of AI. And, Denny, thank you so much again. It's been a pleasure. Thanks a lot. Appreciate it.