Built This Week is a weekly podcast where real builders share what they're shipping, the AI tools they're trying, and the tech news that actually matters. Hosted by Sam and Jordan from Ryz Labs, the show offers a raw, inside look at building products in the AI era—no fluff, no performative hype, just honest takes and practical insights from the front lines.
I built an LLM math roaster. You put in a math problem; it'll score it, roast it, and we'll go through that in just a second. You can actually submit a custom problem. So if you have your own problem, you can simply do that.
Jordan:It's awesome.
Sam:Hey, everyone, and welcome to Built This Week, the podcast where we share what we're building, how we're building it, and what it means for the world of AI and startups. I'm Sam Nadler, cofounder here at Ryz Labs. And each and every week, I'm joined by my friend, business partner, and cohost, Jordan Metzner. And this week, we have a special guest from Axiom Math dot ai, Carina. Hello, Carina, and welcome to the show.
Sam:If you don't mind, give us a one-minute intro about who you are and what Axiom Math does.
Carina:Hi, Sam and Jordan. Hi, everyone. I'm Carina, founder and CEO of Axiom. At Axiom, we are building a self-improving superintelligent reasoner, starting with an AI mathematician: AI that can do really good math.
Carina:It can prove theorems. It can formulate new problems, and it can discover new mathematical objects, like graphs and fancy functions that show up in important partial differential equations, etcetera. Yeah. Great to be here.
Sam:Super exciting topic. We're gonna jump in. Just to cover the high-level docket for today: we're gonna go into a cool tool that Jordan built to put the current LLMs up against each other on math proofs. We're also gonna dive into Axiom Math and your mission, and hear directly from the source itself, you, about what you guys do and how it's all coming together. And then lastly, we're gonna cover some of the latest and greatest in AI news.
Sam:I think some big fundraises as always, Bezos' new opportunity, and, yeah, just what's happening in the last couple days. So with that, Jordan, I'd love for you to walk us through what you built with Axiom in mind.
Jordan:Yeah. So, hey, Carina. Welcome to the show. Hey, Sam. Another great episode.
Jordan:Another crazy week in AI, obviously, with the news of Gemini 3 dropping this week and a bunch of other things. So, you know, it's almost never a dull moment in the AI world. You know, when we first spoke with Carina, we talked about some ideas for building some kind of vibe-coded applications that could be cool for math, and could even be cool for Axiom in general. So I built an LLM math roaster. Carina, I hope you like it, and I wanna show you how it works.
Jordan:And it's productionized, so it works pretty well. Okay. Cool. So I have it in light mode and dark mode. Pretty cool.
Jordan:Right? Alright. I have some style. Alright. So dark mode is better for math.
Jordan:Right? I feel like that's, like, kind of the consensus.
Carina:No. Well... light mode?
Jordan:Light mode? Okay. Light mode for math. Alright. So here's how it works.
Jordan:So I collected a bunch of different problems of different levels of difficulty. And then I have different models: I have Gemini 2.5 Pro, GPT-5, Claude Sonnet 4.5, and Grok 4 Fast. Now, I wanted to add Grok 4.1 and Gemini 3, but those models just came out, like, yesterday, and I don't have API access yet. And even GPT-5.1 is out, but I just didn't get to put it in in time.
Jordan:But before I run it, let me just show you a little bit about what the application does. So you put in a math problem, it'll score it, roast it, and we'll go through that in just a second. You can actually submit a custom problem. So if you have your own problem, you can simply do that. That's awesome.
Jordan:We have the history, so for all the math problems that have run in the past, we can see how well the models performed. And then, I thought this would be fun for your team, so I did a little extra and built a little API. Your team can generate their own API key, and using my API link here, they can pass in a request with a question; in the background it runs all the models, gets each one's results, and then they can compare them. Let's jump into it.
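For listeners who want to try the API Jordan describes, a minimal sketch of a call might look like the following. The endpoint URL, auth header, and response fields here are assumptions for illustration; only the flow (submit a question, get one result per model back) comes from the episode:

```python
import requests

API_URL = "https://llm-math-roaster.example.com/api/v1/roast"  # placeholder URL
API_KEY = "your-generated-api-key"  # generated in the app, per Jordan

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"question": "Prove that 2 + 2 = 4, in Lean 4 syntax."},
    timeout=300,  # all the models run in the background, so allow some time
)
resp.raise_for_status()

# Hypothetical response shape: one entry per model with its proof and timing.
for result in resp.json()["results"]:
    print(result["model"], f"{result['latency_ms']} ms")
    print(result["proof"][:200])
```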
Jordan:So first: I'm okay at math, but I'm certainly not at your level. So the first thing I wanted to do was build something really easy, to prove whether or not my model works. So I did something simple, let's just say, you know, two plus two equals four; this type of math is, you know, probably at my level, and I think I can solve it. So what I have all these models doing, Gemini and the rest, is each writing kind of a Lean proof. You can see here the response times from each of the models.
Carina:Wow.
Jordan:We can compare the results from each of the models. And then, you know, obviously you have this problem that no one knows which model is smartest. So I chose ChatGPT as the judge. ChatGPT will then evaluate the responses from all of the models. Let's just get this finished with two plus two, and then we can go into a more complicated question and see how well it does.
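The judging step Jordan describes is the standard LLM-as-judge pattern. A minimal sketch, assuming the official openai Python client; the prompt wording and the judge model name are illustrative, since the app's actual rubric isn't shown:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_proofs(problem: str, answers: dict[str, str]) -> str:
    """Have one model grade every model's proof, as the app does with ChatGPT."""
    listing = "\n\n".join(f"### {name}\n{text}" for name, text in answers.items())
    prompt = (
        f"Problem: {problem}\n\n"
        f"Candidate proofs from several models:\n\n{listing}\n\n"
        "Score each proof 0-100 for correctness and rigor, and roast any errors."
    )
    resp = client.chat.completions.create(
        model="gpt-5",  # the app uses ChatGPT as judge; the exact model is a guess
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

One caveat with this pattern, which comes up later in the conversation: the judge is itself an LLM, so for formal proofs a compiler check is the stronger verdict.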
Carina:Yeah. That's great. Oh, evaluating.
Jordan:It has to process each one of these against, like, the source of truth. So, you know, ChatGPT's feedback is that all of them are correct. Okay. Great job. So fantastic.
Jordan:Two plus two equals four. All models know that. I think we're in good shape. So far, so good on the basis of truth. Right, Carina?
Jordan:I'm not a perfect scientist, but you know, I know a little bit. I have a few different problems. Maybe you wanna choose which one? I have a
Carina:Let's do Fermat's little theorem.
Jordan:Okay. Fantastic. So that's considered an intermediate problem here. And here's the problem: for a prime number p and any integer a not divisible by p, we have a^(p-1) ≡ 1 (mod p).
Carina:That's right.
Jordan:Provide a formal proof in Lean 4 syntax. So I just read that out loud for the audience, you know. So first, let's roast the LLMs. So now we'll have the LLMs run. This is gonna take a little bit longer, so we can maybe chat in the meantime.
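For reference, the statement Jordan just read can be written and discharged in Lean 4 in essentially one line, because Mathlib already carries the result. This is a minimal sketch for orientation, not what any of the contestant models produced:

```lean
import Mathlib

-- Fermat's little theorem over ZMod p: for a prime p and a nonzero residue a,
-- a ^ (p - 1) = 1. Mathlib proves this as `ZMod.pow_card_sub_one_eq_one`,
-- so the theorem follows by applying that lemma directly.
theorem fermat_little {p : ℕ} [Fact p.Prime] {a : ZMod p} (ha : a ≠ 0) :
    a ^ (p - 1) = 1 :=
  ZMod.pow_card_sub_one_eq_one ha
```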
Carina:It's fascinating. There are so many Fermat theorems. I think, like, when I selected it, I actually had something else in mind. But yes, I think this one is Fermat's little theorem.
Jordan:But ChatGPT told me that it is when I built the app. So as I vibe coded this thing, I'm pretty sure it is. But the good part is that you can submit, you know, whatever problem you want. What I tried to do is get a collection of about 10 problems of different difficulty to test the application and see how well it does. You know, the easiest one was the Pythagorean theorem, but then I felt like, you know, even that's too difficult just to check, you know, a basis of truth.
Jordan:That's why I built the arithmetic one. Just to check, you know, if everything's working, essentially.
Carina:Yeah. And that's for Lean? So it's generating all
Jordan:Yeah. It's gonna generate a Lean response. Wow. Uh-huh. Yeah.
Carina:Yeah. Yeah. We're not seeing the Lean.
Jordan:Okay. Cool. Alright. So here we go. So here's Gemini. Here's the response.
Jordan:Here's the theorem. You know, here it just gives the... oh, here, it'll provide both versions of Lean. Okay? Here's GPT-5.
Carina:No. That is Lean. I haven't seen Lean in the Gemini Pro one yet. If you scroll down, the first one.
Jordan:It says it's below. Let's keep going. Right here.
Carina:Yeah. Got it.
Jordan:Right here. Cool. Alright. So we got import Mathlib.
Jordan:Right? So we've got like these different ones. What we can do, first, we can compare the results. So we can just review them all side by side. And then we can judge the proofs.
Jordan:I think that's the way to do it. We have two ways to do it, actually. This should score it. Yeah. So then it'll judge all the proofs.
Jordan:Mhmm.
Carina:We could also, right, take that Lean proof and put it in, like, a Lean environment and, you know, see if it compiles, which should hopefully be more painless than having an LLM judge. Yeah.
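Carina's alternative, compiling the candidate proof instead of asking another model, might look like this. A sketch assuming a local Lean 4 toolchain and an existing Lake project that already depends on Mathlib:

```python
import subprocess
from pathlib import Path

def lean_compiles(proof: str, project_dir: str) -> bool:
    """Write a candidate proof into a Lake project and check whether it
    elaborates. Exit code 0 means the proof type-checks; no LLM judge needed."""
    src = Path(project_dir) / "Candidate.lean"
    src.write_text(proof)
    result = subprocess.run(
        ["lake", "env", "lean", str(src)],  # run lean with the project's deps
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```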
Jordan:Sure. So we can take like I don't know which answer we want as
Carina:You'd probably want to take just the Lean, you know, the part that starts with import Mathlib.
Jordan:Yeah. Yeah. I got it. But I'm saying we could just pick which one of these answers we want, and we can go into a Lean editor.
Carina:The entire thing? Yeah.
Jordan:Yeah. Okay. Let's just let the judging finish and see how well that does. It's just a really cool way to compare multi-model LLM results against any dataset. Right?
Jordan:And like in this case, I chose Lean and math. But, you know, this could be a lawyer writing a contract and sending it out, you know, amongst all four models, and asking them all to bring back, you know, which contract is the best. So it could be applicable in many businesses, even ones outside the scope of mathematics. Fascinating. Cool.
Jordan:So I have a leaderboard. It gave Gemini the best score. ChatGPT is 70, and so on and so forth.
Carina:That's weird. Don't you think? For number two, it says it proves the theorem for natural numbers instead of for integers. Oh, I guess... oh, it just didn't do the negative numbers. That's fascinating.
Carina:Just a close miss. Gemini is 98. Wow. Okay. It's great.
Jordan:So I think what would be really interesting, Carina, with this type of problem set would be to get, like, thousands. Right? And just iterate this over and over again and find out where these models truly fail. And if you do this at enough iterations, like, I don't know what the right number of runs would be, but my guess is for each major type of problem set, you'd be able to find a model's weaknesses pretty quickly, and its ability to resolve those types of problems.
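The sweep Jordan is proposing is easy to sketch: run each model over a large, categorized problem set, verify every attempt, and tally failures per model and category. Here `run_model` and `check_proof` are stand-ins for the app's model calls and a verifier such as the Lean compile check above:

```python
from collections import defaultdict

def failure_rates(problems, models, run_model, check_proof):
    """problems: iterable of (category, statement) pairs.
    Returns failure rate keyed by (model, category); with enough problems
    per category, this is where each model's weaknesses show up."""
    fails, totals = defaultdict(int), defaultdict(int)
    for category, statement in problems:
        for model in models:
            attempt = run_model(model, statement)
            totals[(model, category)] += 1
            if not check_proof(attempt):
                fails[(model, category)] += 1
    return {key: fails[key] / totals[key] for key in totals}
```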
Carina:And that's actually the first thing we did, when the office was, you know, still camping chairs and a plastic foldable table. That's what we did.
Jordan:That's awesome. Okay. So anyway, that's my math roaster. I hope you like it. It's great.
Jordan:And that's enough about me. Let's jump into Axiom.
Carina:That was super cool.
Jordan:Okay, Carina. So obviously, super young entrepreneur, super excited about your new round of investment, but let's start from scratch. Tell us, like, how did this whole thing come about? What is Axiom? And what are you guys doing every day?
Carina:Axiom is, I think, the coolest AI-for-math company
Jordan:Awesome.
Carina:In my opinion. We are building an AI mathematician, a model that can not only do really well on benchmark problems, those Olympiad math problems we just saw, but also develops the capabilities of a research mathematician, including the ability to conjecture, which is to propose new problems, unseen problems, hypotheses, which the same system can then try to prove and verify in addition to solving them. So a prover and a conjecturer talking to each other, forming the self-play loop we talked about in our vision blog. Additionally, this system can also auto-formalize, which is to convert the massive math corpus into Lean, turning math into a programming language. That has a lot of great technical advantages, because we have seen reinforcement learning on verifiable domains like coding achieve incredible breakthroughs. So by combining AI, math, and programming languages, all three pillars together, we believe we'll see the same sort of step-function jump in math capabilities, especially real mathematics, serious mathematics beyond the benchmarks a lot of models have seen.
Jordan:So that's obviously incredibly interesting. Math, obviously, historically was run on CPUs. Right? So the movement of running math equations onto GPUs is really this big jump. Right?
Jordan:And the ability to essentially kind of find new opportunities within the mathematical field. Is that a good way to say it?
Carina:I think it's close. Well, we use, you know, a lot of both GPU and CPU in our engineering effort. Obviously, the large language model part runs on the GPU, and the Lean part, you know, currently compiles on the CPU. And there's an interesting engineering problem in how to maximize utilization, because one side can drag the other down; CPU work, you know, can be a little challenging to overlap with the GPU. But I wouldn't say that the compute part is the bottleneck.
Carina:I think the difficult part is the data scarcity: there just isn't that much Lean. There are more than a trillion tokens of Python code, but only about, like, 10,000,000 tokens of Lean code. That's a 100,000x data gap. So that's, I think, what is really, really hard.
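The gap Carina cites is a straight ratio:

$$\frac{10^{12}\ \text{Python tokens}}{10^{7}\ \text{Lean tokens}} \;=\; 10^{5}, \quad \text{i.e. a } 100{,}000\times \text{ data gap.}$$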
Jordan:Yeah. So, you know, speaking of that, I guess, like, synthetic data becomes much more important, generating artificial Lean problems
Carina:100%.
Jordan:If that becomes, you know, more critical in sourcing the data, because you can't just go to a library and buy, you know, a 1920s book on Lean. Right?
Carina:Well, a lot of the companies I know, especially some Chinese companies working on this, are hiring a massive number of, like, Lean data labelers, Lean programmers, to try to generate human data. Of course, that's very expensive and doesn't scale as well. We instead bet on our own synthetic data generation, partly by converting the informal data that already exists, scraping it and, you know, really augmenting it in a way that makes sense to convert to Lean; that's auto-formalization. We also have other efforts.
Carina:So given formal data, formal math data already in Lean, and that's your seed data, how do you make that 100, 1,000 times larger?
Jordan:Yeah. Okay. So speaking of Lean: now that everyone has AI, and obviously for mathematics it's almost like the jump of having a calculator versus not having a calculator. Is Lean now being taught in these, you know, mathematics education programs? And are students, like, writing their math theorems in Lean now?
Jordan:Or is that still too early and we haven't gotten there yet?
Carina:That's a great question. I think a lot of universities are teaching Lean. I know a couple of universities in the States; at the University of Washington, Professor Jarod Alper in the math department is teaching students Lean. At CMU, there's a really great Lean community; Professor Jeremy Avigad, Professor Sean Welleck, and others are teaching students Lean.
Carina:Stanford has an automated reasoning lab run by Professor Clark Barrett and other folks, and they have TAs, grad students, teaching students Lean. There are many universities in Europe and in the UK that are teaching students Lean as well. In fact, Lean started... I mean, not Lean, but Mathlib, the math library, started as a teaching effort at Imperial College London by Professor Kevin Buzzard. He wanted to teach the students enrolled in his introductory, freshman-level proofs class to write better proofs and to train them to think rigorously. So he gave these students, who might, you know, be a little bit behind on proof writing, the exercise of formalizing everything in Lean.
Carina:What he found is that it actually wasn't those students who became his audience, but the students who didn't need the class at all, you know, the ones already taking advanced grad classes. And these students, because they were bored and had nothing better to do, would love to join the effort of formalizing the undergrad corpus of math in Mathlib. And two or three of them are actually currently working with Axiom.
Carina:Some of them are, like, my closest friends from the last five years, who I met via the MIT and Imperial College London exchange program. So it's really great to have these Lean-native students; now they're PhD students, and some of them have graduated from their PhDs. But that was, like, 2019, very early on.
Jordan:Awesome. How do these, like, LLMs compare at figuring out math versus what you guys are doing? Because I think that's, you know, the whole crux of what's exciting. Right? Like, if anybody doesn't know about math and has no idea what you're doing, and they're just a layman who's into tech and AI, and they say, okay, what's the difference between this and ChatGPT?
Jordan:Like, what's the difference between an LLM and what you guys are trying to do?
Carina:Large language models are doing math in the informal space. They produce proofs in natural language, that is, English, and they will give you multistep reasoning traces if it's DeepSeek, but only, say, a numerical answer if it's ChatGPT. And the numerical answer is a verifiable output, something you can, you know, grade. Like, 971 is 971. The problem is, without the very detailed step-by-step deduction, it's hard to verify all the intermediate reasoning traces.
Carina:So it's very difficult for an expert or a layman to know whether there might be a hidden bug in, like, a 3,000-line math proof. On the other hand, what Axiom, the informal-formal hybrid prover system we are building, does is make sure the outputs we give are programmatically verified. There is a Lean program underneath, and the Lean program is executable. So in a way, we do not need an LLM judge or a human judge.
Carina:We just run it like you run a Python program. I think that's the biggest difference between these formal models and the large language models. In the formal model space, we know that Google DeepMind's AlphaProof system, the system that got the silver medal, actually one point below the gold cutoff, is a formal system. This year, they have the Gemini system, and OpenAI has the strawberry team's IMO effort. These are informal systems that got really good scores on the IMO.
Carina:The difficulty is, the IMO and other Olympiad math benchmarks sort of have a curriculum, a set syllabus of things they test on, which limits the search space quite a bit. When the proof gets very complex and multistep and requires really hard math capability, we believe formal systems are going to be extremely helpful in continuing to scale on those harder problems.
Jordan:That's so cool. Obviously, math is the backbone of kind of all technology, so it seems like there are some, you know, mathematical breakthroughs on the edge here; once these things really get their power, they can really turn out some new opportunities. That's super cool. Alright, Carina. Well, thank you for telling us about Axiom.
Jordan:Thanks for joining us. Let's talk a little bit about the news for this week, as it's been another crazy week. Gemini 3 just dropped. I don't know if you've tried it yet. I downloaded Antigravity, the new Google code editor.
Jordan:Lots to tell you guys, but I'm being throttled everywhere. In Cursor, I can't get Gemini 3 to work. I can't get Gemini 3 to work in Antigravity. I burned all my tokens. So, you know, I'm munching tokens all night long pushing Gemini 3, but I've been incredibly impressed with its performance for coding, and its design taste is also much nicer.
Carina:We like Gemini 3. It's also, like, so much cheaper to run compared to, like, GPT. One thing that's very interesting: on a certain benchmark, I can't tell you, you know, which one, but our model actually beats Gemini 3
Jordan:Oh, wow.
Carina:In math proofs, as of yesterday.
Sam:I know Gemini 3 has been outperforming almost every LLM on the market in terms of test scores. I tried to use it quickly this morning to build a personal meditation app, and I bumped into some hiccups, but maybe it was user error and not the model.
Carina:It got 100% on AIME 2025, which is the high-school level. So it's three levels before the IMO: if you win the AIME, you go to USAMO, then you go to MOP, and then you eventually go to the IMO. But it got 100% on... yeah. Yeah.
Carina:That's fascinating. But the 100% uses code execution, which shows how important the
Jordan:code is. Yeah.
Carina:Yeah. The tool is important.
Jordan:Well, it says without tools it got 95%. So, you know, obviously, that's a significant improvement.
Carina:Odd, because it has, what? Well, I guess that's odd, because, like, how many problems are on the AIME, 25?
Jordan:Well, if you compare it to Gemini, it looks like
Carina:Fifteen. 15 problems. So 15 problems on the AIME exam, so 95% is, like, you know, getting one wrong.
Jordan:Yeah. So they got an 88 last time, so they probably got two wrong without the tool. You know? Yeah. Mhmm.
Jordan:Pretty cool. But maybe it's calculated differently, because Sonnet got an 80 and it got an 88.
Carina:I guess it's, like, you know, pass@... yeah. Probably. Yeah. Pass@k.
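For context, pass@k is the standard metric from code-generation evaluations: sample n attempts per problem, count the c correct ones, and estimate the probability that at least one of k random draws succeeds, averaged over problems:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\,1 \;-\; \binom{n-c}{k} \Big/ \binom{n}{k}\,\right]$$

If one report is pass@1 with tools and another is best-of-n, the headline numbers won't match, which would explain the discrepancy they're puzzling over here.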
Jordan:Okay. But anyway, yeah, Gemini. Super cool. So, you know, I just think the Gemini model is interesting for a lot of reasons.
Jordan:I mean, one, this is the first, you know, real big model coming out of one of the big three hyperscalers, you know, Amazon, Google, Microsoft. So far, all those models have historically been behind, you know, pretty much Claude from Anthropic and OpenAI, and this is a big leap. You know, Google has a lot of money and a lot of servers and a lot of warehouses and a lot of products to disseminate Gemini 3 Pro into, from, you know, the Google suite to Search to YouTube to ads. I mean
Carina:Yeah.
Jordan:I don't know. Big opportunity.
Sam:Yeah. 650,000,000 monthly active users across Search and the Gemini app. I mean, that's an insane amount.
Jordan:Yeah. Insane amount of users. Obviously, you know, we saw this week that Warren Buffett made a massive investment in Google. And I don't know if you guys saw the videos from the Antigravity launch, the code editor launch, but Sergey was front and center, you know. So, yeah.
Jordan:Good. I mean, you know, this is about CEOs getting back into the driver's seat. So why don't we jump over to story number two? Because that's another CEO getting back into the driver's seat. Jeff Bezos is starting a new AI startup called Project Prometheus.
Jordan:He's gonna be the co-CEO. To me, honestly, this just seems like Amazon doesn't really have access to, you know, a foundational lab like Microsoft does with OpenAI, and they're trying to build their own. I mean, it seems like Google has DeepMind, and Microsoft has a relationship with OpenAI, and Amazon doesn't really have a stake in the ground. I don't know. What do you guys think about this?
Sam:Yeah. I don't know. To me, it's exciting that Jeff Bezos is, you know, coming out of semi-retirement and taking an operational role. I would argue that, you know, he was the best CEO of the last couple decades. So to get him back into the driver's seat, I think, is super exciting.
Carina:Yeah. I think this is in, like, the physical sciences space, so kind of, like, a lot of real-world testing, AI for science. I'm quite excited about that part. It seems like, you know, an OpenAI-style foundation model lab, but more applied: taking the theoretical findings and trying to figure out how to, you know, deal with materials or data.
Jordan:Exactly. Hard to say what they're gonna do. Obviously, they brought over Vik Bajaj from Verily, the biotech company at Google, or Alphabet per se. I think overall the trend is that you're getting, you know, these billionaire CEOs and retired executives who could be on a beach or whatnot, but they're coming back to build, because, you know, we're in a new build era, and that's exciting. Yeah. Okay.
Jordan:Last story of the day, real quick. Major, major funding news from Suno. I don't know if you've used Suno, Carina, but I love to use Suno. I find it to be so fun. I think it's a great party trick.
Jordan:But, yeah, Suno raised a $250,000,000 Series C from Menlo at something like a $2,500,000,000 valuation. I know we've talked about them a few times on here, Sam, but, yeah, curious your thoughts.
Sam:Suno allows you to quickly generate, like, highly produced music from a single prompt. And maybe to end out the episode, we'll have an Axiom theme song that we can, you know, layer on. But it's a really fun tool. You know, with a quick prompt about Carina and Axiom, we could have a pop song, a reggae song...
Sam:An EDM song, you name it. We could have Axiom, you know, prove out its math problems in song, theoretically. No.
Sam:I'm just joking here. But it's a fun tool that we like talking about. And in fact, the theme song of Built This Week, our podcast, was created with Suno.
Carina:Yeah. I remember playing with Suno. It's amazing: I don't know anything about composition, but maybe there's some melody I want to turn into reality. And, you know, at some point, I feel like when the AI gets really, really good, people who are hidden Ramanujans, who have mathematical creativity but have never been trained to do formal proofs, could probably, you know, use the Axiom tool and get an actual theorem, just like music creators and artists using Suno.
Jordan:Yeah. Totally. You know, we covered on a previous episode that there's a new programming language where you can program music. And obviously, music at the end of the day is incredibly mathematical. So it'll be interesting how, you know, maybe with some new math inventions, we can invent some new styles of music.
Carina:The most common double major, or, like, the humanities and social science classes, that math majors at MIT take is music, actually. It's music.
Jordan:It's incredibly mathematical. And, you know, it doesn't really matter what instrument you play; the more you learn about the instrument, the more you get into, you know, learning about the triads and the triplets and, you know, how the notes and chords are created. Alright, Carina. Well, this was a great episode.
Jordan:Thank you so much for joining us. Sam, this was awesome. Loved learning about Axiom, talking about math, building some new tools this week. Obviously, Gemini 3 has kept me up all night. I don't know about you guys, but I'm gonna continue to work on it and play with some new stuff.
Jordan:But, yeah, this was really fun.
Sam:Carina, thank you. It was a pleasure. You know, this was a pretty big pivot from our normal episode routine, and it's been extremely informative; we really appreciate you coming on.
Carina:Thank you so much. Yeah. It was a lot of fun.
Jordan:See you all next week. Bye.