AI Security Ops

In this episode of BHIS Presents: AI Security Ops, the team breaks down model ablation — a powerful interpretability technique that’s quickly becoming a serious concern in AI security.

What started as a way to better understand how models work is now being used to remove safety mechanisms entirely. By identifying and disabling specific components inside a model, researchers — and attackers — can effectively strip out refusal behavior while leaving the rest of the model fully functional.

The result? A fast, reliable way to “de-safety” AI systems without prompt engineering, fine-tuning, or significant compute.

We dig into:
• What model ablation is and how it works
• The difference between ablation and pruning
• How safety behaviors can be isolated inside model internals
• Why refusal mechanisms are often localized (and fragile)
• How ablation is being used as a jailbreak technique
• Why this is more reliable than prompt-based attacks
• Risks specific to open-weight models and public checkpoints
• The growing “uncensored model” ecosystem
• Why interpretability is a double-edged sword
• Whether safety should be deeply embedded into model architecture
• What this means for defenders and AI security strategy

This episode explores a critical shift in AI risk: when safety controls can be surgically removed, they stop being security controls at all.



📚 Key Concepts & Topics

Model Internals & Interpretability
• Neurons, attention heads, and residual stream analysis
• Activation space and feature directions

AI Security Risks
• Prompt injection vs. structural attacks
• Jailbreaking techniques and safety bypasses

Model Access & Risk Surface
• Open-weight vs. API-only models
• Hugging Face and the uncensored model ecosystem

AI Safety & Governance
• Defense-in-depth for AI systems
• Future standards for ablation resistance

#AISecurity #ModelAblation #LLMSecurity #CyberSecurity #ArtificialIntelligence #AIResearch #BHIS #AIAgents #InfoSec

  • (00:00) - Intro & Show Overview
  • (01:27) - Removing AI Safety Mechanisms
  • (02:05) - What Is Model Ablation? (Technical Breakdown)
  • (04:01) - Open-Weight Models & Practical Limitations
  • (05:43) - Risks, Use Cases, and Ethical Tradeoffs
  • (07:32) - Security Implications & “You Can’t Ban Math”
  • (10:43) - Future Impact: Open Models Catching Up
  • (17:44) - Final Takeaway: Why “No” Isn’t Security

Click here to watch this episode on YouTube.


Brought to you by:
Black Hills Information Security 
https://www.blackhillsinfosec.com

Antisyphon Training
https://www.antisyphontraining.com/

Active Countermeasures
https://www.activecountermeasures.com

Wild West Hackin Fest
https://wildwesthackinfest.com

🔗 Register for FREE Infosec Webcasts, Anti-casts & Summits
https://poweredbybhis.com


Creators and Guests

Host
Brian Fehrman
Brian Fehrman is a long-time BHIS Security Researcher and Consultant with extensive academic credentials and industry certifications who specializes in AI, hardware hacking, and red teaming, and outside of work is an avid Brazilian Jiu-Jitsu practitioner, big-game hunter, and home-improvement enthusiast.
Host
Bronwen Aker
Bronwen Aker is a BHIS Technical Editor who joined full-time in 2022 after years of contract work, bringing decades of web development and technical training experience to her roles in editing pentest reports, enhancing QA/QC processes, and improving public websites, and who enjoys sci-fi/fantasy, Animal Crossing, and dogs outside of work.
Host
Derek Banks
Derek is a BHIS Security Consultant, Penetration Tester, and Red Teamer with advanced degrees, industry certifications, and broad experience across forensics, incident response, monitoring, and offensive security, who enjoys learning from colleagues, helping clients improve their security, and spending his free time with family, fitness, and playing bass guitar.

What is AI Security Ops?

Join in on weekly podcasts that aim to illuminate how AI transforms cybersecurity—exploring emerging threats, tools, and trends—while equipping viewers with knowledge they can use practically (e.g., for secure coding or business risk mitigation).

Bronwen Aker:

Welcome to AI Security Ops, the podcast where we cut through the hype and explore the real-world intersection of artificial intelligence and cybersecurity. Each week, we examine how AI is reshaping both sides of the security landscape: the threats that we're facing and the defenses that we're building. I'm Bronwen Aker. And in this episode, we're going to talk about model ablation. This show is brought to you by Black Hills Information Security and Antisyphon Training.

Bronwen Aker:

BHIS helps organizations identify and close real-world security gaps through penetration testing, adversary emulation, purple team engagements, and managed detection and response. And we really do all of those things. It's kinda cool. Antisyphon Training delivers hands-on, practitioner-led training built around real tactics and real tools so you can apply what you learn immediately. For more information, go to blackhillsinfosec.com and antisyphontraining.com.

Bronwen Aker:

Alrighty. So, Brian, tell us all about model ablation.

Brian Fehrman:

Model ablation. Alright. Let's set the stage. How many people out there have gone to ask a question of a model, say, in, like, a security context in particular, or maybe just out of curiosity, probing around at the model to see what it will and won't do, and it says something like, I'm sorry, I can't help with that.

Brian Fehrman:

Well, yep, same here, plenty of times, plenty of different reasons. We don't have to go into all of them. But what if you could surgically remove the part of the model that does that, and you keep the rest of the model working just fine? Well, it's not hypothetical. People are doing it right now, and it is what is called ablation.

Brian Fehrman:

So, Derek, what is ablation from a more technical standpoint?

Derek Banks:

So I kinda wish that I didn't get rid of the, you know, life-sized Harry Potter sorting hat that my kids had, because I would've worn it for this episode. Because I swear every time I hear ablation or obliterate, it just makes me think of Harry Potter and casting Obliviate. So that's why I said obliterate.

Bronwen Aker:

I should get my Ollivander's wand out. I'm

Derek Banks:

sure you have one. I mean, they're still into Harry Potter for sure, but just not quite that much. But yeah. So from a technical standpoint, the obliteration process goes inside of the neural network of the large language model and removes the neurons, so to speak, that are responsible for the safety refusal direction. And when I say remove, basically what happens is there's a mathematical process, orthogonalization, that will essentially negate (negate is probably more accurate) the refusal direction, the vector. The neurons that activate that refusal direction are essentially just not in the model anymore. They're not mathematically computable.

Derek Banks:

The way that this works is that you have a series of harmful responses and a series of normal responses, and you calculate the activations inside of the neural network with math and code.
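
For anyone who wants to see the shape of that math, here is a minimal, illustrative sketch in PyTorch. The function names, shapes, and random data are placeholders of ours; it assumes you have already captured residual-stream activations for a set of harmful and a set of harmless examples (for example, via forward hooks on an open-weight model), and it sketches the general "refusal direction" idea rather than the exact procedure from any particular paper or tool.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means over activations -> unit vector along the refusal direction.

    harmful_acts, harmless_acts: [n_prompts, d_model] activations captured at the
    same layer/position for the two sets of examples.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def orthogonalize_weight(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the refusal direction from a weight matrix that writes into the
    residual stream (shape [d_model, d_in]), so its output no longer has a
    component along that direction.
    """
    proj = torch.outer(direction, direction @ W)  # component of W along the direction
    return W - proj

# Toy usage with random data, just to show the shapes involved.
d_model = 64
harmful = torch.randn(32, d_model)    # stand-in for activations on harmful examples
harmless = torch.randn(32, d_model)   # stand-in for activations on harmless examples
r_hat = refusal_direction(harmful, harmless)

W_out = torch.randn(d_model, 4 * d_model)      # e.g., an MLP output projection
W_ablated = orthogonalize_weight(W_out, r_hat)

# Anything the ablated matrix writes into the residual stream now has
# (approximately) zero component along the refusal direction.
print((r_hat @ W_ablated).abs().max())
```

In the real workflow this is typically repeated for each layer's matrices that write into the residual stream (attention and MLP output projections), which is why the rest of the model keeps working while the refusal behavior goes away.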

Brian Fehrman:

Okay. That sounds pretty cool. Sounds a little bit like a digital lobotomy, if you will, for...

Derek Banks:

Yeah. It does. That's a good way to look at it, sort of like a digital lobotomy. Now an important caveat: this is for open-weight models only. You need to have the entire model file and be able to run that entire model file on GPUs, plus some overhead too.

Derek Banks:

I recently went through the fascinating and interesting process of ablating Qwen 3.5, and my buddy Claude was helping me work on that. And it ended up being

Brian Fehrman:

fantastic. Attacking AIs.

Derek Banks:

Yeah. AI attacking AI mathematically. It's very helpful with that, because it was for research purposes and my class, so it worked out fine. That's actually what led to the notebook in our class, that process.

Brian Fehrman:

Okay. Yeah. And so that was a good point you touched on with the open weight. So this isn't something that people can just go and do on, say, like, ChatGPT or Claude. Correct?

Derek Banks:

Right. Correct.

Brian Fehrman:

Okay. That makes sense. So you gotta go out and you gotta get something from, like, Hugging Face or Ollama or somewhere else where you can actually download and have a copy of the model. Like, you can't just attack a model that's hosted out there, for instance.

Derek Banks:

And in my experimentation, I was actually using NVIDIA GPUs and not the Metal shader platform on my MacBook. And the reason that I chose that route from the beginning was because, typically, libraries have better CUDA support than they do Mac support in the Python world.
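
As a small aside, that device choice usually comes down to a check like the one below in PyTorch; this is a generic sketch, not Derek's actual notebook code.

```python
import torch

# Prefer CUDA (NVIDIA), fall back to Apple's MPS backend, then CPU.
# CUDA support in the Python ML ecosystem is generally the most mature.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Running on: {device}")
```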

Brian Fehrman:

Okay. Okay. Yeah. So, Bronwen, what are your thoughts on this, the ability for people to do this? Do you think the ability for people to do the ablation process is a concern, something that we should be looking into?

Brian Fehrman:

What are your thoughts on it?

Bronwen Aker:

For me, the whole issue of ablation falls into a similar category as a lot of other things related to AI. It's both a blessing and a bane, because on the one hand, people can ablate or obliterate a model and then use it for nefarious purposes. But on the other hand, sometimes the guardrails that are baked into the models are almost too restrictive. So there may be times where you need to maybe not remove all the brakes, but maybe take them down a notch. And actually, now this is a really good question.

Bronwen Aker:

Is ablation an all or nothing process, or is it something that can be done in degrees?

Derek Banks:

I would probably look at it as a surgical process. Right? Like, you don't want to over-ablate, I guess, is the right way to say it, because you can affect the actual performance of the model. So you could negate too many neurons, and then it won't function appropriately. And so if you go through the ablation process, typically you're gonna wanna also verify through more testing that you haven't harmed the model itself in some way, that it can still function.

Derek Banks:

So I would look at it more like, you know, there are billions of parameters inside of one of these models, and we're talking about maybe a few thousand or fewer parameters that we're going to be mucking around with.
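
A rough sketch of what that post-ablation verification can look like, assuming a Hugging Face transformers text-generation pipeline; the model path, prompts, and refusal markers below are placeholders, not a prescribed test suite.

```python
from transformers import pipeline

# Crude heuristics for spotting refusal-style completions (placeholder list).
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def refusal_rate(generator, prompts, max_new_tokens=64):
    """Fraction of prompts whose completion looks like a refusal."""
    hits = 0
    for p in prompts:
        out = generator(p, max_new_tokens=max_new_tokens, do_sample=False)[0]["generated_text"]
        completion = out[len(p):].lower()
        hits += any(marker in completion for marker in REFUSAL_MARKERS)
    return hits / len(prompts)

generator = pipeline("text-generation", model="path/to/ablated-model")  # placeholder path

# 1) Did refusals actually drop on prompts the original model refused?
print("refusal rate:", refusal_rate(generator, ["<previously refused prompt>"]))

# 2) Does it still answer ordinary questions sensibly? Spot-check by eye,
#    or run a small benchmark/perplexity check for a more rigorous answer.
print(generator("Explain what a port scan is.", max_new_tokens=64)[0]["generated_text"])
```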

Bronwen Aker:

That's good to know. Well, I mean, like I said, I see both good and bad use cases. For example, I know that sometimes our testers are trying to get an answer out of an LLM, either Copilot or Claude or some other tool. And because of the guardrails that are there to protect less knowledgeable people from doing things inadvertently, it can become a hindrance.

Bronwen Aker:

And especially because white hat cybersecurity researchers need to be able to simulate and emulate the same things that the bad guys are doing. And sometimes that means ablating or jailbreaking or doing something else that is normally considered nefarious. So I don't have any feelings about it one way or another. It's another situation where there are use cases both for and against ablation. It's all about doing what is appropriate given your specific scenario.

Brian Fehrman:

Yeah. I think that's spot on. I mean, the access to the knowledge itself isn't what's dangerous. Right? It's what people choose to do with it.

Brian Fehrman:

And so by trying to shut off the ability for all of the bad people, if you will, to access this knowledge, you're also shutting out the good people who have good intentions. It's similar to what we're seeing with, you know, certain companies who are taking down repositories that contain pen testing tools and other security information. And it's like...

Derek Banks:

Those companies shall remain nameless for the purpose of this podcast, but please stop doing that.

Brian Fehrman:

Yes. I purposely did not say any names, but we all know who they are. And you're not really stopping anybody who wants to do these things. You're just making it annoying for the people who have good intentions, I'll say, in my opinion. Right?

Brian Fehrman:

And I think ablation is one thing that can help alleviate some of those...

Derek Banks:

I mean, that's typically how bans go. Right? You ban something, then everybody's like, well, you know, you mean well, but it's not really doing what you think it's going to do. Right? Because bad guys will still do what bad guys do.

Derek Banks:

And if we bring it back to model ablation, what are you gonna do? Outlaw math? Yeah. Good luck. Right?

Derek Banks:

Like, I mean, it's the same thing as copyrighting math. Right? Like, okay. Good luck here. Copyrighting math.

Derek Banks:

So I know, I mean, the cat's out of the proverbial bag, but then, you know, I tend to use frontier models most of the time, and I have pretty good luck with my ambiguous text like we talked about last episode, or being blatant and saying, I'm on an authorized penetration test, let's go hacky hack. And usually, that gets me pretty far. Right? And so I think that for me, until local models really catch up to what frontier models can do, it's kind of, okay.

Derek Banks:

Well, this is neat, and it's certainly capable. I mean, I think that local model capabilities now, our open-weight models, are, you know, where frontier models were, like, a year and a half or two years ago. Although I've heard that some of the most recent ones are only a couple months behind in terms of capabilities. But I think this will matter more in the future. I mean, certainly, if I had something as powerful as Opus 4.6 that I could run locally on my MacBook and ablate it, it basically would help me do whatever I wanted to do, you know, take over the world.

Derek Banks:

Let's, you know, let's go with our Pinky and the Brain plans here. I can see those days coming pretty quick. So...

Brian Fehrman:

Yeah. Like you said, I mean, it seems like these open-weight models are very quickly catching up, through, you know, the distillation processes that they do, to be able to encapsulate the functionality of the larger models in a much smaller package, as I've kind of ranted about before. I mean, these huge frontier models have so much in there that most people don't need to perform a lot of the tasks you're gonna do, because they basically just looked at the data on the Internet and said, yes, we'll take all of that. And we don't necessarily need all of that.

Brian Fehrman:

And I think that some of these smaller, more distilled models that are coming out from an open-weight standpoint, and that are catching up with the performance of these larger models, are showcasing that fact. And I completely agree; I think that is going to be much more of a focus in the, you know, near future. So...

Bronwen Aker:

I hope so. I mean, I understand the desire to get bigger and bigger and more comprehensive models. But at the same time, it seems to me that we'd get a better ROI if we had smaller, more lightweight models, and then used them as an interface to interact with data, whether it's in a RAG stack or in some other format. And that reduces the amount of weight and computational resources that are needed on the local system. And like I said, I understand the bigger-is-better headspace, although I don't necessarily subscribe to it.

Bronwen Aker:

But it'll be interesting. But this sort of thing, if the data wasn't in the models, would we care about ablating them?

Brian Fehrman:

Yeah. Well,

Derek Banks:

you know, I mean, that's the thing. Like, if you think about it, we'll just go with the classic prompt injection example of help me make meth. Right? So if you can ablate a model and get, like, a picture-perfect recipe for making meth, which you can totally do, is it that, you know, during the training process, there was a complete and perfect picture of making meth inside of the training data? Actually, probably not.

Derek Banks:

Probably what has happened is the pieces of it were around the Internet and the model's essentially, you know, probabilistically piecing it together. And so the same thing with, like, you know, bioweapons or nuclear stuff. It's not necessarily that it's in the training data. It's that the model is capable enough of putting it together.

Brian Fehrman:

Yeah. Yeah. Pattern matching.

Derek Banks:

Yep. Which is fascinating, by the way, because I sometimes wonder if we really realize what we've created.

Brian Fehrman:

I still don't think it's completely understood how it's working.

Bronwen Aker:

Everything that I'm seeing coming out of the newsletters, blogs, and posts by the frontier developers is that they still have no idea how it works underneath.

Derek Banks:

Well, I think, especially scaling-wise, when it's pared down a little bit, like, yeah, we kinda get, you know, the process of what's happening inside of an LLM. But when you scale it so large, I'm not sure that we quite have a complete and total picture. And, you know, looking at the fight between Anthropic and the US government at the moment, apparently the Department of War thinks that Opus 4.6 is sentient, which is kinda fun if you think about it. Like, what does that even mean? Like, what do you mean?

Derek Banks:

But apparently, during, you know, the planning for the conflict with Iran, it was able to do better planning, like planning out strikes, than, like, any human's ever been able to do. And, you know, I obviously only know what I'm reading in the news. Right? But I guess, you know, you have one branch of the government that's like, alright. That's the stuff.

Derek Banks:

We need that. Yeah. I'm not really sure we quite understand what's happening in there. But...

Brian Fehrman:

Yeah. I agree. I mean, I like to make analogies to us. Whenever it comes to artificial intelligence, any parallels you can draw to people I always think are very fascinating, because, I mean, that's kind of what has spurred this interest since back in the fifties and sixties, and probably even before, you know, people had these ideas.

Brian Fehrman:

But, you know, when it comes to, like, the brain, for instance, we have a general understanding of what's going on in there, but we don't have a deep level of understanding of exactly how it's working. Sure, we understand the synapses and, you know, the axons, the neurons, some of the signals flowing between them, kind of at a high level. Obviously, we can see that, but we don't have a real in-depth understanding of how those components all actually fit together. I mean, I think if we did, we could probably literally upload knowledge, like, Matrix style, just jack something in and suddenly, you know, kung fu or whatever.

Brian Fehrman:

No. We're not at that level. Right? And I think it's the same thing with the LLMs.

Derek Banks:

Yeah.

Brian Fehrman:

Yeah. So, yeah, I mean, I guess, kind of wrapping up here. One other point I wanna throw in too is, for those who are interested in ablated models and wanna play with them but don't want to go through the ablation process yourself, you can go out on Hugging Face and Ollama, and there are ones that are already uploaded that are ablated. They'll be labeled abliterated. Huihui, spelled h-u-i-h-u-i, is one of the big people who puts them out.

Brian Fehrman:

But, yeah, you can go and play with them.
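
If you just want to experiment with one of those pre-ablated models, pulling it through the Ollama Python client looks roughly like this; the model tag is a placeholder, so substitute whatever ablated/abliterated model you actually find on Ollama or Hugging Face.

```python
# Minimal sketch using the ollama Python client (pip install ollama).
import ollama

MODEL = "some-abliterated-model"  # placeholder tag, not a recommendation

ollama.pull(MODEL)  # download the model weights locally first

response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize what model ablation does."}],
)
print(response["message"]["content"])
```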

Derek Banks:

But, yeah. Specifically, their GPT-OSS 20-billion-parameter model seems pretty good. I've had mixed results with the 120-billion-parameter model, but from them, that's mostly what I've played with. But the Qwen 3.5 one that I ablated has been pretty nice.

Brian Fehrman:

Yeah, I've heard good things about that one. Yeah.

Derek Banks:

I should put that on Hugging Face.

Brian Fehrman:

Yeah. Definitely. Oh, so yeah. So, I guess, the takeaway is, ablation, I think, really makes one thing clear.

Brian Fehrman:

And that is that a model saying no is not a security measure. It's just kind of a speed bump. There are multiple ways that we can get around that. It is a known process, and certainly something that's not going to go anywhere anytime soon. So, yeah, I guess that's it.

Brian Fehrman:

Let's wrap it up. So I hope everyone enjoyed that and learned something. We'll see you next time and keep on prompting.