AI Security Ops

In this episode of BHIS Presents: AI Security Ops, the team breaks down a new benchmarking framework designed to evaluate AI pentesting agents against real-world offensive security scenarios.

What began as experimental evaluation of “can AI hack?” has quickly shifted into something much closer to operational reality. Organizations are now seeing a surge in agentic tooling and automated pentesting workflows, where human-guided AI systems consistently outperform fully autonomous agents in complex, unsupervised environments.

As AI tooling evolves, teams must balance speed with validation, monitoring, and oversight as offensive capabilities outpace defenses.

We dig into:
  • The new “AutoPenBench” framework for benchmarking AI pentesting agents
  • Why fully autonomous AI hacking only achieved a 21% success rate
  • How human-assisted AI workflows increased success rates to 64%
  • Testing AI agents against Log4Shell, Heartbleed, Spring4Shell, and classic web exploits
  • Why modern offensive AI systems still require heavy human oversight and validation
  • How custom internal AI frameworks are already finding vulnerabilities humans missed
  • The operational role of prompt engineering, scaffolding, and agent memory
  • Real examples of AI agents mis-scoping infrastructure and chasing irrelevant targets
  • How AI lowers the barrier for ransomware operations and offensive capability development
  • Why defensive teams need stronger edge visibility, packet capture, and AI-aware monitoring strategies


📚 Key Concepts & Topics

AI Pentesting & Agentic Security
  • Autonomous AI hacking agents
  • Agentic AI workflows
  • AI-assisted penetration testing
  • Offensive security automation

Benchmarking & Evaluation
  • AutoPenBench
  • AI security benchmarking
  • Human-in-the-loop validation
  • Long-horizon task evaluation

Offensive Security Operations
  • SQL injection
  • Path traversal
  • Log4Shell / Heartbleed / Spring4Shell
  • Kali Linux offensive tooling

AI Infrastructure & Model Operations
  • Prompt engineering
  • Persistent agent memory
  • Roleplay jailbreak techniques
  • Guardrail reduction strategies

Defensive Security Strategy
  • Defense in depth
  • Edge network monitoring
  • Zeek network analysis
  • Packet capture visibility

Industry & Threat Implications
  • AI-enabled ransomware operations
  • AI-assisted red teaming
  • Infrastructure scoping failures
  • Operational scalability challenges
#AISecurity #CyberSecurity #Pentesting #AIAgents #RedTeam #EthicalHacking #CyberDefense
----------------------------------------------------------------------------------------------


  • (00:00) - Video Intro and Sponsor
  • (01:20) - AI Pentesting Benchmark Overview
  • (02:11) - How AutoPenBench Works
  • (03:44) - Real World Results and Experience
  • (05:16) - Real World Results and Experience
  • (06:48) - Human and AI Collaboration
  • (07:38) - Improving AI Agent Workflows
  • (08:56) - Model Limitations and Updates
  • (10:35) - Jailbreaks and Model Guardrails
  • (13:16) - Provider Controls and Trust Factors
  • (14:41) - Lower Barrier for Cyber Attacks
  • (15:39) - Defensive Security Implications
  • (16:59) - Why Red Teams Need AI Now



Brought to you by:
Black Hills Information Security 
https://www.blackhillsinfosec.com

Antisyphon Training
https://www.antisyphontraining.com/

Active Countermeasures
https://www.activecountermeasures.com

Wild West Hackin' Fest
https://wildwesthackinfest.com

🔗 Register for FREE Infosec Webcasts, Anti-casts & Summits
https://poweredbybhis.com


Creators and Guests

Host
Brian Fehrman
Brian Fehrman is a long-time BHIS Security Researcher and Consultant with extensive academic credentials and industry certifications who specializes in AI, hardware hacking, and red teaming, and outside of work is an avid Brazilian Jiu-Jitsu practitioner, big-game hunter, and home-improvement enthusiast.
Host
Derek Banks
Derek is a BHIS Security Consultant, Penetration Tester, and Red Teamer with advanced degrees, industry certifications, and broad experience across forensics, incident response, monitoring, and offensive security, who enjoys learning from colleagues, helping clients improve their security, and spending his free time with family, fitness, and playing bass guitar.

What is AI Security Ops?

Join in on weekly podcasts that aim to illuminate how AI transforms cybersecurity—exploring emerging threats, tools, and trends—while equipping viewers with knowledge they can use practically (e.g., for secure coding or business risk mitigation).

Brian Fehrman:

Hey, everyone, and welcome to this week's episode of AI Security Ops, where we are going to talk about a new benchmarking framework that was put out for AI pentesting agents. But before we hop into that, as always, let's talk to you about Black Hills Information Security, one of our proud sponsors, and on Derek's shirt right there, the sick logo, and that is retro. Love it. Old school. If you or your company are in need of any kind of security testing, external, internal, AD reviews, web apps, physical pen testing, wireless, social engineering, literally anything you could think of security related or security operation center monitoring type services, Black Hills Information Security offers all that and more.

Brian Fehrman:

Check us out today at blackhillsinfosec.com. Additionally, if you are interested in training for you or your organization, we also have a training branch, Antisyphon Training, where good folks at Black Hills who do these things day in and day out package up all their knowledge in a nice format for you to consume, to help you move along in your career and understand things a little bit better, at a very affordable price. So check out antisyphontraining.com. So, to hop into this: everyone's been asking, can AI replace pen testers?

Brian Fehrman:

And this paper actually tried to measure it, and the results were pretty interesting. The research was a benchmark that they put out for testing different agentic AI pen testing components, we'll say, and they call it AutoPenBench. And who is this paper actually put out by? Let's take a look here. So I don't have the author in front of me.

Brian Fehrman:

I'm sure it's multiple authors. Looks like people from a couple of different groups, PoliTo, UniTo, and NEC Lab, out of Italy and Germany, looks like. A couple of researchers out of that area. But what do we got? So they basically put together a set of hacking challenges for AI agents to attempt to complete, and they had two different flavors of these challenges: what they called in vitro, or textbook, scenarios of SQL injection, path traversal, and weak passwords, and then also more of the real-world CVEs, things like Log4Shell, Heartbleed, Spring4Shell, and a few others.

Brian Fehrman:

And what it is is the AI gets a Kali Linux machine to utilize and has to find and exploit the target without any hints. So that's pretty interesting. Derek, have you ever tried pointing AI at a CTF or lab environment to see what happens?

Derek Banks:

Well, I have. But before I talk about that, I think it's interesting that they chose to give the AI, like, a Kali Linux machine. I mean, it seems kinda like overkill, but I don't know. I mean, I guess it's one way to do it. Right?

Derek Banks:

Computer control. But I think just that in itself kinda, I don't know, defeats the, what's the term, the bitter lesson, where really you should give the AI just as much context as needed to complete the goal and let the model do, quote, the heavy lifting. And I just question whether, you know, you should just give it access to a system. I don't know. Then again, if I had done my homework, I could have read the paper, and maybe I would have the answer to my question.

Derek Banks:

So my experience is kinda like we'll talk about here in the punch line here in a little bit, but just giving AI access to tools and some prompts to go do the hacky hack kinda leads to mediocre results. I think that there's much more context that needs to be given. Long story short, I've been working on kind of a custom internal agentic framework that's custom coded, and all the agents are custom coded. And still, it does an okay job. It's found stuff that our humans missed, but also our humans find stuff that it's missed.

Derek Banks:

And so I do think that, you know, as they say here, fully autonomous had a 21% success rate in terms of the vulnerabilities that they were testing for, where human-assisted had a 64% success rate. And that's kind of been my experience: I think that at the moment, autonomous pentesting platforms aren't, quote, going to kill security, at least at this point in time, you know, in 2026.

Brian Fehrman:

Yep. No, I don't think so either. I think we still have that kind of symbiosis, might be the right word there, between the AI and the human, where they're augmenting one another throughout these processes.

Derek Banks:

Yeah. I mean, I think there's a lot of promise for sure, especially if you try, like, what we're doing: encapsulating our institutional knowledge, essentially, to give the folks who are starting a test, our analysts, kinda like a leg up in the investigation. Because, you know, AI is good at going through a lot of data and making sense of it, summarizing it. I mean, when you start off with an external penetration test, you typically start with a mound of data that needs to be analyzed. Right?

Derek Banks:

And then you move on to doing manual stuff. And, you know, I had an AI agent basically find a critical, and the tool it was using was curl to look at web services. So I don't think that it needs to be, you know, super complicated. Again, give the AI just enough tools and context and then let a human look at the results.

Brian Fehrman:

Yeah, and then kinda go through, it's gonna be an iterative process too, right? I mean, it's never gonna be, you just set this thing up, and it's ready to go. I mean, it's gonna be this nice feedback loop of you give it the instructions, it finds results, you go through the results and review, and you're like, oh, hey. I noticed that there were these things that didn't get chained together, or, hey. I think it's missing this information, or maybe should've looked here a little bit more.

Brian Fehrman:

And then you start updating your prompts, updating scaffolding and harnesses around it, maybe adding in additional tools or abilities for it, or maybe constraining things as well. You might find that it's starting to look at things you didn't want it to, which I found when I was testing out, you know, some of our stuff recently against Black Hills. It found things that were related to people who worked here but weren't actually Black Hills infrastructure.

Derek Banks:

Oh, right. It was a GitHub repo. Right? It was, like, having...

Brian Fehrman:

Yeah. Yeah. When I first looked at it, I was like, what is this thing? And then I went and did a little bit more digging and then realized, oh, the URL is, like, one of our testers' names, but it's broken up into different chunks. It's like, oh, I see it now. Right. But yeah.

Brian Fehrman:

So yep, things like that, you just iterate through and kind of, you know, make better each time. Right?
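The scoping problem described here, where an agent wanders onto a tester's personal GitHub repo, can be sketched as a simple allowlist check that runs before a discovered target is handed back to the agent. This is only a minimal illustration, not the framework discussed in the episode: the `IN_SCOPE_DOMAINS` set, the function names, and the sample targets are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; a real engagement would load this from the agreed scope.
IN_SCOPE_DOMAINS = {"example.com"}


def in_scope(target, allowed=IN_SCOPE_DOMAINS):
    """Return True if the target's hostname is an allowed domain or a subdomain of one."""
    # urlparse only fills in .hostname when a scheme or leading '//' is present.
    parsed = urlparse(target if "//" in target else "//" + target)
    host = (parsed.hostname or "").lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in allowed)


def split_findings(targets, allowed=IN_SCOPE_DOMAINS):
    """Separate discovered targets into in-scope ones and ones a human should review."""
    keep, review = [], []
    for t in targets:
        (keep if in_scope(t, allowed) else review).append(t)
    return keep, review


# Made-up discoveries: one in-scope host and one personal repo on a third-party site.
keep, review = split_findings(["https://app.example.com/login", "github.com/some-user/repo"])
```

Anything that lands in `review` goes to a person rather than back to the agent, which matches the iterate-and-constrain loop described above.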

Derek Banks:

Also, I'd be interested to see... it looks like the models that they were using were GPT-4o, Gemini Flash, and o1. Well, I mean, those are a little dated at this point. I'm not saying it invalidates the research at all. I think the percentages now would be higher than 21 and 64%, even using, you know, like, right now, some open-weight models that I've been testing with that have been specifically fine-tuned to do agentic, quote, long-horizon tasks. I mean, they're quite good.

Derek Banks:

I think I still would say that fully autonomous isn't ready for prime time. We're not gonna, you know, have the sci-fi, what is it, Neuromancer, everybody-gets-an-AI-that-hacks kind of thing. Right? Not yet. Yeah.

Derek Banks:

But it's probably coming. So Yep.

Brian Fehrman:

Yeah. I'm looking at this. So this paper actually is, I guess, a little bit dated, a little bit older. So this is

Derek Banks:

Well, you know how academic papers work. Right? Like, do the research.

Brian Fehrman:

It can take it can take forever.

Derek Banks:

And so it takes a hot minute sometimes. So they probably did this last year. Right?

Brian Fehrman:

Yep. Yeah. It looks like a yep. Somewhere around there.

Derek Banks:

Which also probably explains my earlier comment of why did you just give it Kali? Now they might take a different approach. After, you know, the Claude Code wave and the agentic coding wave, that would seem like a better choice, to kinda go down that kind of route now. But this time last year, that's not the route I was going down.

Brian Fehrman:

Yep. Yeah. And so another thing that's noted is that they had to use a role-playing jailbreak to get the LLM to complete the tasks, because the model's own safety filters were blocking it, which is to be expected. But, as we were discussing earlier, oftentimes that just means telling the model that you're authorized to do something. And then if it says no, then you just, like, tell it that, no.

Brian Fehrman:

No. Really. Really. I'm authorized, or this is hypothetical, or whatever.

Derek Banks:

Well, that's what the Chinese were doing with the Anthropic models. And when they had their hacking operation uncovered, you know, Anthropic was saying that's basically what they were doing: essentially lying to the model, and small things like, yeah, we're authorized to do this. I mean, you know, the way LLMs work, it doesn't know. Right?

Derek Banks:

And so I think that's interesting, but I will say that, I mean, I use AI frontier models to hack, you know, all the time. And it's pretty much as easy as saying, I'm on an authorized pen test. Like, if you're using a chatbot. If you control the system prompt and are using an AI, yeah, I don't even really have to do that. Right? My system prompt basically says, you know, we're hacking.

Derek Banks:

Let's do it, kind of thing. And so I haven't run into an internal guardrail, but I know some of our other researchers have, doing, like, vulnerability research in the nuts and bolts of, like, Windows. They've gotten some internal guardrail stuff. But, yep.

Brian Fehrman:

Yeah. But, usually, it's just a matter of time of trying to get around it. I mean, the providers are coming out with their supposedly vetted, security-vendor-vetted offerings, where they vet you and then allow you access to models that have fewer guardrails in place, but we still don't know. I mean, is that a matter of they remove certain system prompts, put in certain system prompt instructions? Have they fine-tuned the model to try to internally remove them?

Brian Fehrman:

Are they doing their own ablation process? We don't really know at this point of what what that actually means, and does it make a difference?

Derek Banks:

I think we get a different system prompt when we're using their apps. I mean, that would be the easiest thing. Right?

Brian Fehrman:

Yeah. Yep.

Derek Banks:

I can't imagine that they have, like, a, well, now you get access to this other, you know, Opus 4.6. I mean, it's possible for sure.

Brian Fehrman:

Yeah. But from, like, a financial and business perspective, how much would it cost to actually make those changes?

Derek Banks:

And, like, from a routing-your-traffic-versus-someone-else's perspective, yeah, I just don't know. But, I mean, just so far in 2026, I haven't had a whole lot of pushback from any model doing any kind of security work. I think maybe I've just gotten used to, like, how I phrase my input. And then also maybe it helps using, like, an ongoing persistent agent with memory that remembers things for me. I mean, that probably helps too.

Brian Fehrman:

Yeah. The whole, it agreed to something earlier, so it'll continue agreeing to it.

Derek Banks:

It knows who I am, what I do. I have a personal context portfolio and a Telos file like Daniel Miessler, and, you know, I think that when I start sessions, all that stuff gets sent to Anthropic. And I will say that it probably also means that, you know, at least in the case of, like, Anthropic or OpenAI, when your queries come in and they flag something... because let's face it, they're probably doing classification on your input.

Derek Banks:

I would.

Brian Fehrman:

Oh. Yeah.

Derek Banks:

I mean Yep. How else do you catch the Chinese? Right? That doesn't mean you store it. That means they're just, you know, essentially running like an IDS kind of thing.

Derek Banks:

If they flag my account and see, oh, this is part of Black Hills Information Security, we know who they are, they're not, you know, doing things against the law, they might let that slide where another account might get banned. Right? So that's probably it, I think.

Derek Banks:

But they actually did say to us that they lowered our guardrails. I'm like, oh, well, thank you. No. Thanks. I appreciate that.

Brian Fehrman:

Thanks, Robert.

Derek Banks:

Yeah. Yeah. So the other thing that was in here that I thought was kind of neat was and and probably something that we all knew, but maybe we need to hear is do you think that this kinda lowers the bar? Yeah. Lowers the bar for people being able to do things like hacking or coding or whatever?

Derek Banks:

And so someone who might not have had the skills to run a global ransomware campaign can probably do that now.

Brian Fehrman:

Yeah. Yeah. It's a lot easier for them to put together all the pieces that they need to do for that rather than coming up with it from scratch. Because, I mean, coming up with that from scratch, that's I mean, how do you even start searching for that? Google, how do I create a global, you know, ransomware attack?

Derek Banks:

It's like in Office Space, like, how do I launder money? Like, I don't know.

Brian Fehrman:

Yeah. Exactly. But now you've got a source that'll kinda consolidate it for you if you ask it nicely in the right way.

Derek Banks:

Yep. So I think the last part, the so what? So 64% on real CVEs. I mean, I keep saying this: if you thought you had to keep things patched and up to date in the past, now more so than ever, especially externally. I think, you know, running things on the edge of your network is something that I would apply very much increased scrutiny to. If it were me and you asked my opinion on those kinds of things, I would say I would even do, you know, packet capture and network traffic analysis on the outside of your network. That is how the Palo Alto compromise...

Derek Banks:

Was it last year or was it the year before? Oh, the years run together now. That's how it was detected: running Zeek, essentially, on the outside and seeing that a Palo Alto firewall was running curl out to somewhere on the Internet and bringing back said thing, and that's typically not what you want your firewall doing, I wouldn't think. No. And so, you know, it's not perfect, but at least, you know, defense in depth kind of thing.
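As a rough illustration of that kind of edge visibility, the sketch below reads a Zeek `conn.log` in its default tab-separated format and flags connections that originate from an edge device and go somewhere unexpected. It is not a reconstruction of how the Palo Alto incident was actually detected. The addresses in `FIREWALL_IPS` and `EXPECTED_DESTS` and the sample log lines are made up for the example.

```python
# Hypothetical addresses: replace with your own edge devices and the
# update/licensing servers you actually expect them to contact.
FIREWALL_IPS = {"203.0.113.1"}
EXPECTED_DESTS = {"198.51.100.10"}


def unexpected_outbound(lines, firewalls=FIREWALL_IPS, expected=EXPECTED_DESTS):
    """Yield-free helper: return (source, destination, port) for firewall-originated
    connections whose destination is not on the expected list."""
    fields, hits = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Zeek names the columns in a '#fields' header line.
            fields = line.split("\t")[1:]
            continue
        if not line or line.startswith("#") or not fields:
            continue  # skip other header lines and anything before the field list
        row = dict(zip(fields, line.split("\t")))
        if row.get("id.orig_h") in firewalls and row.get("id.resp_h") not in expected:
            hits.append((row["id.orig_h"], row["id.resp_h"], row.get("id.resp_p")))
    return hits


# Made-up sample in conn.log layout (only a few of the real columns are shown).
sample = [
    "#fields\tts\tuid\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto",
    "1.0\tC1\t203.0.113.1\t50000\t198.51.100.10\t443\ttcp",
    "2.0\tC2\t203.0.113.1\t50001\t192.0.2.55\t80\ttcp",
]
alerts = unexpected_outbound(sample)
```

In practice you would pass an open file handle for the real log instead of `sample`, and the expected-destination list would need tuning to avoid noise.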

Derek Banks:

And then also, if you're a red team or if you're on the offensive side and you're not using AI to make your life better and easier and more efficient, you're missing out.

Brian Fehrman:

Oh, yeah. Absolutely. It's a necessity at this point. It's going to become the expectation, and if you're not utilizing it, you're certainly gonna fall behind without a doubt. It's already happening.

Derek Banks:

Yeah. And if you're in the unique position like Brian and I where you're trying to build an agentic AI penetration testing platform, it's certainly a lot harder than it sounds. It sounds easy because, you know, you see on Twitter, everybody and their brother has some, you know, agentic AI penetration testing framework out. But I would say that, you know, if you're building such a thing, I would, early on, introduce some kind of benchmark, because it's hard to determine, is it not finding anything because this customer is really good, or is it not finding anything because this is broken?

Brian Fehrman:

Yeah. Yeah. It's good to have some scientific results. Right? So you don't have moving targets and a lot of unknowns.

Brian Fehrman:

So that way, you can truly see: is this getting better, how can I improve it, and how well does it actually do? So I think benchmarks are absolutely essential. And it sounds like we've got, you know, the benchmark that we've spoken about here, and I'm sure we'll have plenty of others on the horizon that will come out.
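A fixed benchmark makes that comparison simple arithmetic: run the same tasks in each configuration and compare the fraction solved, which is how figures like the 21% and 64% quoted earlier are expressed. The sketch below assumes a hypothetical result format of `(mode, task_id, solved)` tuples, and the sample runs are invented, not data from the paper.

```python
from collections import defaultdict


def success_rates(results):
    """Map each mode to the fraction of its benchmark tasks that were solved.

    `results` is an iterable of (mode, task_id, solved) tuples; this record
    shape is an assumption made for the example.
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for mode, _task_id, ok in results:
        total[mode] += 1
        solved[mode] += bool(ok)
    return {mode: solved[mode] / total[mode] for mode in total}


# Invented runs over the same two tasks in two configurations.
runs = [
    ("autonomous", "task-1", True),
    ("autonomous", "task-2", False),
    ("assisted", "task-1", True),
    ("assisted", "task-2", True),
]
rates = success_rates(runs)
```

Because the task set stays constant, a drop in a mode's rate after a prompt or tooling change points at the agent rather than at a well-defended target.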

Derek Banks:

Alright. So I guess we'll wrap it up and say fully autonomous AI hacking is still mediocre. Human-in-the-loop AI hacking is already good and getting better. Like we said, if it was already good with o1 and GPT-4o, well, those are so last year.

Brian Fehrman:

So yesterday.

Derek Banks:

And so, yes, stuff's moving fast, so try and keep up.

Brian Fehrman:

With that, I hope you enjoyed, and keep on prompting.