Cyber Sentries: AI Insight to Cloud Security

Navigating the AI Security Landscape with Somesh Jha
In this Cyber Sentries episode, host John Richards interviews AI expert Somesh Jha on using AI for security. They discuss the promise and perils of AI in cybersecurity, best practices for implementation, challenges with fine-tuning models, and adopting a multi-agent approach.

Jha provides insights on the potential of AI to transform cloud security through automating tasks like intrusion detection. However, attackers could also weaponize AI for large-scale spear phishing. As the technology matures, it remains unclear exactly what will be possible. The episode covers common mistakes like applying AI too broadly, the need for careful benchmarking to avoid hallucinations, the large data requirements for fine-tuning models, and the benefits of a multi-agent framework.

Questions You May Have
  • How can AI be used for good and bad in cybersecurity?
  • What are some common mistakes when applying AI to security?
  • How can we evaluate if an AI model is working well for security?
Key Takeaways
  • AI can automate spear phishing at scale, but also help detect intrusions
  • Start with a narrow security problem before expanding AI to broader ones
  • Careful benchmarking is crucial to evaluate AI security tools
  • Beware of AI hallucinations - grounding techniques can help
  • Fine-tuning AI models requires large datasets to work well
  • Adopt a multi-agent approach when building AI applications
Jha advises starting with a focused security use case and doing careful benchmarking to demonstrate value before expanding AI more broadly. He notes the challenges of fine-tuning models with limited data. Jha explains how Langroid is designed around a multi-agent approach for maintainable and extensible AI code.

This episode provides insights for security teams on leveraging AI responsibly, with practical advice on implementation pitfalls. Jha offers perspectives on realizing the future potential of AI in cybersecurity. His expertise provides a useful guide for applying AI to security effectively.

Links & Notes
  • (00:00) - Welcome to Cyber Sentries
  • (32:45) - Wrap Up

Creators & Guests

Host
John Richards II
Head of Developer Relations @ Paladin Cloud. The avatar of non sequiturs. Passions: WordPress 🧑‍💻, cats 🐈‍⬛, food 🍱, boardgames ♟, a Jewish rabbi ✝️.

What is Cyber Sentries: AI Insight to Cloud Security?

Dive deep into AI's accelerating role in securing cloud environments to protect applications and data. In each episode, we showcase its potential to transform our approach to security in the face of an increasingly complex threat landscape. Tune in as we illuminate the complexities at the intersection of AI and security, a space where innovation meets continuous vigilance.

John Richards:
Welcome to Cyber Sentries from Paladin Cloud on TruStory FM. I'm your host, John Richards. Here, we explore the transformative potential of AI for cloud security. On this episode, I'm joined by Somesh Jha, a professor at the University of Wisconsin and co-founder of Langroid, an open source AI startup. We speak about how to benchmark success with AI models and discuss whether they'll ever achieve the security holy grail: identifying zero-day exploits before they happen. Let's dive in.
All right, everyone. Thanks for joining us. Today, we are joined by Somesh Jha. Super excited to have him here as our guest. Somesh, thank you for joining us. I would love to hear a little bit about what you're doing and how you got there. I know you're doing some incredibly exciting work.

Somesh Jha:
A little bit of background, just a short background and a little bit about the journey: I got my PhD from Carnegie Mellon in computer science, and my area then was something different, programming languages and formal methods. Gradually I got into security, and through that I got into trustworthy machine learning. The way to think about trustworthy machine learning is: what happens when there are bad guys present? If everybody in the world were good, we wouldn't need security. So the idea is, what can go wrong in machine learning systems, through privacy attacks and so on?
And that's what trustworthy machine learning deals with. In that journey, I got more into machine learning and LLMs, and we founded this company called Langroid, which is basically about building applications on top of LLMs, but with safety and security as one of the guiding principles behind it.

John Richards:
I mean, you've been doing this since before the recent wave of hype around artificial intelligence and machine learning, so you've seen this going for a while. What do you make of these recent changes? Where do you see the future of artificial intelligence going, especially where it intersects with cybersecurity? What's your vision? What would you like to see coming in that space?

Somesh Jha:
Yeah. So one thing that is very important to know is that the whole thing about LLMs, large language models, is still very new. So we don't know, when it matures, what shape it's going to take. I'll give you an analogy. I'm old enough to remember when the internet first arrived. It took a while. With new paradigm shifts, there's always a little bit of shakeout, a little bit of chaos before everything stabilizes. So it's very important to know that we still don't know, once this all stabilizes, what it's going to look like.
Now, on the security side, there are two ways to think about it. One is: how do you make applications and ecosystems that use LLMs safe from attackers? And the second, the dual use, is: how can attackers use LLMs to carry out their attacks in a much more sophisticated way and at a faster velocity? So let me give you an example of both. I call the first one security of AI, and the second one AI for security. It goes both ways. So for the first one, what happens if you think about a financial application which is powered by LLMs? Immediately there are privacy issues that come up.
Imagine a financial analyst has private data about their clients; what happens with the prompts there? The second is all these prompt injection attacks, where what happens is they essentially bypass the input filters in these models. So for example, if you tell GPT-4, "Hey, build me a bomb," it's going to say, "No, I can't do that." Now, if you change the prompt a little bit, rephrase it a little bit, add some tokens, it suddenly does it. Okay?
So I think this landscape still hasn't matured. What attacks are possible? How do you take an end-to-end view of a system that is powered by an LLM? Now, let's look at the second part. The second part is what happens with attackers. Some of the attacks that were slow will suddenly become very fast. Let me give you a very interesting example. Suppose I know John Richards is going to be in Las Vegas for a conference. You must have gotten these spear phishing emails that are very personalized, right? "Hey, John. I'm your friend X. I'm stranded in Las Vegas. You are also in Vegas. Please send me money."
I think everybody has gotten one of those. So now think about it this way: with things like GPT-4, and there are people already talking about this, I can scrape all the information. Say, "Oh, John, there is a Black Hat conference coming up." I take all the information about the Black Hat conference, the whole attendee list, also look at all the flights coming into Vegas, and I can suddenly send these emails at scale.

John Richards:
Yeah, it's terrifying.

Somesh Jha:
It's terrifying, right? I mean, the thing is, what used to require maybe 20 or 30 minutes of time, I can now just automate. So to me, that is also something we need to keep an eye out for: what are the attacks that will become very fast and sophisticated, so that the current defenses are going to have to adapt, or maybe we'll have to think about new defenses?

John Richards:
Now, do you think AI has a place in trying to counter that? I know I've got friends who are teachers and they're saying people are using AI to create these letters, and then they're like, "Well, how can I use AI to identify if AI created that?" So do you think there's a space here for the folks trying to counter this to use AI? Because you're going to be working at a scale now that may be completely different than you've had to deal with in the past.

Somesh Jha:
So what you're talking about is what I call the large language model Turing test: did this media come out of an LLM, or is it natural? Right? Now, there are systems out there like GPTZero and so on to do that. But the problem is that those things are very easily bypassable. So for example, imagine a student gets an essay written by GPT-4, but then they just paraphrase it a little bit, change it a little bit, and there are automatic paraphrasers out there, okay? This doesn't need to be a manual step, and it keeps paraphrasing until it's not detected.
So it's going to be a cat and mouse game. But one potential glimmer of hope is the executive order from Biden has talked about something called watermarking. Watermarking, what it does is it inserts some signal in your output that depends on a secret key that the provider might know, like OpenAI, and then you use that secret key to detect the signal.
So these are called watermarks, okay? And I think that, to me, is probably a good way to maybe test it out and see. But there, also, there are some attacks. So long story short, it's a good idea, but we don't know yet what is possible, how brittle they are, and so on. For example, in your student example, we don't want to falsely accuse a student. So the false positives here have to be at a very low level. You don't want to tell a student, "Oh, ChatGPT wrote this," and have them say, "Hey, no, I didn't do that." So these things have to be tuned to a very low false positive rate, which is very challenging.
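
For listeners curious how key-based detection can work in principle, here is a toy sketch of one well-known style of scheme, where a keyed hash defines a "green list" of tokens that watermarked generation favors. This is not any provider's actual method; the tokenization, key, and threshold below are all assumptions for illustration.

```python
# Toy illustration of key-based watermark detection (not any provider's real scheme).
# Assumption: generation was biased toward "green" tokens chosen by a keyed hash,
# so watermarked text shows an unusually high fraction of green tokens.
import hashlib

SECRET_KEY = b"provider-secret"  # hypothetical key held only by the model provider

def is_green(token: str, key: bytes = SECRET_KEY) -> bool:
    """Deterministically assign roughly half the vocabulary to a keyed 'green list'."""
    digest = hashlib.sha256(key + token.lower().encode()).digest()
    return digest[0] % 2 == 0

def green_fraction(text: str) -> float:
    tokens = text.split()  # crude whitespace tokenization, for illustration only
    if not tokens:
        return 0.0
    return sum(is_green(t) for t in tokens) / len(tokens)

def looks_watermarked(text: str, threshold: float = 0.65) -> bool:
    # Unwatermarked text should hover near 0.5; a high threshold keeps the
    # false accusations discussed above rare.
    return green_fraction(text) > threshold
```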

John Richards:
Now, you had talked about how AI can scrape a lot of data that can be used for a negative, but we're seeing a lot of organizations trying to figure out if they can use this on the positive side. We get a ton of data from different maybe security tools or just looking through logs and networking traffic. What do you see as the possibilities there for trying to use AI to enhance security, identify possible threats, things like that? Because it's so much work for humans to try and dig through this information that it's often just ignored.

Somesh Jha:
So I think that is a great idea. One of my colleagues said something like: imagine pen testing. When you do pen testing, eventually what happens? You have a bunch of vulnerabilities, and the pen tester is writing some attacks against them until eventually they get tired and say, "Okay, I'm done with my pen testing." So one thing very good about these tools is that, in some sense, you can think of them as a medium-level analyst that never gets tired. So one thing would be to automate some of this pen testing, and there, demonstrations from really good pen testers can be ingested into these models or used for fine-tuning, so that they have some of those examples to learn from.
So for example, I collect these scenarios from experts, and either I can fine-tune on them or keep them in my database so that they can be used in K-shot prompting. So that's a great idea. But to me, what will be interesting, John, is: does it bring new capability, or is it just making what we have faster? That is not clear to me. Is it somehow going to be able to detect some attacks or intrusions that we weren't able to detect before? Or is it just doing what we already did, very efficiently? That, to me, is the big open question here.
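
As a rough sketch of the K-shot prompting idea mentioned here, the snippet below places a couple of expert-written pen-test write-ups into the prompt as in-context examples before asking about a new finding. It assumes an OpenAI-compatible chat API; the example findings and model name are made-up placeholders.

```python
# Sketch of K-shot prompting with expert pen-test write-ups as in-context examples.
# The findings and the model name below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

expert_examples = [
    {"finding": "Login form reflects the 'next' parameter into the page unescaped.",
     "write_up": "Likely reflected XSS; confirm with a benign payload, report as High."},
    {"finding": "S3 bucket 'acme-backups' allows anonymous ListBucket.",
     "write_up": "Public data exposure; enumerate read-only, report as Critical."},
]

new_finding = "Internal admin panel reachable from the internet without MFA."

messages = [{"role": "system",
             "content": "You are a pen-test analyst. Follow the style of the examples."}]
for ex in expert_examples:  # the K 'shots'
    messages.append({"role": "user", "content": f"Finding: {ex['finding']}"})
    messages.append({"role": "assistant", "content": ex["write_up"]})
messages.append({"role": "user", "content": f"Finding: {new_finding}"})

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```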

John Richards:
And that gets back to what you're saying where it's a little chaotic. It hasn't settled out yet.

Somesh Jha:
Exactly.

John Richards:
And we don't know where this will land, whether new findings can be found or whether we're just scaling up what we're already doing.

Somesh Jha:
And I think one thing, you've probably seen this, is that there's so much noise in this space. Everybody and their mother are saying that they're an LLM expert. Even from LinkedIn posts, I find it very hard to tell who really knows their stuff and who doesn't, right? So I think that's another big challenge here: just because it's so new, there are also a lot of borderline con artists claiming to be experts when they're not.
There's so much noise out there, but I agree with you that in maybe a year or two, or whenever this stabilizes, the real players will emerge.

John Richards:
Yeah. There's a space now where if you see a challenge, there's like, "Well, can we throw AI at it?"

Somesh Jha:
Exactly.

John Richards:
What are the challenges to get there? I don't know if you could share: when people say, "Hey, here's a problem, I'll throw AI at it," what are the things they're maybe not thinking about? What makes it challenging to get a model to the spot where it starts to provide value, even if you think you've found a good use case for it?

Somesh Jha:
So let me give you an example. In security, we have used pen testing, dynamic analysis, fuzzing, and static analysis quite a bit, right? And a lot of big companies have developed very good tools for these things. I think the key will be, first of all, as you hinted at earlier: can we take some of the data generated by these tools and do some instruction fine-tuning or something like that, right?
Now, the second part, and this to me would require some very careful benchmarking, is: will they be able to detect attacks that are brand new, that we were not able to detect before? Because that's the holy grail in security. The holy grail in security is: can you detect zero-day exploits before they actually go live? Right? And I think that question is still open. We don't know. I think it will require a lot of benchmarking, maybe building an end-to-end system and so on, which I think you guys are doing a very good job at, at Paladin.

John Richards:
Well, thank you. Yeah, I mean, we're exploring this along with everyone else and trying to see, "Hey, where will this land? How can we use this to be beneficial?" But at the same time, as you mentioned, being aware of the risks, and taking this kind of crawl, walk, run approach of how do you dig into this and make the data useful? You don't want to just add new data points and overwhelm folks while not actually providing value there.

Somesh Jha:
What you just said is absolutely the point. The point is that I'm already seeing posts saying, "Hey, in my use case, I'm not seeing value in LLMs. They didn't do as well as I thought." So I think what we need is very careful benchmarking for these problems. Where is the capability at? Now, one good thing is these models keep improving; GPT-4 Turbo is already a little better than GPT-4. So I'm assuming that these things will get better and better and better, but we need a way to really benchmark them on security attacks and data in a very unbiased way.
So we almost need something like an AV-TEST-type organization for LLMs in security, because you don't want these companies to do it themselves. You want to be able to say, "Okay, I hired this other company, I'm using LLMs to identify this attack. Am I really finding new attacks here?" But again, I keep coming back to this: I don't think we have settled yet. I mean, think about it, ChatGPT was released what, seven, eight months back? So we are still in a very, very early stage. It's hard to believe, especially after the last 10 days of all the drama at OpenAI, right?

John Richards:
Yeah.

Somesh Jha:
So we are still very early in this game, to be honest with you. We don't know.

John Richards:
Now, what do you think will help organizations that are doing this? You've got a lot of folks rushing in to do this. What do you think will be the thing that lets an organization stand out from competitors? Is it having the best model, or is it more about execution and implementation? Where do you see the differentiators as this area matures?

Somesh Jha:
I see a lot of companies just saying, "Hey, we are going to throw LLMs at everything." It's kind of a kitchen sink approach. To me, it's very important to first identify a specific problem to apply the LLM to. For example, maybe detecting malicious JavaScript or something like that, something that has a narrower scope. There, you can also do a lot of interesting benchmarking and tune the system much better. So I think that's very important.
I see this thing where you apply an LLM to such a broad problem that it's very hard to then say, "Okay, am I getting any value out of it?" So what I think will be important for companies is to pick two or three specific use cases, apply LLMs to them, and benchmark them really well so that you can show value. Because this kitchen sink idea of "Oh, we'll do everything using LLMs," maybe that's not true. Sometimes you might have to use LLMs alongside other tools that you already have. So I would say go narrow and then broaden, rather than go broad first.

John Richards:
That's really good advice. I mean, you've mentioned benchmarking here a few times. Any advice for teams that are out there doing this, on what kind of benchmarks they should be looking at? It can be challenging to say, "Is this a really good, accurate result, or was this just an AI hallucination?" So how do you build in some benchmarks to be able to say, "Hey, we are bringing some really great value here"?

Somesh Jha:
So if you look at benchmarks, there are some out there, if you look on the web, for RAG agents and other LLM scenarios. But they're only in certain domains like document chat and code chat, some of the scenarios that were very popular in the beginning. What you need to do is look at those benchmarking efforts and create a benchmark for your problem. A lot of these benchmarks are very carefully done so that they have enough diversity of examples. Then what you can also do is take some prompt, change it a little bit, and see what happens.
So my view is that you should look at the benchmarking efforts in these other domains, but then use them to inform and derive the benchmark for your own domain. Now, you talked about hallucinations. Hallucination is a big problem here. For example, you don't want to say something is an intrusion or malware and later find out it's not, after you've just spent a lot of time on it. So there is a whole field in artificial intelligence, in LLMs, called grounding, okay?
Grounding is a field a lot of people are working on. It's still, again, in its infancy, I would say. There are two strands of work here. One is detecting hallucinations. Let me give you a very simple example. You can have a prompt and it gives you some answer. Now, if I paraphrase it slightly, but don't change the meaning, the output shouldn't be too different. So they do some stress testing.
The other is using fine-tuning to encourage grounded answers, where there is some evidence in the document or the data. So there are people working on grounding, which is basically handling the problem of hallucinations, but it's unclear how mature it is for industry use. A lot of people are doing research on it and so on.
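
A minimal sketch of the paraphrase stress test described above: ask the same question phrased two ways and flag the answer if the responses diverge. The model name, the crude word-overlap similarity check, and the threshold are placeholder assumptions, not a production grounding method.

```python
# Sketch of a paraphrase-consistency probe for hallucination detection.
# Assumes an OpenAI-compatible API; the model name and similarity check are placeholders.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def overlap(a: str, b: str) -> float:
    # Crude word-overlap similarity; real systems use stronger semantic checks.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

original = "Which CVE does this log excerpt most likely indicate? <log snippet here>"
paraphrase = "Based on this log excerpt, what CVE is most likely involved? <log snippet here>"

a1, a2 = ask(original), ask(paraphrase)
if overlap(a1, a2) < 0.5:  # answers diverge: treat the result as unreliable
    print("Inconsistent answers; flag for human review:", a1, "|", a2)
else:
    print("Consistent answer:", a1)
```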

John Richards:
It feels like that also gets back to, I mean, you said being specific: the more you can narrow this down, the more you can tell, and the broader you are, the harder it is to create those benchmarks. Also, it's a lot harder to know whether what you're getting is fully accurate.

Somesh Jha:
Yeah, exactly. Let me give you my JavaScript example. Suppose you have a JavaScript file and an LLM-based tool that says whether it's malicious or not. Now I make very small tweaks to that JavaScript file, tweaks which should not change whether it's malicious, and it suddenly says it's not. Well, there is some brittleness there. So it's very important, as I said before: start narrow, then go broad, rather than going broad, having things not work, and then being left asking, "Okay, what do I do?"

John Richards:
Yeah. It's a lot harder to clean up the mess than... Okay, so for folks out there looking to get into this, maybe they're developers or they've got a new startup, and they're like, "We've got a problem. We think AI would be helpful for this." Where do you recommend they start as they're trying to choose which LLM to use, how best to integrate with it, and how to train it? Any advice for folks looking to get into that space?

Somesh Jha:
So initially, what I would do is... Let me talk a little bit about Langroid, right? Langroid is a multi-agent framework for building applications on top of LLMs. And there are a lot of frameworks out there. There's one which we use called LiteLLM. They're doing great work: basically, they provide OpenAI-style API access to multiple LLMs. So if you've written your code for GPT-4, with very few tweaks it works with any LLM. Okay?
So for example, Langroid works with any LLM that LiteLLM supports. Now, obviously you have to get keys for these things, GPT-4, Llama, and so on and so forth. So what I generally tell people is to first start with... For example, even in Langroid, we have some agents like doc chat (document chat), code chat, and SQL chat. These are very mature scenarios for LLMs, because they were the first ones people were looking at. I would start with that, really understand it, and then try to pick a use case that is new but not too far away from those, because we know that those work.
So for example, we are working on something that analyzes EHR data using Langroid, and we wrote a little thing where we could ingest EHR data into a SQL table. So now we can use our SQL chat agent. Something like that, and then you gradually expand. So my strategy would be to start out with scenarios that are very common, and Langroid has quite a few of them. Then, once you understand those, expand to your own use case, because there are enough acrobatics that you have to do: "Oh, how big should the chunk size be? How should I retrieve this and that?" Once you understand this, it will inform your own system. So that's how I would start.
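
To make the "swap the model with very few tweaks" point concrete, here is a small sketch using LiteLLM's OpenAI-style completion call. The model identifier strings and the prompt are placeholders; valid names depend on which providers and keys you have set up.

```python
# Sketch: one OpenAI-style call, different backends, via LiteLLM.
# Model name strings are placeholders; valid names depend on your providers and keys.
from litellm import completion

messages = [{"role": "user",
             "content": "Summarize the key fields in this firewall log line: ..."}]

# Same code path for a hosted model...
gpt_reply = completion(model="gpt-4", messages=messages)

# ...and for a locally served open model (e.g., via Ollama), changing only the name.
llama_reply = completion(model="ollama/llama2", messages=messages)

print(gpt_reply.choices[0].message.content)
print(llama_reply.choices[0].message.content)
```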

John Richards:
I've heard a lot of folks are trying to create or modify their own LLMs. Do you think it's better? I mean, it sounds like a little bit of what you're saying here is that you're better off using something that's already been sorted out, understanding the process of getting and pulling that data out, and then expanding and maybe adjusting it for your own needs, versus jumping into the deep end and saying, "We're going to start messing around with our own model right away."

Somesh Jha:
So there are two things. What you're talking about is what is called fine-tuning a model. Let's say Llama, right? What you would do is start with the weights of Llama and train it for your use case. I'll give my JavaScript example again. You give it a labeled dataset: oh, this JavaScript is benign, this one is malicious, and so on, and you train Llama on top of that, and then suddenly it can do predictions, right?
There are two problems here. There are pretty good fine-tuning strategies now; the state of the art is something called low-rank updates, which is quite fast. But I think the big question mark is that these models are huge and have a lot of weights. How much data do you need so that you really start moving the weights and get enough classification accuracy? That we don't know yet. For some tasks, yes, you can get reasonable results. For other tasks you might need millions of data items. So that is the key: with fine-tuning, whether you get the accuracy you need depends on the task.
So the short answer is, you have to just do it and see. Sometimes, if your data is small, or even smallish, tens of thousands of examples, the fine-tuning just might not move the weights enough to give you good classification accuracy. The analogy I use is to think about Llama or Mistral, or these things, as a big ship: you need a lot of momentum to change its direction. So a little bit of data is not going to change the weights that much, and then the accuracy is not going to improve much.
So it all depends on the domain and how much data you have. That's, to me, the biggest thing. So you can try it, but in general, my strategy is: if something is giving you very good accuracy without fine-tuning, like using RAG or K-shot prompting, why not go with that? Those are much more flexible engines than fine-tuning something. Because the other thing, John, is that a lot of people don't know all the ins and outs, like what the learning rate should be and so on. Now, there are services that do that, but you can still fall into a lot of pitfalls.
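
For a sense of what the low-rank update approach (commonly known as LoRA) looks like in code, here is a heavily abridged sketch using the Hugging Face PEFT library. The base model name, target modules, and hyperparameters are placeholders, and the labeled dataset and training loop are omitted.

```python
# Abridged LoRA fine-tuning sketch with Hugging Face PEFT.
# Assumptions: placeholder model name and target modules; a real labeled dataset
# (e.g., benign vs. malicious JavaScript) and a Trainer loop are omitted here.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM you have access to
tokenizer = AutoTokenizer.from_pretrained(base)  # used later to tokenize your dataset
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trained

# From here you would tokenize your labeled examples and run a standard training
# loop; whether the weights move enough depends on how much data you have.
```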

John Richards:
Is there any kind of resources you recommend for folks trying to tackle that or learn those pieces?

Somesh Jha:
Yeah. There are a lot of GitHub resources out there for fine-tuning. I would just start with those. The low-rank update approach I talked about has a GitHub repo. Just start with that, fine-tune, and see whether you can even replicate their results. Sometimes, John, you must have had this happen, you download some GitHub repo and it just doesn't work, so you have to be careful. But yeah, I would just start with one of these popular fine-tuning repos, or a framework like Langroid, fine-tune, and then maybe gradually expand to fine-tuning on your own data. Because fine-tuning is one of those things where you have to have quite a bit of data, and it also depends on the task. Some tasks are easier, some are more difficult.

John Richards:
Tell me a little bit about how Langroid is positioning itself to help developers trying to figure this out. I know you've got an open source approach as well. So what's the core problem developers are hitting as they try to interact with these LLMs that Langroid is able to help solve?

Somesh Jha:
With Langroid, what we realized is that the best way to program on top of LLMs is what I call multi-agent programming. Think of it this way: in an organization, what do you do? If I ask you a complex question, let's say in legal or even in security, you say, okay, this is a pretty complex question, let me break it down. And then, for each piece, I will ask the right person. For the JavaScript part, I will ask the JavaScript expert in my company, and for maybe intrusions, I will ask somebody else, and so on.
So Langroid says that this is how you should build applications as well: as multiple agents conversing with each other. So this is our DNA. This is a design principle that goes through our entire system, and that's why it's very clean. I always give people this analogy: why is Linux so clean? Because Linus Torvalds has a very clear idea of how the whole thing should be structured. You'll see some people try to contribute code to Linux, and he'll say, "No, this doesn't belong in the kernel."
So with Langroid, it was very important to us that this multi-agent framework is the design beacon that runs through our whole design. So our design is very clean. Now, there are a lot of other open source things for building LLM-powered apps. What happens there is that they initially started out as some prompt-chaining-type framework, and then somebody said, "Oh, it would be nice to have this. Okay, let me add that. Let me add this. Let me add this."
We got away from that. We are saying, "No, our design principle is that all LLM-powered programming is multi-agent programming, and everything in our design is going to follow that." Okay. Now, I cannot name the company, but one of the companies that used Langroid basically said they were so happy that it's so clean. Because, John, you know that clean code also means more maintainable and understandable code.
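
As a generic illustration of the decompose-and-delegate idea described here, and not Langroid's actual API, here is a minimal sketch with plain Python classes. The agent roles and the keyword routing are invented for the example; in a real system the coordinator and specialists would each call an LLM with role-specific prompts.

```python
# Generic multi-agent delegation sketch (illustrative only; not Langroid's API).
# A coordinator splits work and routes each piece to a specialist agent.
from dataclasses import dataclass

@dataclass
class SpecialistAgent:
    name: str
    expertise: str

    def answer(self, question: str) -> str:
        # In a real system this would call an LLM with a role-specific prompt.
        return f"[{self.name}] analysis of: {question}"

class Coordinator:
    def __init__(self, specialists: dict[str, SpecialistAgent]):
        self.specialists = specialists

    def route(self, question: str) -> str:
        # Toy keyword routing; a real coordinator might itself be an LLM agent.
        for topic, agent in self.specialists.items():
            if topic in question.lower():
                return agent.answer(question)
        return "No specialist found; escalate to a human."

coordinator = Coordinator({
    "javascript": SpecialistAgent("JS-Agent", "malicious JavaScript"),
    "intrusion": SpecialistAgent("IDS-Agent", "intrusion detection"),
})

print(coordinator.route("Is this JavaScript snippet malicious?"))
print(coordinator.route("Review this intrusion alert from the IDS."))
```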

John Richards:
Yes. You don't want that tech debt.

Somesh Jha:
If you have really spaghetti code, goto statements, all the jumping around, right? So I think that was our design principle. Now, we are expanding from that and saying, "Okay, the basic framework is open source. Anybody can use it. We even take requests." For example, somebody requested an integration with Pinecone. Pinecone is a vector database, and it took us very little time to add, especially because the framework is so simple and so well-structured. What we are trying to do now is think about what an enterprise person would need for these kinds of LLM-powered applications.
So one thing, for example, that we're going to release very soon, we have not decided in what form, is privacy for prompts. Suppose you have a prompt which has private data in it, a social security number, somebody's name, and so on and so forth. You could just encrypt those values, but then it completely messes with the in-context learning of the LLM.
So what we wanted to tackle there is that we need to change the prompt enough to hide the sensitive data, but not so much that it totally screws up the LLM response, okay? So we have something like that, a prompt sanitizer. And then we have some very nice ways of handling token cost, okay? Token cost is going to be a big problem, right? Think of a long-running product using GPT-4 24/7. Remember, every input token and output token costs money. So maybe for simple things that you just do for fun it doesn't matter that much, but in a long-running product, it becomes a huge cost.
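
Here is a rough sketch of the kind of prompt sanitization described above: sensitive values are swapped for stable placeholders before the LLM call and restored afterwards, so the prompt keeps enough structure for in-context learning. The regex patterns and placeholder scheme are illustrative assumptions, not Langroid's released implementation.

```python
# Illustrative prompt sanitizer: mask sensitive values with placeholders that
# preserve structure for in-context learning, then restore them in the response.
# The patterns below are simplistic examples, not Langroid's actual feature.
import re

PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def sanitize(prompt: str):
    mapping = {}
    for label, pattern in PATTERNS.items():
        for i, match in enumerate(pattern.findall(prompt)):
            placeholder = f"<{label}_{i}>"
            mapping[placeholder] = match
            prompt = prompt.replace(match, placeholder)
    return prompt, mapping

def restore(text: str, mapping: dict) -> str:
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

safe_prompt, mapping = sanitize(
    "Summarize the account notes for jane@example.com, SSN 123-45-6789."
)
# Send safe_prompt to the LLM, then map placeholders in the reply back:
# reply = restore(llm_response, mapping)
print(safe_prompt)  # sensitive values replaced by <EMAIL_0>, <SSN_0>
```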

John Richards:
Right. As you scale up, costs are going to go along with it.

Somesh Jha:
Exactly. Think of a big company that is running this 24/7 and getting hit constantly. So how do you handle these token costs? And the third thing, which you already pointed out, is grounding: how do you detect hallucinations and fix them? So our strategy with Langroid is: we have the open source framework, guided by this design beacon of multi-agent programming, and then we will have components that are more suitable for, say, an enterprise-level customer.
So that's kind of what we are thinking. Yeah, we really want people to use it. We want the community to grow. So if somebody sees this podcast and wants to use it, please go ahead. If you just Google "Langroid LLM," you'll see the GitHub page. Please download it. And as I said, we are really good at fixing bugs and so on and so forth.

John Richards:
Yes. If you're out there looking to get into this space, looking to start your LLM journey, please check out Langroid. Somesh, thank you so much for being a guest on here. It was such a pleasure to get to talk to you. So informative. I look forward to what all you guys are going to continue to do over there at Langroid, and look forward to the next time we get to connect.

Somesh Jha:
Yeah. It was a pleasure talking to you, and you asked great questions, John. I was really impressed.

John Richards:
Thanks so much, Somesh. Have a good one.

Somesh Jha:
Okay. You too.

John Richards:
This podcast is made possible by Paladin Cloud, a prioritization engine for cloud security. Are you struggling under never-ending security alerts? You can reduce alert fatigue with Paladin Cloud. It correlates and risk scores findings across your existing tools, empowering teams to identify, prioritize, and remediate the most important security risks. If you'd like to know more, visit paladincloud.io.
Thank you for tuning in to Cyber Sentries. I'm your host, John Richards. This has been a production of TruStory FM. Audio Engineering by Andy Nelson, music by Amit Sagie. You can find all the links in the show notes. We appreciate you downloading and listening to this show. We're just starting out, so please leave a like and review. It helps us to get the word out. We'll be back January 10th, right here on Cyber Sentries.