
Red teams and AI: 5 ways to use LLMs for penetration testing
Red teams can harness the power of LLMs for penetration testing. From session analysis to payload crafting, discover five ways AI transforms security testing.
Large language models, such as ChatGPT, Gemini and Claude, are redefining how people obtain information and perform their daily tasks. The cybersecurity industry is no different. Teams are using LLMs for everything from security operations center automation to phishing defense and security awareness training.
One particular area where LLMs shine is helping practitioners analyze the security of applications -- specifically in supporting red team activities. LLM-based tools and plugins are already paying dividends. Among them are tools that analyze HTTP stream information exported -- e.g., via a context menu -- from testing apps such as Burp Suite or Zed Attack Proxy (ZAP), and tools that sit in the proxy chain to offload requests and responses in bulk for LLM review.
Even without special-purpose tools, though, the human-readable nature of HTTP, combined with its predictable structure, makes it particularly well suited for LLM analysis. Yet, as with anything related to new technology, it can be difficult to know where and how to start. To that end, let's examine a few ways to use LLMs for penetration testing.
But first, a couple of quick caveats:
- Be aware of both terms of service and guardrails. Each LLM might have different rules about what is allowed and what constitutes acceptable use. Stay informed of those constraints to ensure you adhere to them. Some LLMs have guardrails that gate use even if you're following the rules. Others might filter information they decide could potentially be sensitive in a different context -- for example, non-authentication fields within a JSON Web Token (JWT).
- The five use cases detailed below are not intended to be exhaustive; they are not the only potential deployments. They were chosen because they are generally applicable under most test conditions and because they reliably add significant value. You might have needs or circumstances not covered here.
1. Session state and login flow
Analyzing application state maintenance is a great way to use an LLM for pen testing. The model can help establish state -- such as login flow -- as well as artifacts used to maintain it, among them Security Assertion Markup Language assertions, bearer tokens, universally unique identifiers, JWTs, session cookies and document object model artifacts.
It's not always easy for humans to decode this. Cutting and pasting the raw request and response blocks for login requests -- headers and bodies alike -- into an LLM can provide quite a bit of useful information. Even when practitioners can't just cut and paste one request -- for example, when login exchanges span multiple requests -- they can still get value here. ZAP, Burp and other popular tools let professionals export these as text files or HTTP Archive (HAR) files that the LLM can analyze later.
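For instance, a short script can pull the relevant exchanges out of a HAR export and package them as a single prompt. The sketch below is a minimal example: the file name login_flow.har and the "login" URL filter are placeholder assumptions to adjust for the application under test, and the resulting prompt is simply printed for pasting into whichever LLM you use.

```python
# Minimal sketch: extract login-related request/response pairs from a HAR
# export (e.g., from ZAP or Burp) and build one prompt for LLM analysis.
# "login_flow.har" and the "login" URL filter are illustrative placeholders.
import json

with open("login_flow.har", encoding="utf-8") as f:
    har = json.load(f)

chunks = []
for entry in har["log"]["entries"]:
    req, resp = entry["request"], entry["response"]
    if "login" not in req["url"].lower():
        continue  # adjust this filter to the app's actual auth endpoints
    req_headers = "\n".join(f"{h['name']}: {h['value']}" for h in req["headers"])
    resp_headers = "\n".join(f"{h['name']}: {h['value']}" for h in resp["headers"])
    body = req.get("postData", {}).get("text", "")
    chunks.append(
        f"REQUEST {req['method']} {req['url']}\n{req_headers}\n\n{body}\n\n"
        f"RESPONSE {resp['status']}\n{resp_headers}\n"
    )

prompt = (
    "Analyze this login exchange. Identify how session state is established "
    "and maintained (cookies, JWTs, bearer tokens, SAML assertions) and flag "
    "anything unusual:\n\n" + "\n---\n".join(chunks)
)
print(prompt)
```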
One important note: While most reasoning models can unpack and analyze even encoded artifacts -- for example, URL-encoded, Base64-encoded or hex-encoded values -- more complex data structures and multiple levels of encoding increase the chance that the LLM will hallucinate and return inaccurate data. This is particularly true of smaller and self-hosted reasoning models.
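One way to keep the model honest is to decode the artifact locally and compare. The sketch below uses only the Python standard library to decode a JWT's header and payload without verifying the signature; the token shown is a placeholder to swap for one captured during testing.

```python
# Minimal sketch: decode a JWT's header and payload locally (no signature
# verification) to cross-check what an LLM reports about the token.
# The token below is a placeholder; substitute one captured during testing.
import base64
import json

token = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0In0.sig"  # placeholder

def b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url-encoded without padding; restore it first.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

header_b64, payload_b64, _signature = token.split(".")
print(json.loads(b64url_decode(header_b64)))
print(json.loads(b64url_decode(payload_b64)))
```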
2. Reverse-engineering site composition
Login and state maintenance ranks first in this list because it is where many issues can occur. Consider how many of the OWASP Top 10 -- and in particular, its API Top 10 -- relate to authentication, authorization and state. That said, state maintenance likely isn't the most commonly performed task. That honor goes to identifying site architecture and construction -- a step required during each pen test, and in many cases, for multiple components in each test.
LLMs can play a significant role here: A multitude of potential combinations define how a given site is built. Sites can have a mix of different application scaffolding strategies, middleware, PaaS, APIs, languages and other factors. It's almost impossible for any individual tester, no matter how experienced, to recognize them all at a glance. A tester might today work with a React front end and a Scala-based Play Framework back end, and tomorrow wrestle with a GraphQL-heavy Node.js app or a Python back end built on Django.
It's a significant amount of work to reverse-engineer how a given application is built, understand how pieces fit together and research specific questions about its architecture. It's also a great opportunity to harness an LLM to make this task easier.
Supply an LLM with requests and responses, along with data scraped from the site -- for example, a capture of the HTTP stream, or output from Wget or Playwright -- via retrieval-augmented generation (RAG). This could take the form of project files in a commercial LLM or local data files in an internally hosted model.
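Even a simple collector script can gather the signals an LLM uses to fingerprint site composition. The following sketch is illustrative: it assumes you have permission to test the hypothetical https://target.example, uses the third-party requests library, and writes headers, cookies and script/meta tags to a text file suitable for attaching as project or RAG context.

```python
# Rough sketch: collect fingerprinting signals from one page -- response
# headers, cookies, and script/generator tags -- into a text file that can
# be supplied to an LLM as project/RAG context. The URL is a placeholder.
import re
import requests

url = "https://target.example/"  # hypothetical, in-scope target
resp = requests.get(url, timeout=10)

lines = [f"URL: {url}", f"Status: {resp.status_code}", "", "Headers:"]
lines += [f"  {k}: {v}" for k, v in resp.headers.items()]
lines += ["", "Cookies:"]
lines += [f"  {c.name}={c.value}" for c in resp.cookies]
lines += ["", "Script and generator tags:"]
lines += re.findall(r"<script[^>]*src=[^>]+>|<meta[^>]*generator[^>]*>", resp.text, re.I)

with open("site_fingerprint.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))
```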
3. Identifying legacy components
Using an LLM for pen testing also helps those looking for problematic, legacy, vulnerable or sunsetted components within an application. Consider a site built on WordPress. Identifying which plugins and themes are in use and cross-referencing them against vulnerable versions can be a pain, even when using special-purpose tools such as WPScan.
And that's just WordPress. Similar potential issues occur with almost every page. Legacy versions of libraries such as jQuery, Angular or Handlebars -- not to mention smaller or special-purpose libraries -- can be a significant security headache. An LLM can help identify those that are out of date and, more importantly, those that might present a possible attack path for the application.
LLMs are particularly effective here because they can pinpoint vulnerable versions of libraries more readily than a human can -- and without explicit version strings -- based on clues such as syntactic differences in how specific API methods are called or the use of deprecated functions. An LLM might see a call to the .live() method in jQuery and correctly note that the method was deprecated and later removed, meaning the version in use is old enough to carry known cross-site scripting (XSS) vulnerabilities fixed in later releases. The LLM delivers in minutes what otherwise might take professionals hours to research -- or worse, what they might miss entirely.
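A lightweight local pre-pass can narrow down which scripts are worth handing to the model. The sketch below greps downloaded JavaScript for a few well-known removed jQuery APIs; the pattern list is illustrative rather than a complete signature database, and the ./js/ directory is assumed to hold files saved with a tool such as Wget.

```python
# Minimal sketch: flag deprecated/removed library calls in locally saved
# JavaScript as a pre-pass before LLM review. The patterns are a small,
# illustrative list, not a complete signature database.
import pathlib
import re

DEPRECATED_PATTERNS = {
    r"\.live\s*\(": "jQuery .live() (removed in jQuery 1.9)",
    r"\$\.browser\b": "jQuery $.browser (removed in jQuery 1.9)",
    r"\.andSelf\s*\(": "jQuery .andSelf() (removed in jQuery 3.0)",
}

for path in pathlib.Path("js").rglob("*.js"):
    text = path.read_text(encoding="utf-8", errors="ignore")
    for pattern, note in DEPRECATED_PATTERNS.items():
        if re.search(pattern, text):
            print(f"{path}: {note}")
```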
4. Reverse-engineering minified code
Minified code generates more hours of frustration than just about any other issue in the application space. For a time-bound test, unpacking and analyzing minified code is a major time sink and something many testers avoid unless absolutely necessary. Even then, time constraints -- for example, a test with a capped number of hours -- might prevent thoroughness.
While tools that help inflate and unpack minified code exist, in many cases, the expansion relates mostly to spacing. But it's still difficult to get back to something a person can read when variable and function names are left completely opaque. LLMs have no such constraint. They can help unpack and understand minified code in a way that is difficult to accomplish otherwise. For example, an LLM might identify a minified function that parses a JWT and returns user.admin without checking the signature -- even if that function is named q() and the variable names are meaningless.
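Because minified bundles often exceed a single prompt, it can help to chunk them mechanically before asking the model for readable names. The sketch below does just that; app.min.js and the 8,000-character chunk size are illustrative assumptions, and the generated prompt files are meant to be pasted into whichever LLM you use.

```python
# Minimal sketch: split a minified bundle into prompt-sized chunks and ask
# for readable names plus any security-relevant logic. The file name and
# chunk size are illustrative choices.
import pathlib

source = pathlib.Path("app.min.js").read_text(encoding="utf-8")
CHUNK = 8000

for i in range(0, len(source), CHUNK):
    prompt = (
        "Rewrite this minified JavaScript with descriptive variable and "
        "function names, and point out any authentication, token handling "
        "or input validation logic:\n\n" + source[i:i + CHUNK]
    )
    pathlib.Path(f"deminify_prompt_{i // CHUNK:03d}.txt").write_text(prompt, encoding="utf-8")
```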
Note that most LLMs, even smaller models, are accurate with standard libraries and frameworks. They are, however, more prone to hallucination with custom code that occurs only in the app being analyzed. To that end, while LLMs can yield beneficial baseline data, if reverse-engineering the minified code is central to an attack scenario a practitioner is undertaking, trust but verify.
5. Payload crafting and mutation
Humans are prone to burnout -- particularly when working during off-hour testing windows and after multiple solid hours of testing. Engineers can make mistakes when crafting payloads, coming up with seeds for fuzzing and performing other testing procedures. Generative LLMs offer an alternative. A prompt such as "Generate an XSS payload that bypasses React-based sanitizers and triggers on mouseover" can greatly assist testers validating exploitability. LLMs also help those probing injection use cases -- among them SQL injection (SQLi), LDAP injection and XML injection -- as well as XSS, path traversal, JWT manipulation and other payloads.
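Local tooling can also stretch a single suggested payload into many fuzzing seeds without another round trip to the model. The sketch below applies a handful of common encodings to a harmless placeholder payload; the mutation list is illustrative, not exhaustive.

```python
# Minimal sketch: take a candidate payload (handwritten or LLM-suggested)
# and generate encoded variants to use as fuzzing seeds. The base string
# is a harmless placeholder, and the mutation list is illustrative.
import html
import urllib.parse

base = '<img src=x onmouseover=alert("marker")>'  # placeholder payload

mutations = {
    "raw": base,
    "url": urllib.parse.quote(base),
    "double_url": urllib.parse.quote(urllib.parse.quote(base)),
    "html_entities": html.escape(base),
    "hex_js": "".join(f"\\x{ord(c):02x}" for c in base),
}

for name, value in mutations.items():
    print(f"{name}: {value}")
```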
Another important caveat: This type of use case pushes right up to the edge of what many commercial LLMs will allow through their guardrails. Expect a lot of pushback here, including flat refusals, unless practitioners have a locally hosted model or an enterprise LLM tier that lets them define their own policy thresholds. Even when the LLM does block a response, there's still quite a bit of potential value in discussing methods with the model -- in the abstract, if no more specificity is allowed -- for bypassing filtering or encoding mechanisms.
Editor's note: The use cases in this article can be applied both lawfully and unlawfully. It is up to you to ensure your usage is lawful. Get appropriate permission and approval before red teaming, and handle the information obtained ethically. If you are unsure whether your usage is lawful, do not proceed until you have confirmed that it is -- for example, by discussing and validating your planned usage with your organization's counsel.
Ed Moyle is a technical writer with more than 25 years of experience in information security. He is a partner at SecurityCurve, a consulting, research and education company.