You don’t need a hundred tools to test AI systems — you need the right framework to structure the work, a couple of scanners for breadth, and a safe lab to practice in. Here’s the kit.

Frameworks: the vocabulary

Every finding should map to a shared taxonomy so clients and other testers can read your report. Five carry the load:

| Framework | What it’s for | How you use it |
| --- | --- | --- |
| OWASP Top 10 for LLM Apps (2025) | The vulnerability checklist and finding IDs (LLM01 to LLM10) | Primary test-case list and reporting labels |
| OWASP Top 10 for Agentic Apps (2026) | Autonomous-agent vulns (delegated identity, cross-agent injection, memory poisoning) | Extend coverage when the target is an agent, not a chatbot |
| MITRE ATLAS | ATT&CK-style knowledge base of real AI adversary TTPs (AML.Txxxx) + case studies | Structure the attack narrative; label each step; reuse case studies as playbooks |
| NIST AI 100-2 | Formal adversarial-ML taxonomy (evasion / poisoning / privacy / prompt-injection) | Attack-class vocabulary for the report |
| Google SAIF | Lifecycle risk map across Data / Infra / Model / App | The “where to look” lens so you test the whole pipeline, not just the prompt |

The OWASP LLM Top 10 (2025) reshuffled the 2023 list and brought in three entries worth knowing: LLM07 System Prompt Leakage and LLM08 Vector & Embedding Weaknesses are new, and LLM10 Unbounded Consumption broadens the old Model Denial of Service entry to cover cost and resource abuse. MITRE ATLAS is deliberately ATT&CK-shaped: if you know ATT&CK, you can read it immediately, and it adds AI-unique tactics like ML Model Access and ML Attack Staging.
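
One way to keep every finding mapped to the shared taxonomies is to carry the labels on the finding record itself. The sketch below is illustrative only: the `Finding` class and its field names are hypothetical, and the ATLAS ID shown is an assumption to confirm against the current ATLAS matrix before it goes in a report.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One report finding, tagged with shared-taxonomy labels."""
    title: str
    owasp_llm: str                                   # OWASP LLM Top 10 (2025) ID, e.g. "LLM01"
    atlas: list[str] = field(default_factory=list)   # MITRE ATLAS technique IDs
    nist_class: str = ""                             # NIST AI 100-2 attack class
    saif_area: str = ""                              # Google SAIF area: Data / Infra / Model / App

    def label(self) -> str:
        """Build a compact tag line, skipping any taxonomy left empty."""
        parts = [self.owasp_llm, *self.atlas, self.nist_class, self.saif_area]
        return " | ".join(p for p in parts if p)


finding = Finding(
    title="Indirect prompt injection via retrieved document",
    owasp_llm="LLM01",
    atlas=["AML.T0051"],  # assumed ID; verify against the live ATLAS matrix
    nist_class="prompt-injection",
    saif_area="App",
)
print(f"{finding.title} [{finding.label()}]")
```

Keeping the labels as structured fields rather than free text makes it easy to sort a report by taxonomy or spot test cases that map to nothing.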

Tools: the coverage

Scanners are a baseline that runs in parallel with manual testing. No scanner finds the context-specific bug in your target’s tool-chain, but scanners do clear the known patterns fast.

| Tool | Role | Best for |
| --- | --- | --- |
| garak (NVIDIA) | Vulnerability scanner, the “nmap for LLMs” | Broad first pass: injection, jailbreak, leakage, toxicity, encoding |
| PyRIT (Microsoft) | Attack-orchestration framework | Bespoke multi-turn chains and agent testing where fixed probes fall short |
| promptfoo | Eval + red-team in CI/CD | App-level scans with an OWASP/NIST/ATLAS compliance report card |
| Giskard | Test/scan library | RAG-heavy apps; hallucination and bias alongside security |
| Burp / Caido | Intercepting proxy | Seeing the real request/response and reaching the downstream web vuln the LLM triggers |

A pragmatic split: garak for the broad automated sweep, PyRIT when you need adaptive multi-turn attacks (it ships Crescendo/TAP strategies so you don’t hand-roll them), promptfoo wired into CI for regression plus a client-facing compliance mapping, and Giskard when RAG and answer-quality matter. Underneath it all, a proxy — because “improper output handling” (LLM05) only becomes a real finding when you prove the XSS/SSRF/SQLi actually fires in the sink, and that’s classic web testing.

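# garak: a broad first pass against an API-backed model
# Assumes OPENAI_API_KEY is already exported in this shell.
python -m pip install garak
# Probe module names change between releases; list what your install ships.
garak --list_probes
garak --model_type openai --model_name gpt-4o \
      --probes promptinject,dan,encoding,leakreplay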

Tool version numbers and feature sets move fast — check each project’s releases page before quoting specifics. Treat any “latest version does X” claim (including mine) as needing a look at the source.
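
To make the LLM05 point concrete, here is a minimal triage sketch for the "does it reach the sink" question. It assumes you have already saved the rendered page HTML from the proxy. The payload string and the `classify_reflection` helper are hypothetical, and a "raw" result is only a lead to confirm in a real browser, not proof of execution.

```python
import html

# Hypothetical probe payload that the model was coaxed into emitting.
MARKER = "<img src=x onerror=alert('canary-7f3a')>"


def classify_reflection(page_html: str, payload: str = MARKER) -> str:
    """Classify how a model-emitted payload appears in the rendered page."""
    if payload in page_html:
        return "raw"       # unescaped in the page: candidate XSS, confirm in a browser
    if html.escape(payload) in page_html:
        return "escaped"   # the sink HTML-encodes model output
    return "absent"        # filtered, rewritten, or never reached this page


vulnerable_page = f"<div class='msg'>{MARKER}</div>"
encoded_page = f"<div class='msg'>{html.escape(MARKER)}</div>"
print(classify_reflection(vulnerable_page))  # raw
print(classify_reflection(encoded_page))     # escaped
```

The same three-way split (raw, encoded, absent) carries over to SSRF and SQLi sinks: what matters is what the downstream component received, not what the model said.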

The lab: the reps

Never rehearse on someone else’s system. Two tiers of safe targets:

Hosted challenges — zero setup, pure prompt-craft:

| Lab | Focus |
| --- | --- |
| Lakera Gandalf | Prompt-injection to extract a secret, escalating defenses |
| PortSwigger Web LLM Attacks | Live-LLM labs: excessive agency, insecure output handling → XSS, indirect injection |

Self-hosted, vulnerable-by-design — full-methodology practice, offline, safe to point scanners at:

| Project | Teaches |
| --- | --- |
| Damn Vulnerable LLM Agent | Prompt injection in a LangChain ReAct agent (tool/thought-action manipulation) |
| AI-Goat | Local-model CTF challenges: no cloud, no signup |
| DVAIA | LLM + RAG + multimodal + agent testing via local Ollama |

Lab recipe (offline, zero spend, zero ToS exposure): run the vulnerable app in Docker on an isolated network — the same local-only pattern as a DVWA/Juice-Shop lab — back it with a local model via Ollama so nothing leaves the box, seed canary secrets in the system prompt and RAG corpus, point garak / promptfoo / PyRIT at it, and capture everything through Burp or Caido. Now you can rehearse every test case — including denial-of-wallet and guardrail evasion — with nothing real at risk.
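
The canary step in the recipe above can be sketched in a few lines. This is an illustration under stated assumptions: the helper names, the token format, and the sample responses are all hypothetical, and in a real lab the response list would come from your proxy history or scanner logs.

```python
import secrets


def make_canary(label: str) -> str:
    """Return a unique, greppable token for one seeding location."""
    return f"CANARY-{label}-{secrets.token_hex(4)}"


def find_leaks(canaries: dict[str, str], responses: list[str]) -> dict[str, list[int]]:
    """Map each seeding location to the indexes of responses that leaked its token."""
    leaks: dict[str, list[int]] = {}
    for location, token in canaries.items():
        hits = [i for i, text in enumerate(responses) if token in text]
        if hits:
            leaks[location] = hits
    return leaks


# Seed one token per location so a leak tells you where it came from.
canaries = {
    "system_prompt": make_canary("SYS"),
    "rag_corpus": make_canary("RAG"),
}

# Hypothetical captured responses, e.g. exported from the proxy history.
responses = [
    "I can't share my instructions.",
    f"Sure, my instructions say: secret={canaries['system_prompt']}",
]
print(find_leaks(canaries, responses))
```

Exact substring matching will miss a canary the model has encoded, translated, or split across turns, so treat an empty result as "no plain-text leak seen", not as a clean bill of health.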


Framework for the vocabulary, scanners for the coverage, a lab for the reps. The tools don’t make the pentest — the methodology does — but this is the kit that lets you cover ground without missing the obvious.

Framework sources: OWASP GenAI, MITRE ATLAS, NIST AI 100-2, Google SAIF. Tools: garak, PyRIT, promptfoo, Giskard.