The AI Testing Toolkit & Frameworks
You don’t need a hundred tools to test AI systems — you need the right framework to structure the work, a couple of scanners for breadth, and a safe lab to practice in. Here’s the kit.
Frameworks: the vocabulary
Every finding should map to a shared taxonomy so clients and other testers can read your report. Five carry the load:
| Framework | What it’s for | How you use it |
|---|---|---|
| OWASP Top 10 for LLM Apps (2025) | The vulnerability checklist and finding IDs (LLM01–LLM10) | Primary test-case list and reporting labels |
| OWASP Top 10 for Agentic Apps (2026) | Autonomous-agent vulns (delegated identity, cross-agent injection, memory poisoning) | Extend coverage when the target is an agent, not a chatbot |
| MITRE ATLAS | ATT&CK-style knowledge base of real AI adversary TTPs (AML.Txxxx) + case studies | Structure the attack narrative; label each step; reuse case studies as playbooks |
| NIST AI 100-2 | Formal adversarial-ML taxonomy (evasion / poisoning / privacy / prompt-injection) | Attack-class vocabulary for the report |
| Google SAIF | Lifecycle risk map across Data / Infra / Model / App | The “where to look” lens so you test the whole pipeline, not just the prompt |
The OWASP LLM Top 10 (2025) reshuffled the 2023 list. Three entries are worth knowing: LLM07 System Prompt Leakage and LLM08 Vector & Embedding Weaknesses are new, and LLM10 Unbounded Consumption broadens the 2023 Model Denial of Service entry to cover cost and resource abuse. MITRE ATLAS is deliberately ATT&CK-shaped: if you know ATT&CK, you can read it immediately, and it adds AI-specific tactics such as ML Model Access and ML Attack Staging.
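One way to keep every finding mapped to the shared vocabulary is to make the mapping part of the finding record itself. The sketch below is a hypothetical schema, not something any of these frameworks prescribe: the `Finding` class, its field names, and the short SAIF labels are my own, the optional sub-technique suffix on ATLAS IDs is an assumption, and the example ATLAS ID should be looked up in the current matrix before it goes in a report.

```python
import re
from dataclasses import dataclass

# ID shapes follow the framework table: LLM01-LLM10 and AML.Txxxx.
# The optional ".NNN" sub-technique suffix is an assumption; check the ATLAS matrix.
OWASP_LLM_ID = re.compile(r"LLM(0[1-9]|10)")
ATLAS_ID = re.compile(r"AML\.T\d{4}(\.\d{3})?")
NIST_CLASSES = {"evasion", "poisoning", "privacy", "prompt-injection"}
SAIF_COMPONENTS = {"data", "infra", "model", "app"}


@dataclass(frozen=True)
class Finding:
    """One report finding, labelled in four vocabularies (hypothetical schema)."""

    title: str
    owasp_llm: str
    atlas_technique: str
    nist_class: str
    saif_component: str

    def __post_init__(self) -> None:
        # Reject labels that do not match the expected shape or vocabulary.
        if not OWASP_LLM_ID.fullmatch(self.owasp_llm):
            raise ValueError(f"not an OWASP LLM ID: {self.owasp_llm!r}")
        if not ATLAS_ID.fullmatch(self.atlas_technique):
            raise ValueError(f"not an ATLAS technique ID: {self.atlas_technique!r}")
        if self.nist_class not in NIST_CLASSES:
            raise ValueError(f"unknown NIST attack class: {self.nist_class!r}")
        if self.saif_component not in SAIF_COMPONENTS:
            raise ValueError(f"unknown SAIF component: {self.saif_component!r}")


example = Finding(
    title="Indirect prompt injection via retrieved document",
    owasp_llm="LLM01",
    atlas_technique="AML.T0051",  # verify the exact technique in ATLAS before reporting
    nist_class="prompt-injection",
    saif_component="app",
)
```

Validating at construction time means a typo like `LLM11` fails when you write the finding, not when the client reads the report. For agent targets you would add a field for the Agentic Top 10 label in the same way.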
Tools: the coverage
Scanners are a baseline that runs in parallel with manual testing. None of them will find the context-specific bug in your target’s tool chain, but they clear the known patterns fast.
| Tool | Role | Best for |
|---|---|---|
| garak (NVIDIA) | Vulnerability scanner — the “nmap for LLMs” | Broad first pass: injection, jailbreak, leakage, toxicity, encoding |
| PyRIT (Microsoft) | Attack-orchestration framework | Bespoke multi-turn chains and agent testing where fixed probes fall short |
| promptfoo | Eval + red-team in CI/CD | App-level scans with an OWASP/NIST/ATLAS compliance report card |
| Giskard | Test/scan library | RAG-heavy apps; hallucination and bias alongside security |
| Burp / Caido | Intercepting proxy | Seeing the real request/response and reaching the downstream web vuln the LLM triggers |
A pragmatic split: garak for the broad automated sweep, PyRIT when you need adaptive multi-turn attacks (it ships Crescendo/TAP strategies so you don’t hand-roll them), promptfoo wired into CI for regression plus a client-facing compliance mapping, and Giskard when RAG and answer-quality matter. Underneath it all, a proxy — because “improper output handling” (LLM05) only becomes a real finding when you prove the XSS/SSRF/SQLi actually fires in the sink, and that’s classic web testing.
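To make the LLM05 point concrete, here is a minimal triage sketch for bodies you export from the proxy. It assumes you planted a marker containing HTML-special characters in the model's output and want to know whether the downstream page reflected it raw or escaped. The function name and the marker are hypothetical, and a raw reflection is only a lead: you still have to confirm in the browser that the payload fires.

```python
import html


def classify_reflection(marker: str, response_body: str) -> str:
    """Return 'raw', 'escaped', or 'absent' for a marker in a captured response body."""
    escaped = html.escape(marker)
    if escaped == marker:
        # Without <, >, &, or quotes the raw and escaped forms are identical,
        # so the check could not tell them apart.
        raise ValueError("marker must contain at least one HTML-special character")
    if marker in response_body:
        return "raw"  # reflected without encoding: worth a manual XSS attempt
    if escaped in response_body:
        return "escaped"  # the sink encodes output: likely not exploitable here
    return "absent"


marker = '<b id="probe-7f3a">'  # hypothetical harmless marker
status = classify_reflection(marker, '<p>Summary: <b id="probe-7f3a">hi</b></p>')
```

The same idea extends to SSRF and SQLi only loosely, since those need a sink-specific signal (an out-of-band callback, a database error) instead of a string match.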
```shell
# garak: a broad first pass against an API-backed model
# (garak's OpenAI generator reads the API key from OPENAI_API_KEY)
python -m pip install garak
garak --model_type openai --model_name gpt-4o \
  --probes promptinject,dan,encoding,leakreplay
```
Tool version numbers and feature sets move fast — check each project’s releases page before quoting specifics. Treat any “latest version does X” claim (including mine) as needing a look at the source.
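One cheap habit that follows from this: record the exact versions you ran next to your results, so nobody has to guess later. A minimal sketch using the standard library's `importlib.metadata`; the helper name is mine, and the distribution names (`garak`, `pyrit`, `giskard`) are assumptions about how each project is published on PyPI. promptfoo is normally installed through npm, so it is not covered here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


# Distribution names are assumptions; adjust to what `pip list` shows.
tool_versions = installed_versions(["garak", "pyrit", "giskard"])
```

Paste the resulting dictionary into the report appendix or the run log alongside the scan output.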
The lab: the reps
Never rehearse on someone else’s system. Two tiers of safe targets:
Hosted challenges — zero setup, pure prompt-craft:
| Lab | Focus |
|---|---|
| Lakera Gandalf | Prompt-injection to extract a secret, escalating defenses |
| PortSwigger Web LLM Attacks | Live-LLM labs: excessive agency, insecure output handling → XSS, indirect injection |
Self-hosted, vulnerable-by-design — full-methodology practice, offline, safe to point scanners at:
| Project | Teaches |
|---|---|
| Damn Vulnerable LLM Agent | Prompt injection in a LangChain ReAct agent (tool/thought-action manipulation) |
| AI-Goat | Local-model CTF challenges — no cloud, no signup |
| DVAIA | LLM + RAG + multimodal + agent testing via local Ollama |
Lab recipe (offline, zero spend, zero ToS exposure): run the vulnerable app in Docker on an isolated network — the same local-only pattern as a DVWA/Juice-Shop lab — back it with a local model via Ollama so nothing leaves the box, seed canary secrets in the system prompt and RAG corpus, point garak / promptfoo / PyRIT at it, and capture everything through Burp or Caido. Now you can rehearse every test case — including denial-of-wallet and guardrail evasion — with nothing real at risk.
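The canary step in that recipe can be sketched in a few lines. This assumes you seed one unique token per location (system prompt, RAG document) and later search the transcripts you captured through the proxy. The function names, the `CANARY-` prefix, and the labels are all hypothetical.

```python
import secrets


def make_canary(label: str) -> str:
    """Build a unique, greppable token to plant in a prompt or document."""
    return f"CANARY-{label}-{secrets.token_hex(8)}"


def find_leaks(canaries: dict[str, str], transcript: str) -> list[str]:
    """Return the labels of canaries whose exact value appears in the transcript."""
    return [label for label, value in canaries.items() if value in transcript]


# One canary per place a secret could leak from.
canaries = {
    "system_prompt": make_canary("sys"),
    "rag_doc": make_canary("rag"),
}
transcript = f"Sure! The hidden note says {canaries['rag_doc']}."
leaked = find_leaks(canaries, transcript)
```

Because each location gets its own token, a hit tells you which source leaked, which maps straight onto LLM07 (system prompt) or LLM08 (vector store). Exact matching will miss a canary the model re-encodes or splits across turns, so treat an empty result as "not found by this check" and not as proof of no leak.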
Framework for the vocabulary, scanners for the coverage, a lab for the reps. The tools don’t make the pentest — the methodology does — but this is the kit that lets you cover ground without missing the obvious.
Framework sources: OWASP GenAI, MITRE ATLAS, NIST AI 100-2, Google SAIF. Tools: garak, PyRIT, promptfoo, Giskard.