The AI Testing Toolkit & Frameworks

You don’t need a hundred tools to test AI systems — you need the right framework to structure the work, a couple of scanners for breadth, and a safe lab to practice in. Here’s the kit.

Frameworks: the vocabulary

Every finding should map to a shared taxonomy so clients and other testers can read your report. Five carry the load:

Framework	What it’s for	How you use it
OWASP Top 10 for LLM Apps (2025)	The vulnerability checklist and finding IDs (`LLM01`–`LLM10`)	Primary test-case list and reporting labels
OWASP Top 10 for Agentic Apps (2026)	Autonomous-agent vulns (delegated identity, cross-agent injection, memory poisoning)	Extend coverage when the target is an agent, not a chatbot
MITRE ATLAS	ATT&CK-style knowledge base of real AI adversary TTPs (`AML.Txxxx`) + case studies	Structure the attack narrative; label each step; reuse case studies as playbooks
NIST AI 100-2	Formal adversarial-ML taxonomy (evasion / poisoning / privacy / prompt-injection)	Attack-class vocabulary for the report
Google SAIF	Lifecycle risk map across Data / Infra / Model / App	The “where to look” lens so you test the whole pipeline, not just the prompt

The OWASP LLM Top 10 (2025) reshuffled the 2023 list. Three entries are worth knowing: LLM07 System Prompt Leakage and LLM08 Vector and Embedding Weaknesses are new, and LLM10 Unbounded Consumption widens the older Model Denial of Service entry to cover runaway cost as well as availability. MITRE ATLAS is deliberately ATT&CK-shaped: if you know ATT&CK, you can read it immediately, and it adds AI-specific tactics such as ML Model Access and ML Attack Staging.
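One way to keep findings tied to the shared vocabulary is to validate the labels before they reach the report. The sketch below is illustrative only: the `Finding` class, its field names, and the `problems` helper are hypothetical, and the patterns check only the shape of an ID (`LLM01`–`LLM10`, `AML.Txxxx`), not whether it exists in the current catalog. The `AML.T0051` value in the usage line is assumed to be the ATLAS prompt-injection technique; confirm it against ATLAS before quoting it.

```python
import re
from dataclasses import dataclass, field

# Shape checks only: these do not confirm the ID exists in the live catalog.
OWASP_LLM_ID = re.compile(r"^LLM(0[1-9]|10)$")               # LLM01..LLM10
ATLAS_TECHNIQUE_ID = re.compile(r"^AML\.T\d{4}(\.\d{3})?$")  # AML.Txxxx, optional sub-technique
SAIF_COMPONENTS = {"Data", "Infra", "Model", "App"}          # the four areas named in the table above


@dataclass
class Finding:
    """Hypothetical report record; field names are illustrative."""
    title: str
    owasp_llm: str
    atlas_techniques: list[str] = field(default_factory=list)
    nist_class: str = ""
    saif_component: str = ""

    def problems(self) -> list[str]:
        """Return a list of labelling problems; empty means the labels are well-formed."""
        issues = []
        if not OWASP_LLM_ID.match(self.owasp_llm):
            issues.append(f"bad OWASP LLM id: {self.owasp_llm!r}")
        for technique in self.atlas_techniques:
            if not ATLAS_TECHNIQUE_ID.match(technique):
                issues.append(f"bad ATLAS technique id: {technique!r}")
        if self.saif_component and self.saif_component not in SAIF_COMPONENTS:
            issues.append(f"unknown SAIF component: {self.saif_component!r}")
        return issues


finding = Finding(
    title="System prompt disclosed via role-play",
    owasp_llm="LLM07",
    atlas_techniques=["AML.T0051"],   # assumed ID, verify against ATLAS
    nist_class="prompt-injection",
    saif_component="App",
)
print(finding.problems())
```

Running a check like this over every finding before export catches typos such as `LLM11` or an ATT&CK ID pasted where an ATLAS ID belongs.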

Tools: the coverage

Scanners are a baseline that runs in parallel with manual testing. No scanner will find the context-specific bug in your target’s tool chain, but a scanner clears the known patterns fast.

Tool	Role	Best for
garak (NVIDIA)	Vulnerability scanner — the “nmap for LLMs”	Broad first pass: injection, jailbreak, leakage, toxicity, encoding
PyRIT (Microsoft)	Attack-orchestration framework	Bespoke multi-turn chains and agent testing where fixed probes fall short
promptfoo	Eval + red-team in CI/CD	App-level scans with an OWASP/NIST/ATLAS compliance report card
Giskard	Test/scan library	RAG-heavy apps; hallucination and bias alongside security
Burp / Caido	Intercepting proxy	Seeing the real request/response and reaching the downstream web vuln the LLM triggers

A pragmatic split: garak for the broad automated sweep; PyRIT when you need adaptive multi-turn attacks (it ships Crescendo and Tree of Attacks with Pruning (TAP) strategies, so you don’t hand-roll them); promptfoo wired into CI for regression plus a client-facing compliance mapping; and Giskard when RAG and answer quality matter. Underneath it all sits a proxy, because “improper output handling” (LLM05) only becomes a real finding when you prove the XSS/SSRF/SQLi actually fires in the sink, and that’s classic web testing.

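# garak: a broad first pass against an API-backed model
python -m pip install -U garak
# The openai generator reads its key from the environment (placeholder value shown)
export OPENAI_API_KEY="sk-..."
# Optional: confirm the probe names exist in your installed version
garak --list_probes
garak --model_type openai --model_name gpt-4o \
      --probes promptinject,dan,encoding,leakreplay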

Tool version numbers and feature sets move fast — check each project’s releases page before quoting specifics. Treat any “latest version does X” claim (including mine) as needing a look at the source.
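The point about improper output handling (LLM05) can be made concrete with a small triage helper for responses captured through the proxy. This is a minimal sketch, not a detector: `classify_reflection` and the marker string are hypothetical, and it only tells you whether a marker you got the model to emit came back raw, HTML-escaped, or not at all. A "raw" result is a lead that still has to be proven in a browser or at the actual sink.

```python
import html


def classify_reflection(marker: str, response_body: str) -> str:
    """Classify how a marker emitted via the LLM shows up in a captured response.

    Returns "raw" if the marker appears verbatim, "escaped" if only its
    HTML-escaped form appears, and "absent" otherwise.
    """
    if marker in response_body:
        return "raw"
    if html.escape(marker) in response_body:
        return "escaped"
    return "absent"


# A quote-free marker avoids false negatives from apps that escape quotes differently.
marker = "<u id=canary7f3a>"

print(classify_reflection(marker, "<div><u id=canary7f3a>hi</div>"))
print(classify_reflection(marker, "<div>&lt;u id=canary7f3a&gt;hi</div>"))
print(classify_reflection(marker, "<div>hi</div>"))
```

The helper deliberately stops at triage. Context matters (attribute, script block, JSON, SQL), so the proof that the payload fires remains manual web testing.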

The lab: the reps

Never rehearse on someone else’s system. Two tiers of safe targets:

Hosted challenges — zero setup, pure prompt-craft:

Lab	Focus
Lakera Gandalf	Prompt-injection to extract a secret, escalating defenses
PortSwigger Web LLM Attacks	Live-LLM labs: excessive agency, insecure output handling → XSS, indirect injection

Self-hosted, vulnerable-by-design — full-methodology practice, offline, safe to point scanners at:

Project	Teaches
Damn Vulnerable LLM Agent	Prompt injection in a LangChain ReAct agent (tool/thought-action manipulation)
AI-Goat	Local-model CTF challenges — no cloud, no signup
DVAIA	LLM + RAG + multimodal + agent testing via local Ollama

Lab recipe (offline, zero spend, zero ToS exposure): run the vulnerable app in Docker on an isolated network — the same local-only pattern as a DVWA/Juice-Shop lab — back it with a local model via Ollama so nothing leaves the box, seed canary secrets in the system prompt and RAG corpus, point garak / promptfoo / PyRIT at it, and capture everything through Burp or Caido. Now you can rehearse every test case — including denial-of-wallet and guardrail evasion — with nothing real at risk.
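The canary step in the recipe can be sketched in a few lines. Everything here is an assumption for illustration: `make_canary`, `leaked_canaries`, the `CANARY-` prefix, and the location labels are hypothetical names, and the match is an exact substring test, so a canary leaked in encoded or split form (base64, spaced-out characters) will be missed unless you add decoding passes.

```python
import secrets


def make_canary(label: str) -> str:
    """Create a unique, greppable token to plant in the lab target."""
    return f"CANARY-{label}-{secrets.token_hex(8)}"


def leaked_canaries(canaries: dict[str, str], transcript: str) -> list[str]:
    """Return the planting locations whose canary appears verbatim in the transcript."""
    return [where for where, token in canaries.items() if token in transcript]


# Plant one canary per location so a leak tells you which layer gave it up.
canaries = {
    "system_prompt": make_canary("SYS"),
    "rag_corpus": make_canary("RAG"),
}

# Stand-in for model output captured from a scanner run or the proxy log.
transcript = f"Sure! The hidden note says {canaries['rag_corpus']}."

print(leaked_canaries(canaries, transcript))  # only the RAG canary is present here
```

Using a distinct canary per location turns "something leaked" into "the RAG corpus leaked but the system prompt held", which maps more cleanly onto separate findings.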

Framework for the vocabulary, scanners for the coverage, a lab for the reps. The tools don’t make the pentest — the methodology does — but this is the kit that lets you cover ground without missing the obvious.

Framework sources: OWASP GenAI, MITRE ATLAS, NIST AI 100-2, Google SAIF. Tools: garak, PyRIT, promptfoo, Giskard.