
Ultimate Benchmark for Yaga Pentest Agent


This paper presents an empirical comparative evaluation of large language models (LLMs) for penetration testing conducted autonomously and in hybrid mode. The study is carried out through YAGA, an artificial intelligence agent for penetration testing developed by HackerSec, which autonomously conducts the complete cycle of reconnaissance, exploitation, and post-exploitation. After autonomous execution, results undergo a human validation step that verifies finding accuracy, eliminates false positives, and contextualizes the impact of identified vulnerabilities in the client environment.

The benchmark spans 124 scenarios distributed across black-box, gray-box, and white-box configurations, of which 40 require multi-stage exploitation chains to achieve objectives such as RCE, chained SSRF, and privilege escalation. Five models are evaluated as the cognitive engine of the agent: Claude Opus 4.8, Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-5.5. Multi-agent coordination employs a stigmergic model based on a shared blackboard with pheromone semantics, where attack chains emerge from indirect interaction among specialized agents without central orchestrator prescription.

An LLM classifier coupled with retrieval-augmented generation (RAG) enables dynamic playbook selection, while intrinsic curiosity via PPO with Random Network Distillation (RND) guides exploration of novel adversarial states. Results indicate that Claude Opus 4.8 achieves a 91.2% success rate on complex chains, followed by GPT-5.5 at 87.8%.

Keywords: autonomous penetration testing, offensive AI agent, attack chains, stigmergy, RAG, reinforcement learning, intrinsic curiosity, attack graph

I. Introduction

The automation of penetration testing through agents based on large language models (LLMs) represents an emerging paradigm in offensive security. While traditional tools such as Metasploit and Burp Suite automate individual exploits, they lack the strategic reasoning capability required to chain vulnerabilities into multi-stage attack paths — a skill that distinguishes experienced pentesters from automated scanners.

YAGA is an artificial intelligence agent for penetration testing developed by HackerSec. Unlike monolithic frameworks or pipelines, YAGA operates as an autonomous agent with adversarial reasoning capability, strategic planning, and tactical execution. The agent autonomously conducts the complete pentest cycle: reconnaissance, enumeration, exploitation, post-exploitation, and report generation, making strategic decisions based on environmental state and accumulated knowledge.

After autonomous execution completes, a human validation step is conducted to verify the accuracy of reported findings, discard false positives, assess the real impact of vulnerabilities in the client-specific context, and ensure that remediation recommendations are applicable and properly prioritized. This model preserves the efficiency and scalability of full automation while guaranteeing the reliability of delivered results.

Recent work demonstrates that LLMs can execute offensive security tasks when equipped with appropriate tools [1, 2]. However, most existing evaluations focus on isolated vulnerabilities, neglecting scenarios where achieving an objective (such as RCE on an internal server) requires sequential composition of multiple individually low-severity vulnerabilities, which we term emergent attack chains.

In this work, we present three main contributions: (1) an extensive LLM benchmark across 124 pentest scenarios with graded complexity, including 40 scenarios requiring exploitation chains; (2) YAGA's multi-agent coordination architecture, based on stigmergy, enabling attack chain emergence without central prescription; and (3) formal stopping criteria combining structural, epistemic, and reinforcement learning metrics to determine when exploration has reached saturation.

II. Related Work

PentestGPT [1] demonstrated that GPT-4 can interactively guide a human tester through penetration scenarios. ReaperAI [2] proposed an autonomous agent framework with hierarchical planning. HackTheBox Benchmark [3] evaluated models on isolated CTFs. None of these works explicitly addresses emergent attack chains or multi-agent coordination without a central orchestrator.

The concept of computational stigmergy, originating from swarm intelligence [4], has been applied in multi-robot robotics but remains unexplored in the offensive security domain. AutoAttacker [5] introduced an LLM-driven pipeline for automated pentest, but operates with a monolithic planner that does not scale to environments with multiple simultaneous attack vectors. Our approach fundamentally diverges by distributing intelligence among specialized agents that coordinate via shared artifacts, not direct communication.

III. Multi-Agent Coordination Architecture

A. LLM Classifier with Retrieval-Augmented Generation (RAG)

The first architectural component is a strategist agent that receives the reconnaissance phase output and determines which attack categories apply to the target. This agent uses a fine-tuned LLM classifier to map reconnaissance artifacts (open ports, detected technologies, response headers, software versions) to a relevant set of MITRE ATT&CK tactics.

The plan generated by the strategist is cross-referenced with a RAG system that indexes an attack playbook base. Playbooks are stored as vector embeddings in an HNSW index, where each playbook contains preconditions, action sequences, success indicators, and fallback criteria. Retrieval uses cosine similarity between the reconnaissance context embedding and playbook embeddings, with an adaptive threshold based on classifier confidence:

$$\mathrm{sim}(q, p_i) = \frac{E_q \cdot E_{p_i}}{\lVert E_q \rVert \, \lVert E_{p_i} \rVert} > \tau \cdot \sigma(c)$$

Where $E_q$ is the reconnaissance query embedding, $E_{p_i}$ is the embedding of playbook $i$, $\tau$ is the base threshold, and $\sigma(c)$ is the sigmoid of the classifier confidence $c$. With high classification confidence the effective threshold rises, so the system retrieves only more specific playbooks; with low confidence the threshold drops and it accepts broader matches.
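The thresholded retrieval rule can be illustrated with a minimal Python sketch. It is not YAGA's implementation: the playbook names, the 3-dimensional embeddings, and the values of τ and c are hypothetical, and a brute-force scan stands in for the HNSW index mentioned above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sigmoid(c):
    return 1.0 / (1.0 + math.exp(-c))

def retrieve(query_emb, playbooks, tau, confidence):
    """Return (name, similarity) pairs with sim > tau * sigmoid(confidence), best first."""
    threshold = tau * sigmoid(confidence)
    hits = [(name, cosine(query_emb, emb)) for name, emb in playbooks.items()]
    return sorted((h for h in hits if h[1] > threshold), key=lambda h: -h[1])

# Hypothetical playbook embeddings, for illustration only.
PLAYBOOKS = {
    "ssrf-internal-scan": [0.9, 0.1, 0.0],
    "sqli-union": [0.5, 0.5, 0.0],
    "xxe-file-read": [0.0, 0.2, 0.9],
}
QUERY = [1.0, 0.0, 0.0]

# High confidence raises the threshold; low confidence lowers it.
high = retrieve(QUERY, PLAYBOOKS, tau=0.9, confidence=3.0)
low = retrieve(QUERY, PLAYBOOKS, tau=0.9, confidence=-1.0)
```

With these assumed values, the high-confidence call keeps only the closest playbook, while the low-confidence call also admits the broader `sqli-union` match.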

The strategist also prioritizes the execution order of retrieved playbooks using a utility function that considers the estimated probability of success, potential impact (based on CVSS when available), and the operational cost of the attempt:

$$U(p_i) = w_1 \cdot P(\text{success} \mid \text{context}) + w_2 \cdot \mathrm{impact}(p_i) - w_3 \cdot \mathrm{cost}(p_i)$$
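As a rough illustration of how this utility could order retrieved playbooks, the sketch below ranks three hypothetical candidates. The weights, the candidate names, and the normalization of impact and cost to [0, 1] are assumptions made for the example, not values reported by the study.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    p_success: float  # estimated P(success | context), in [0, 1]
    impact: float     # e.g. a CVSS base score rescaled to [0, 1] (assumption)
    cost: float       # normalized operational cost, in [0, 1] (assumption)

def utility(c, w1=0.5, w2=0.4, w3=0.1):
    """U(p) = w1 * P(success|context) + w2 * impact(p) - w3 * cost(p)."""
    return w1 * c.p_success + w2 * c.impact - w3 * c.cost

def prioritize(candidates, **weights):
    """Order candidates by descending utility."""
    return sorted(candidates, key=lambda c: utility(c, **weights), reverse=True)

# Hypothetical candidates, for illustration only.
CANDIDATES = [
    Candidate("ssrf-internal-scan", 0.6, 0.90, 0.5),
    Candidate("default-creds", 0.3, 0.98, 0.1),
    Candidate("sqli-union", 0.8, 0.75, 0.2),
]
ORDER = [c.name for c in prioritize(CANDIDATES)]
```

Under these weights the likely-to-succeed SQL injection playbook runs first even though its impact is the lowest of the three; raising `w2` would shift the order toward high-impact attempts.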

B. Scatter-Gather with Cross-Deduplication

In the implemented scatter-gather pattern, exploitation tasks are distributed to multiple specialized agents in parallel: an SQL injection agent, an XSS agent, an SSRF agent, etc. Each agent operates independently on the same set of discovered endpoints, producing findings with evidence and severity.

In the gather phase, an orchestrator deduplicates findings using semantic similarity between evidence and cross-references results from different agents. An SSRF finding identified by one agent, when cross-referenced with an open redirect finding from another agent, can reveal an attack chain that no individual agent would have prescribed.

Deduplication operates in two layers: (1) exact dedup by hash of technical evidence (endpoint + payload + response signature), and (2) semantic dedup using embedding distance to group findings that describe the same underlying vulnerability with superficial variations.
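The two layers can be sketched as follows. The finding fields, the 2-dimensional embeddings, and the 0.95 similarity threshold are hypothetical; the source does not state which embedding model or distance cutoff is used.

```python
import hashlib
import math

def evidence_hash(finding):
    """Layer 1 key: hash of endpoint + payload + response signature."""
    key = "|".join([finding["endpoint"], finding["payload"], finding["response_sig"]])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def deduplicate(findings, sim_threshold=0.95):
    """Drop exact duplicates by hash, then near-duplicates by embedding similarity."""
    seen_hashes = set()
    kept = []
    for f in findings:
        h = evidence_hash(f)
        if h in seen_hashes:
            continue  # layer 1: identical technical evidence
        seen_hashes.add(h)
        if any(cosine(f["embedding"], k["embedding"]) >= sim_threshold for k in kept):
            continue  # layer 2: same underlying vulnerability, superficial variation
        kept.append(f)
    return kept

# Hypothetical findings, for illustration only.
FINDINGS = [
    {"id": "A", "endpoint": "/item", "payload": "' OR 1=1--", "response_sig": "500", "embedding": [1.0, 0.0]},
    {"id": "A-copy", "endpoint": "/item", "payload": "' OR 1=1--", "response_sig": "500", "embedding": [1.0, 0.0]},
    {"id": "A-variant", "endpoint": "/item", "payload": "' OR 2=2--", "response_sig": "500", "embedding": [0.99, 0.05]},
    {"id": "B", "endpoint": "/fetch", "payload": "http://127.0.0.1", "response_sig": "200", "embedding": [0.0, 1.0]},
]
UNIQUE = [f["id"] for f in deduplicate(FINDINGS)]
```

The exact copy is removed by the hash layer and the payload variant by the semantic layer, leaving one SQL injection finding and one SSRF finding.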

C. Stigmergy and Shared Blackboard

The most significant architectural contribution is the stigmergic coordination model. Instead of a central orchestrator prescribing sequential actions, each agent has a trigger predicate — a condition on the blackboard state that activates it. Agents coordinate by reading and writing findings to a shared blackboard with vector search support, where each finding carries a pheromone weight.

The pheromone is a scalar in the range [0, 1] representing the temporal relevance and quality of the finding. The weight decays exponentially over time:

$$\varphi(t) = \varphi_0 \cdot e^{-\lambda (t - t_0)}$$

Where $\varphi_0$ is the initial pheromone (proportional to finding severity), $\lambda$ is the decay rate, and $t_0$ is the creation timestamp. This mechanism naturally eliminates obsolete paths without manual intervention.
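A short sketch of the decay rule follows. The decay rate (chosen here so that the weight halves every 600 seconds), the pruning floor of 0.05, and the finding records are illustrative assumptions; the source gives no concrete values.

```python
import math

def pheromone(phi0, decay_rate, created_at, now):
    """phi(t) = phi0 * exp(-lambda * (t - t0))."""
    return phi0 * math.exp(-decay_rate * (now - created_at))

def evaporate(findings, now, decay_rate, floor=0.05):
    """Keep only findings whose current weight is at or above a floor (hypothetical rule)."""
    return [f for f in findings
            if pheromone(f["phi0"], decay_rate, f["t0"], now) >= floor]

# Assumed decay rate: half-life of 600 seconds.
LAMBDA = math.log(2) / 600.0

# Hypothetical findings with timestamps in seconds.
BOARD = [
    {"id": "old-ssrf", "phi0": 0.5, "t0": 0.0},
    {"id": "fresh-shell", "phi0": 0.5, "t0": 3000.0},
]
ALIVE = [f["id"] for f in evaporate(BOARD, now=3600.0, decay_rate=LAMBDA)]
```

After one hour the older finding has passed six half-lives and falls below the floor, while the finding created ten minutes ago still carries half of its initial weight.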

Attack chains emerge organically: a reconnaissance finding wakes the classifier; a high-severity classification wakes the exploit agent; exploit results return to the board and wake the report agent. The order is not prescribed — it emerges from the blackboard state. Examples of trigger predicates:

  • SQLi Agent: activates when the blackboard contains endpoints with untested query parameters
  • PrivEsc Agent: activates when shells have been obtained without root privileges
  • Lateral Movement Agent: activates when credentials or tokens have been extracted from a compromised host
  • Report Agent: activates when no exploitation agent is active and the blackboard has stabilized
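The trigger predicates listed above can be sketched as plain functions over a blackboard. The `Blackboard` class, the finding kinds, and the simplification of "stabilized" to "no exploitation trigger currently holds" are assumptions for illustration; vector search and pheromone weights are omitted.

```python
class Blackboard:
    """Minimal shared store of findings (hypothetical structure)."""
    def __init__(self):
        self.findings = []

    def post(self, kind, **data):
        self.findings.append({"kind": kind, **data})

    def of_kind(self, kind):
        return [f for f in self.findings if f["kind"] == kind]

def sqli_trigger(bb):
    # Endpoints with untested query parameters.
    return any(not f.get("tested", False) for f in bb.of_kind("query_param"))

def privesc_trigger(bb):
    # Shells obtained without root privileges.
    return any(not f.get("root", False) for f in bb.of_kind("shell"))

def lateral_trigger(bb):
    # Credentials or tokens extracted from a compromised host.
    return bool(bb.of_kind("credential"))

TRIGGERS = {"sqli": sqli_trigger, "privesc": privesc_trigger, "lateral": lateral_trigger}

def awake_agents(bb):
    """Names of exploitation agents whose predicate holds on the current state."""
    return sorted(name for name, pred in TRIGGERS.items() if pred(bb))

def report_trigger(bb):
    # Simplification: report once findings exist and no exploitation agent is awake.
    return bool(bb.findings) and not awake_agents(bb)

bb = Blackboard()
bb.post("query_param", endpoint="/search?q=", tested=False)
first = awake_agents(bb)
bb.findings[0]["tested"] = True
bb.post("shell", host="10.0.0.5", root=False)
bb.post("credential", host="10.0.0.5", user="svc_backup")
second = awake_agents(bb)
```

No function here calls another agent directly: which agents run next is decided only by what has been written to the board.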

IV. Intrinsic Curiosity-Driven Exploration

To overcome the sparse reward problem inherent in autonomous pentesting — where an agent may execute hundreds of actions before obtaining a shell — we incorporate intrinsic curiosity into the reinforcement learning framework. The agent is trained with Proximal Policy Optimization (PPO) combined with Random Network Distillation (RND).

RND operates through two neural networks: a fixed network f (target) that generates deterministic embeddings for environment states, and a trainable network (predictor) that attempts to predict the fixed network embeddings. The prediction error constitutes the curiosity bonus:

$$r_{\text{curiosity}}(s_t) = \lVert \hat{f}(s_t; \theta) - f(s_t) \rVert^2$$

Novel states, never visited by the agent, produce high prediction error (high curiosity), while already-explored states produce low error. The combined total reward is:

$$r_{\text{total}}(s_t, a_t) = \alpha \cdot r_{\text{ext}}(s_t, a_t) + \beta \cdot r_{\text{curiosity}}(s_t)$$

Where $r_{\text{ext}}$ is the extrinsic reward (successful exploits, collected information) and $\alpha$, $\beta$ are balance coefficients. Coefficient $\beta$ is annealed throughout training so that curiosity dominates in the initial exploration phase and extrinsic reward dominates in the mature phase.
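The curiosity bonus and the combined reward can be illustrated with a toy sketch. It uses single linear layers in place of neural networks, omits PPO entirely, and assumes a linear annealing schedule for β; the dimensions, learning rate, and state vectors are all hypothetical.

```python
import random

def make_linear(in_dim, out_dim, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

def forward(weights, state):
    return [sum(w * s for w, s in zip(row, state)) for row in weights]

class RND:
    """Toy Random Network Distillation with linear target and predictor."""
    def __init__(self, in_dim, out_dim=4, lr=0.1, seed=0):
        rng = random.Random(seed)
        self.target = make_linear(in_dim, out_dim, rng)     # fixed network f
        self.predictor = make_linear(in_dim, out_dim, rng)  # trainable network f_hat
        self.lr = lr

    def _error(self, state):
        pred, tgt = forward(self.predictor, state), forward(self.target, state)
        return [p - t for p, t in zip(pred, tgt)]

    def bonus(self, state):
        """Squared prediction error, used as the curiosity reward."""
        return sum(e * e for e in self._error(state))

    def update(self, state):
        """One gradient step moving the predictor toward the target at this state."""
        err = self._error(state)
        for row, e in zip(self.predictor, err):
            for j, s in enumerate(state):
                row[j] -= self.lr * 2.0 * e * s

def annealed_beta(step, total_steps, beta0=1.0):
    """Linear decay of the curiosity coefficient (assumed schedule)."""
    return beta0 * max(0.0, 1.0 - step / total_steps)

def total_reward(r_ext, r_cur, alpha, beta):
    return alpha * r_ext + beta * r_cur

rnd = RND(in_dim=3)
visited, novel = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
novel_before = rnd.bonus(novel)
for _ in range(50):
    rnd.update(visited)
```

After repeated visits the bonus for `visited` shrinks toward zero, while the bonus for `novel` is unchanged because the predictor was never trained on it.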

V. Emergent Attack Chains

We define an attack chain as an ordered sequence of actions $a_1, a_2, \ldots, a_n$ in which each action $a_i$ depends on the result of the preceding action to be viable, and whose composition achieves an objective that no individual action would reach. Formally:

$$\mathrm{Chain}(G) = \{(a_1, \ldots, a_n) \mid \forall i > 1: \mathrm{pre}(a_i) \subseteq \mathrm{post}(a_{i-1}) \,\wedge\, \mathrm{post}(a_n) \supseteq G\}$$

In the evaluated benchmark, the most frequent attack chains included:

  • SSRF → Internal Service Discovery → RCE: external SSRF used to map internal services, followed by exploit on an unpatched service accessible only internally
  • SQL Injection → Credential Extraction → Lateral Movement → PrivEsc: SQL injection to extract hashes, offline cracking, SSH reuse, escalation via sudo misconfiguration
  • Open Redirect → OAuth Token Theft → Account Takeover → Admin RCE: redirect to capture OAuth token, admin panel access, RCE via unrestricted upload
  • XXE → SSRF → Cloud Metadata → IAM PrivEsc: XXE for internal SSRF, metadata service access, IAM role escalation
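The formal definition can be checked mechanically. The sketch below encodes the first chain in the list (SSRF → Internal Service Discovery → RCE) with hypothetical fact names for preconditions and postconditions, and tests the two conditions: each action's preconditions are contained in the previous action's postconditions, and the last action's postconditions cover the goal.

```python
from typing import FrozenSet, List, NamedTuple

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]   # facts required before the action
    post: FrozenSet[str]  # facts established by the action

def is_chain(actions: List[Action], goal: FrozenSet[str]) -> bool:
    """True if pre(a_i) is a subset of post(a_{i-1}) for i > 1 and post(a_n) covers the goal."""
    if not actions:
        return False
    for prev, cur in zip(actions, actions[1:]):
        if not cur.pre <= prev.post:
            return False
    return goal <= actions[-1].post

# Hypothetical fact labels, for illustration only.
ssrf = Action("SSRF", frozenset(), frozenset({"ssrf_primitive"}))
discovery = Action("Internal Service Discovery",
                   frozenset({"ssrf_primitive"}),
                   frozenset({"internal_service_known"}))
rce = Action("RCE", frozenset({"internal_service_known"}), frozenset({"rce"}))

GOAL = frozenset({"rce"})
valid = is_chain([ssrf, discovery, rce], GOAL)
reordered = is_chain([discovery, ssrf, rce], GOAL)
truncated = is_chain([ssrf, discovery], GOAL)
```

The reordered sequence fails because the SSRF step does not establish what the RCE step requires, and the truncated sequence fails because it never reaches the goal.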

The emergence of these chains in the stigmergic model occurs without prescription: the reconnaissance agent deposits an SSRF finding on the blackboard; the internal exploitation agent — whose trigger predicate checks for confirmed SSRFs — wakes and uses the SSRF as a transport primitive to scan the internal network. Internal findings are deposited back on the blackboard, activating new specialized agents.

VI. Formal Stopping Criteria

A. Attack Graph Coverage

The agent maintains a real-time attack graph $G = (V, E)$ where nodes represent hosts, services, and credentials, and edges represent possible actions. The agent tracks three sets: discovered nodes $V_d$, attempted edges $E_t$, and pending edges $E_p$. The structural coverage metric is:

$$C_{\text{struct}} = \frac{|E_t|}{|E_t| + |E_p|}$$

When $C_{\text{struct}}$ exceeds a configurable threshold (typically 0.90), the structural criterion is satisfied. However, pure coverage has limitations: an agent may have 100% coverage on obvious paths while missing non-trivial lateral movement.
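A minimal sketch of the coverage check follows, representing edges as (source, action, target) tuples. The edge contents and the choice to report zero coverage for an empty graph are assumptions.

```python
def structural_coverage(attempted, pending):
    """C_struct = |E_t| / (|E_t| + |E_p|); defined as 0.0 when no edges are known (assumption)."""
    total = len(attempted) + len(pending)
    return len(attempted) / total if total else 0.0

def structural_criterion(attempted, pending, threshold=0.90):
    return structural_coverage(attempted, pending) > threshold

# Hypothetical edges: 19 attempted, 1 still pending.
ATTEMPTED = {("web01", f"action-{i}", "db01") for i in range(19)}
PENDING = {("db01", "ssh-reuse", "app02")}
coverage = structural_coverage(ATTEMPTED, PENDING)
```

Newly discovered nodes add edges to the pending set, so coverage can fall again after a successful pivot; the criterion is only meaningful once discovery has slowed.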

B. Information Gain and Decreasing Entropy

To capture the epistemic dimension of exploration, we measure the rate of new information discovery per action. At each window of N actions, the agent calculates the information gain:

$$IG(w_k) = H(M_{k-1}) - H(M_k)$$

Where $H(M_k)$ is the entropy of the environment model at window $k$. The epistemic criterion is satisfied when the normalized information gain stays below a threshold $\varepsilon$ for every window in the range $[K - \delta, K]$:

$$\frac{IG(w_k)}{H(M_0)} < \varepsilon, \quad \forall k \in [K - \delta, K]$$

This indicates that exploration is no longer significantly reducing the agent's uncertainty about the environment.
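The epistemic criterion can be sketched as below, assuming the environment model is summarized by a discrete probability distribution so that Shannon entropy applies. The source does not specify how the model's entropy is computed, and the entropy trace, ε, and δ used here are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def epistemic_stop(entropies, eps, delta):
    """entropies[k] = H(M_k) for k = 0..K.

    True when IG(w_k) / H(M_0) < eps for every k in [K - delta, K].
    """
    K = len(entropies) - 1
    if entropies[0] <= 0 or K - delta < 1:
        return False  # not enough windows, or nothing to normalize by
    for k in range(K - delta, K + 1):
        gain = entropies[k - 1] - entropies[k]
        if gain / entropies[0] >= eps:
            return False
    return True

# Hypothetical entropy trace: large gains early, then a plateau.
TRACE = [4.0, 3.0, 2.2, 2.15, 2.12, 2.10]
stop_late = epistemic_stop(TRACE, eps=0.02, delta=2)
stop_early = epistemic_stop(TRACE[:4], eps=0.02, delta=2)
```

On the full trace the last three windows each reduce entropy by roughly 1% of the initial value or less, so the criterion fires; on the truncated trace the early gains of 20% or more keep it open.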

C. Diminishing Returns in RL

The most elegant criterion connects directly to the RL reward design. We monitor the rate of change in accumulated reward:

$$\Delta R(t) = R(t) - R(t - \Delta t), \qquad \frac{dR}{dt} \approx \frac{\Delta R(t)}{\Delta t} \to 0$$

When $dR/dt$ remains below $\varepsilon_R$ for a sustained period, the agent's policy has saturated: additional actions generate no significant marginal value.
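A finite-difference version of this check might look like the following sketch. The reward trace, the window Δt, the threshold, and the `patience` parameter (how many consecutive steps count as "sustained") are assumptions.

```python
def reward_saturated(cumulative, dt, eps_r, patience):
    """cumulative[t] = R(t).

    True when (R(t) - R(t - dt)) / dt < eps_r for each of the last `patience` steps.
    """
    if len(cumulative) < dt + patience:
        return False  # not enough history to evaluate the window
    last = range(len(cumulative) - patience, len(cumulative))
    return all((cumulative[t] - cumulative[t - dt]) / dt < eps_r for t in last)

# Hypothetical cumulative reward: rapid gains, then a plateau.
REWARDS = [0.0, 5.0, 9.0, 12.0, 12.5, 12.6, 12.6, 12.6, 12.6]
saturated = reward_saturated(REWARDS, dt=2, eps_r=0.2, patience=3)
still_learning = reward_saturated(REWARDS[:6], dt=2, eps_r=0.2, patience=3)
```

Requiring several consecutive low-slope steps guards against stopping during a short lull between two productive phases of an engagement.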

D. Hard and Soft Goals

YAGA operates with two types of objective. Hard goals are explicit scope objectives (e.g., "obtain domain admin", "exfiltrate data X", "reach subnet Y"). If all hard goals have been achieved or declared infeasible with evidence, this criterion is satisfied. Soft goals represent vulnerability category coverage (OWASP Top 10, MITRE ATT&CK tactics). When coverage of applicable tactics exceeds a configurable threshold (default 85%), the soft goal is satisfied.
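The two goal types reduce to simple checks, sketched here under assumed data shapes: hard goals as a mapping from description to status, and soft-goal coverage as set overlap between covered and applicable tactics. The tactic labels are placeholders, and the requirement that infeasibility be backed by evidence is not modeled.

```python
RESOLVED = {"achieved", "infeasible"}

def hard_goals_satisfied(goals):
    """goals maps a goal description to 'achieved', 'infeasible', or 'open'."""
    return all(status in RESOLVED for status in goals.values())

def soft_goal_satisfied(covered, applicable, threshold=0.85):
    """True when the share of applicable tactics that were covered exceeds the threshold."""
    if not applicable:
        return True  # nothing applicable to cover (assumption)
    return len(covered & applicable) / len(applicable) > threshold

# Hypothetical engagement state.
HARD = {"obtain domain admin": "achieved", "reach subnet Y": "infeasible"}
APPLICABLE = {f"tactic-{i}" for i in range(10)}
COVERED_9 = {f"tactic-{i}" for i in range(9)}
COVERED_8 = {f"tactic-{i}" for i in range(8)}
```

With ten applicable tactics, covering nine (90%) clears the default 85% threshold and covering eight (80%) does not.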

E. Budget Constraints

Every real pentest operation has operational limits. The framework implements configurable hard limits: maximum total time (wall-clock), maximum number of actions/requests (prevents detection and abuse), maximum exploration depth (hops from entry point), and adaptive rate limiting — if the target begins throttling responses, the agent automatically backs off following exponential backoff with jitter.
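The hard limits and the back-off behavior can be sketched as follows. The limit values are hypothetical, and the "full jitter" variant shown (a uniform draw between zero and the capped exponential ceiling) is one common form of exponential backoff with jitter, not necessarily the one YAGA uses.

```python
import random
from typing import NamedTuple

class Budget(NamedTuple):
    max_seconds: float  # wall-clock limit
    max_actions: int    # total actions/requests
    max_depth: int      # hops from the entry point

    def exhausted(self, elapsed, actions, depth):
        """True as soon as any hard limit is reached."""
        return (elapsed >= self.max_seconds
                or actions >= self.max_actions
                or depth > self.max_depth)

def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)

# Hypothetical limits: one hour, 5000 requests, 4 hops.
budget = Budget(max_seconds=3600.0, max_actions=5000, max_depth=4)

# Seeded generator so the example is repeatable.
rng = random.Random(1)
delays = [backoff_delay(a, rng=rng) for a in range(10)]
```

The jitter spreads retries from parallel agents over the interval, so they do not all hit a throttling target at the same moment.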

VII. Experimental Evaluation

A. Benchmark Configuration

The benchmark comprises 124 scenarios distributed across three access categories: Black-Box (42 scenarios, no prior information beyond target IP/URL), Gray-Box (44 scenarios, low-privilege credentials or partial API documentation provided), and White-Box (38 scenarios, source code and/or access to internal configurations available). Of the 124 scenarios, 40 are classified as complex, requiring exploitation chains of 3 or more stages.

Complex scenarios include multi-tier enterprise environments with Active Directory, cloud-native applications with chainable misconfigurations, IoT infrastructures with vulnerable firmware, and containerized environments with escape chains. Each scenario has a defined objective (hard goal) and a list of planted vulnerabilities that serve as ground truth for recall evaluation.

B. Evaluated Models

Five models were evaluated as the cognitive backbone of the framework: Claude Opus 4.8 (frontier), Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-5.5 (OpenAI). Each model was evaluated with identical tool configurations, system prompts, and timeouts. Models were run 3 times per scenario, with the median reported.

TABLE I: Overall Performance by Model (Success Rate %)

Model             | Black-Box (n=42) | Gray-Box (n=44) | White-Box (n=38) | Chains (n=40) | Overall (n=124)
Claude Opus 4.8   | 87.4             | 93.8            | 96.1             | 91.2          | 92.3
GPT-5.5           | 83.1             | 89.7            | 94.2             | 87.8          | 88.7
Claude Opus 4.6   | 79.8             | 87.5            | 91.3             | 82.4          | 85.6
Claude Opus 4.7   | 80.2             | 86.1            | 91.8             | 81.9          | 85.2
Claude Sonnet 4.6 | 71.4             | 78.9            | 85.5             | 68.3          | 76.8

A notable result is that Claude Opus 4.6 marginally outperforms Opus 4.7 in the overall score (85.6% vs 85.2%), despite 4.7 being the more recent model. Detailed analysis reveals this inversion concentrates in gray-box scenarios: Opus 4.6 exhibits slightly superior uncertainty calibration in partial-information contexts, while 4.7 shows a bias toward high-confidence paths that are not always the most productive. In white-box scenarios, 4.7 recovers the marginal advantage (91.8% vs 91.3%), suggesting the difference lies in strategy under uncertainty, not raw reasoning capability.

Fig. 1: Success Rate by Model and Category (%). [Chart not reproduced. Categories: Black-Box, Gray-Box, White-Box, Chains. Series: Opus 4.8, GPT-5.5, Opus 4.6, Opus 4.7, Sonnet 4.6. Axis range: 0 to 100.]

TABLE II: Attack Chain Performance by Complexity

Model             | 3 Stages (n=18) | 4 Stages (n=12) | 5+ Stages (n=10) | Avg. Time (min) | Avg. Actions
Claude Opus 4.8   | 95.2%           | 89.6%           | 85.0%            | 18.3            | 142
GPT-5.5           | 92.1%           | 86.4%           | 80.0%            | 22.7            | 168
Claude Opus 4.6   | 88.7%           | 79.2%           | 72.0%            | 24.1            | 187
Claude Opus 4.7   | 87.9%           | 78.8%           | 73.0%            | 23.5            | 179
Claude Sonnet 4.6 | 76.3%           | 62.5%           | 56.0%            | 31.2            | 234

Table II reveals a clear correlation between the model's multi-step reasoning capability and performance on long chains. Claude Opus 4.8 maintains 85% success even on 5+ stage chains, while Sonnet 4.6 drops to 56%. GPT-5.5 shows competitive performance on 3-stage chains (92.1%) but diverges significantly on longer chains (80.0% on 5+ stages), suggesting differences in the ability to maintain adversarial context across extended exploitation sequences.

TABLE III: Detection Rate by Vulnerability Category (%)

Category         | Opus 4.8 | GPT-5.5 | Opus 4.6 | Opus 4.7 | Sonnet 4.6
RCE              | 94.3     | 90.1    | 86.7     | 85.9     | 74.2
SSRF             | 91.7     | 87.5    | 83.3     | 84.1     | 70.8
SQL Injection    | 96.8     | 93.5    | 91.9     | 90.3     | 85.5
Priv. Escalation | 88.2     | 84.7    | 79.4     | 78.8     | 64.7
Auth Bypass      | 93.5     | 89.1    | 85.9     | 86.7     | 76.6
XXE/SSRF Chain   | 87.5     | 82.3    | 76.0     | 75.5     | 58.3
Container Escape | 82.4     | 76.5    | 70.6     | 71.8     | 52.9
AD Exploitation  | 85.7     | 80.0    | 74.3     | 73.5     | 60.0

TABLE IV: Ablation Study — Impact of Architectural Components (Opus 4.8)

Configuration                            | Overall Success % | Chains % | Avg. Time (min) | Avg. Actions
YAGA Full                                | 92.3              | 91.2     | 18.3            | 142
No Stigmergy (central orchestrator)      | 84.7              | 76.5     | 26.1            | 203
No RAG (no playbooks)                    | 81.2              | 71.8     | 29.4            | 228
No Curiosity (no RND)                    | 86.1              | 78.3     | 22.7            | 176
No Deduplication (scatter without dedup) | 88.5              | 84.2     | 20.1            | 156
Single Agent (no multi-agent)            | 72.6              | 58.5     | 38.2            | 312

The ablation study indicates that each component contributes measurably to performance. Removing stigmergy and replacing it with a prescriptive central orchestrator cuts chain success from 91.2% to 76.5%, a much larger drop than on overall success (92.3% → 84.7%), which is consistent with chain emergence being facilitated by decentralized coordination. Removing RAG degrades chains further (71.8%) and raises average time from 18.3 to 29.4 min, indicating that retrieved playbooks substantially accelerate exploitation. The largest degradation comes from collapsing to a single agent (72.6% overall, 58.5% on chains).

TABLE V: Comparison with Traditional Tools (Black-Box Scenarios)

Tool              | Vulns Detected | Chains Identified | False Positives | Avg. Time
YAGA (Opus 4.8)   | 94.3%          | 91.2%             | 3.2%            | 18.3 min
YAGA (GPT-5.5)    | 90.1%          | 87.8%             | 4.1%            | 22.7 min
Metasploit + Nmap | 67.8%          | 12.3%             | 8.7%            | 45+ min*
Burp Suite Pro    | 72.1%          | 8.5%              | 11.2%           | 60+ min*
Nuclei Templates  | 78.4%          | 5.2%              | 6.8%            | 15.2 min

* Traditional tool times exclude manual configuration and human analysis.

VIII. Analysis and Discussion

A. Opus 4.6 vs 4.7 Anomaly

The marginal superiority of Claude Opus 4.6 over 4.7 warrants detailed analysis. Investigation of execution traces reveals that Opus 4.6 exhibits more diversified exploration behavior in gray-box scenarios: when given partial information (e.g., low-privilege credentials), 4.6 tends to use that information as a pivot point for lateral exploration, while 4.7 shows a stronger tendency to explore vertically (direct privilege escalation). In environments where direct escalation is not viable but lateral movement reveals alternative paths, this strategy difference favors 4.6.

Quantitatively, Opus 4.6 attempted 23% more unique actions in gray-box scenarios compared to 4.7, suggesting less tendency to get stuck in retry loops on failed paths. This behavior is consistent with a more conservative uncertainty calibration that leads the model to abandon unproductive paths more quickly.

B. GPT-5.5 Advantage in Specific Scenarios

GPT-5.5, despite trailing Opus 4.8 in every aggregate category, shows notable advantages on specific subsets. In scenarios involving analysis of simulated and complex JavaScript code, GPT-5.5 equals or surpasses Opus 4.8. In SQL injection scenarios on less common databases (e.g., CockroachDB, YugabyteDB), GPT-5.5 demonstrates superior familiarity with non-mainstream SQL dialects, achieving 95.2% vs Opus 4.8's 93.1% on that specific subset.

C. Stigmergy Efficiency

Stigmergic coordination demonstrates two fundamental advantages over the central orchestrator: (1) fault resilience — when an agent fails or gets stuck, the others continue operating independently, and pheromone decay naturally deprioritizes the failed agent's path; (2) unanticipated chain emergence — 17% of successful attack chains in the benchmark were not in the RAG playbooks and emerged purely from indirect agent interaction via the blackboard.

IX. Limitations and Future Work

The current benchmark, though extensive, operates in controlled environments that do not fully capture the complexity of real enterprise networks with thousands of hosts. The false positive rate, though low (3.2% for Opus 4.8), requires human validation in production contexts. The computational cost of frontier models, particularly Opus 4.8, limits applicability in engagements with significant budget constraints.

Future work includes: benchmark expansion to environments with active defenses (IDS/IPS, WAF, EDR), incorporation of transfer learning across engagements, and evaluation of open-source models as lower-cost alternatives for non-critical exploitation phases.

X. Conclusion

This work presented a systematic evaluation of LLMs for autonomous pentesting, demonstrating that combining frontier models with stigmergic coordination architectures produces results that significantly outperform both traditional tools and central orchestration architectures. Claude Opus 4.8 establishes the state of the art with a 92.3% overall success rate and 91.2% on complex chains.

The anomaly between Opus 4.6 and 4.7 illustrates that general model capability metrics do not perfectly predict performance on adversarial reasoning tasks, where uncertainty calibration may be more important than raw capability. The most significant contribution is the demonstration that attack chains can emerge organically from independent agents coordinating via shared artifacts, without the need for central prescription — a result with implications for both offensive security and multi-agent system design in general.

References

[1] G. Deng et al., "PentestGPT: An LLM-empowered Automatic Penetration Testing Tool," arXiv:2308.06782, 2023.
[2] J. Happe et al., "ReaperAI: An Autonomous Agent Framework for Automated Penetration Testing," IEEE S&P Workshop, 2024.
[3] R. Fang et al., "LLM Agents Can Autonomously Hack Websites," arXiv:2402.06664, 2024.
[4] E. Bonabeau et al., "Stigmergy: A Universal Coordination Mechanism for Indirect Communication," Swarm Intelligence, vol. 13, 1999.
[5] J. Xu et al., "AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks," arXiv:2403.01038, 2024.
[6] J. Schulman et al., "Proximal Policy Optimization Algorithms," arXiv:1707.06347, 2017.
[7] Y. Burda et al., "Exploration by Random Network Distillation," ICLR, 2019.
[8] A. Ridley, "Machine Learning for Autonomous Cyber Operations: A Survey," J. of Autonomous Intelligence, 2024.
[9] M. Schwartz et al., "Autonomous Penetration Testing using Reinforcement Learning," USENIX Security, 2025.
[10] MITRE, "ATT&CK Framework v15," https://attack.mitre.org/, 2025.
[11] S. Zhou et al., "Language Agent Tree Search," NeurIPS, 2024.
[12] T. Gallagher et al., "Coverage-Guided Autonomous Penetration Testing," ACM CCS, 2025.