Agentic pentesting challenges

If you manage a pentest team of five or more people, some of them are already using AI on engagements. You probably encouraged it (or perhaps not?).

The individual proofs of concept and initial vibe-coding looked good: faster finding descriptions, expanded remediation, severity suggestions that were mostly right. But you have also noticed that the output from two testers on the same engagement no longer reads like it came from the same firm.

We get into the four organizational problems that show up when AI agents enter pentest workflows, explain why none of them is solved by picking a better LLM, and point at the structural fix that the software engineering profession already figured out (fine, is in the process of figuring out).

Key Takeaways

  • AI tools on individual laptops with individual prompts amplify each tester's style rather than the team's standard. The inconsistency problem that existed before AI gets worse, not better.
  • Agents running in parallel on the same engagement cannot see each other's findings. The composite picture that matters to the client has to be assembled manually, which consumes the time AI was supposed to save.
  • A senior tester's prompt library, severity calibration, and context-setting represent years of judgment. When they leave, that judgment leaves with them, because none of it lives in the team's shared infrastructure.
  • Software engineering already went through this transition with GitHub Copilot. Individual productivity went up. Team-level consistency degraded until organizations enforced standards at the repository level. The fix was not a better model. It was a shared operational layer.
  • To adapt, teams that combine human and agent testing have to collaborate on a platform that makes AI contributions compound across the team rather than evaporate at the engagement boundary.

Does this apply to you?

This piece is for you if you run or manage a pentest team where multiple people are using AI tools, and you have started noticing output inconsistency, duplicated effort, or the sense that individual AI gains are not translating into team-level improvement. You do not need to be using any specific platform. The problems described here apply to any team where AI has entered the workflow without a shared operational layer underneath it.

Skip this if you are a solo practitioner. The context-sharing and knowledge-retention problems in this piece require at least two people contributing to the same engagement. Your challenges with AI in pentesting are real but different.

The developer parallel already happened

When GitHub Copilot went into general availability in 2022, the question organizations asked was whether it would replace developers. Four years later, the answer is clear: it did not. What changed is what developers do. The best engineers today are not typing code line by line. They are reviewing what the AI generated, making architectural decisions, and deciding what to inspect closely and what to trust. The coding got faster. The judgment got more important.

GitHub's own research with Accenture (2024) documented what happened at the team level: individual productivity went up, but code-style consistency across teams degraded unless the organization enforced standards at the repository level. The AI made each developer more productive in isolation. It did not make the team more consistent unless there was infrastructure to enforce that consistency.

Security testers are at the same inflection point now. The question is not whether AI will replace pentesters. It will not, for the same reasons Copilot did not replace developers. But the job will change. The testing will get faster. The judgment, synthesis, and quality assurance will get more important.

The transition is already happening. In April 2026, the Dradis team ran Claude as a security engineer on our own codebase - and we're not even real pentesters. Claude did reconnaissance, authentication testing, authorization testing, IDOR checks, API security assessment, and session management review. It found real vulnerabilities. Humans triaged, validated, and added context. That workflow, approximately, is what this profession looks like in 18 months. Most teams are not structurally ready for it.

Agents on individual laptops produce team-level inconsistency

Each tester configures their own AI tool. Their own prompts, their own model, their own context window. When tester A runs Claude with a custom system prompt and tester B runs GPT-5.5 with a different prompt, they produce different output for identical findings. Severity calibration varies. Remediation depth varies. Vocabulary varies.

This is the consistency problem that existed before AI, made worse by AI.

Before, a senior tester and a junior tester would produce different-quality output for the same finding. Now they produce different-quality output mediated through different AI configurations, each with its own "opinions" about what a critical finding looks like, how much remediation detail is appropriate, and what language to use.

The practical consequence: two testers on the same engagement produce reports that read as though they came from different firms.

  • Different severity for the same finding class.
  • Different remediation depth.
  • Different vocabulary.

A client receiving this report does not see "AI-assisted efficiency." They see a consultancy that cannot agree with itself on what a medium-severity finding looks like.

The GitHub Copilot research showed the same pattern in software engineering. Teams that let each developer use AI independently, with no shared standards enforced at the infrastructure level, saw style fragmentation across the codebase. The fix was not a better AI model. It was enforcing team conventions through the shared platform (the repository, CI/CD, linters) rather than relying on individual discipline.

For pentest teams, the equivalent is an operational platform that enforces your team's severity definitions, remediation standards, and reporting voice on all output, regardless of whether a human or an agent produced it. If you are already managing team consistency through standardized processes, AI agents are the next coordination challenge.
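
To make the linter analogy concrete, here is a minimal sketch of what "enforcing standards on all output" could look like for findings. Everything in it is an assumption for illustration: the `Finding` fields, the severity vocabulary, and the minimum remediation length are hypothetical and not taken from any specific platform.

```python
from dataclasses import dataclass

# Hypothetical team standard; the labels and threshold are illustrative only.
ALLOWED_SEVERITIES = ("Info", "Low", "Medium", "High", "Critical")
MIN_REMEDIATION_WORDS = 10


@dataclass
class Finding:
    title: str
    severity: str
    remediation: str
    author: str  # a tester's name or an agent identifier


def lint_finding(finding: Finding) -> list[str]:
    """Return the list of team-standard violations for one finding."""
    problems = []
    if finding.severity not in ALLOWED_SEVERITIES:
        problems.append(f"severity '{finding.severity}' is not in the team vocabulary")
    if len(finding.remediation.split()) < MIN_REMEDIATION_WORDS:
        problems.append("remediation is shorter than the team minimum")
    return problems


human = Finding(
    title="SQL injection on login form",
    severity="High",
    remediation="Use parameterized queries for the login lookup and add input validation on both fields.",
    author="tester-a",
)
agent = Finding(
    title="SQLi in /login",
    severity="P2",
    remediation="Sanitize inputs.",
    author="agent-b",
)

# The same check runs regardless of who, or what, wrote the finding.
for f in (human, agent):
    print(f.author, lint_finding(f) or "ok")
```

The point of the sketch is that the check lives in one shared place and does not care whether the author is a person or an agent.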

No shared context between agents mid-engagement

Agent A discovers a SQL injection on the login form. Agent B is testing the authentication flow separately. Agent B does not know the SQL injection exists, so the combined picture ("this login form is vulnerable to SQLi, the session tokens are in the URL, and the password reset flow has an IDOR") never gets synthesized unless a human manually connects the dots.

In a human team, synthesis happens through standups, shared notes, and the project board (or Dradis, of course).

For agents running in parallel, no equivalent exists unless the platform provides it. Each agent operates in its own context window with no visibility into what the other agents or testers have already found.

This creates two problems:

  1. Duplication: two agents independently discover and write up the same finding with different descriptions. Someone has to reconcile that before it goes to the client.

  2. Missed connections: vulnerabilities that are individually medium-severity but collectively represent a critical attack chain never get connected because no single agent has the full picture.

The findings management layer becomes the multi-agent coordination layer. If it does not exist, agents produce disconnected, overlapping output that a human still has to reconcile before it becomes a report. The time AI saved on individual finding descriptions gets consumed by the triage and deduplication work that follows.
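
As a rough illustration of the reconciliation work described above, the sketch below merges findings from two agents, collapses duplicates, and lists assets with several distinct finding classes so a human can review them as possible chains. The field names (`asset`, `vuln_class`, `source`) and the sample data are hypothetical, and grouping by asset is a deliberately crude stand-in for real chain analysis.

```python
from collections import defaultdict

# Hypothetical findings as two agents might report them; field names are assumptions.
agent_a = [
    {"asset": "app.example.com/login", "vuln_class": "sqli", "source": "agent-a"},
    {"asset": "app.example.com/login", "vuln_class": "session-token-in-url", "source": "agent-a"},
]
agent_b = [
    {"asset": "app.example.com/login", "vuln_class": "sqli", "source": "agent-b"},
    {"asset": "app.example.com/reset", "vuln_class": "idor", "source": "agent-b"},
]


def merge_findings(*batches):
    """Group findings by (asset, vuln_class) so duplicates collapse into one entry."""
    merged = defaultdict(list)
    for batch in batches:
        for finding in batch:
            merged[(finding["asset"], finding["vuln_class"])].append(finding["source"])
    return dict(merged)


def chain_candidates(merged, min_findings=2):
    """List assets with several distinct finding classes, for human review as a possible chain."""
    by_asset = defaultdict(set)
    for asset, vuln_class in merged:
        by_asset[asset].add(vuln_class)
    return {
        asset: sorted(classes)
        for asset, classes in by_asset.items()
        if len(classes) >= min_findings
    }


merged = merge_findings(agent_a, agent_b)
duplicates = {key: sources for key, sources in merged.items() if len(sources) > 1}
print("duplicates:", duplicates)
print("chain candidates:", chain_candidates(merged))
```

Neither function is clever. What matters is that both need every agent's findings in one place, which is exactly what isolated context windows do not provide.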

Organizational knowledge walks out the door

A senior pentester's AI agent is tuned to their knowledge. Their prompts encode their judgment about what severity means, what good remediation looks like, what level of detail is appropriate for this client type. Those prompts took months to refine. They represent the difference between "medium-severity SQL injection" and a description that a client's development team can act on.

When that senior tester leaves the firm, the prompts leave with them. They are on the tester's laptop, in their personal notes, maybe in a shared folder nobody else knows about.

Tacit knowledge, the kind that lives in practitioners' heads, is the hardest to retain and the most valuable to accumulate (we've covered this in our Compounding Expertise in Pentest Teams article).

AI agents running on individual laptops are tacit knowledge systems. Each engagement's worth of refined prompts, severity calibration, and context dies at the engagement boundary or at the employment boundary.

Compare this to how code works in a development organization. When a developer leaves, their code stays in the repository. Their architectural decisions are visible in the commit history. Their coding patterns are documented in the team's style guide. The repository is an organizational knowledge system, not a personal one.

Pentest teams need the same property for their AI-assisted workflow. The finding descriptions that worked well, the severity logic that clients responded to, the remediation depth that matched each client type: all of this needs to live in the team's shared operational infrastructure, not on individual laptops. This is the same argument that applies to making the most of your team's knowledge and experience in traditional engagements, scaled up by the fact that AI makes knowledge loss happen faster.

Two testers, one engagement, completely different output

This is the visible symptom of the three problems above. A practice manager reviews a report and finds that section A (written by tester one with their AI configuration) and section B (written by tester two with theirs) read like they were authored by different firms.

  • The severity ratings do not align: one tester's AI defaults to CVSSv3 base scores; the other's has been prompted to consider business context.
  • The remediation sections differ in depth: one provides implementation-level guidance, the other gives generic advice.
  • The language itself is different: one section uses passive voice throughout (the AI's default), the other is direct and active (because the tester's prompt said so).
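
The severity mismatch is easy to reproduce. The sketch below maps a base score to the CVSS v3.x qualitative rating bands, then applies a business-context adjustment. The adjustment rule is a hypothetical example of what one tester's prompt might encode; it is not part of CVSS, and the score used is illustrative.

```python
def cvss_v3_rating(base_score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= base_score <= 10.0:
        raise ValueError("CVSS base scores range from 0.0 to 10.0")
    if base_score == 0.0:
        return "None"
    if base_score < 4.0:
        return "Low"
    if base_score < 7.0:
        return "Medium"
    if base_score < 9.0:
        return "High"
    return "Critical"


LEVELS = ["None", "Low", "Medium", "High", "Critical"]


def contextual_rating(base_score: float, internet_facing: bool, holds_customer_data: bool) -> str:
    """Hypothetical rule: bump exposed assets holding customer data up one level."""
    rating = cvss_v3_rating(base_score)
    if rating != "None" and internet_facing and holds_customer_data:
        return LEVELS[min(LEVELS.index(rating) + 1, len(LEVELS) - 1)]
    return rating


score = 6.5  # illustrative base score for the same finding
print(cvss_v3_rating(score))                 # tester one's default
print(contextual_rating(score, True, True))  # tester two's prompted logic
```

Both answers are defensible. The problem is that they land in the same report without anyone having chosen between them.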

The client does not see four brilliant AI agents working in parallel. They see a report that looks like it was assembled from parts by people who never spoke to each other. For a consultancy, this is a credibility problem that no individual AI productivity gain can offset.

This is not a failure of the AI tools. Each agent did what it was asked to do. The failure is in the coordination layer underneath. Without a shared operational platform that enforces team standards on all contributions, whether from humans or agents, each AI agent reinvents those standards from scratch. The team's accumulated best practices are invisible to it.

The operational platform is the fix, not the model

The coordination layer for AI-generated code is the repository: the shared, versioned, team-controlled environment that gives AI output somewhere to land, somewhere to be validated against team standards, and somewhere to compound into the team's permanent knowledge. The AI is powerful; the platform is what makes it reliable across a team over time.

Pentest teams need the same layer. Not a new category of tool. The same operational platform that has always handled collaboration, findings management, and reporting for multi-person engagements. The role does not change when agents join the team: findings still need to land somewhere. Context still needs to be shared. Standards still need to be enforced. Knowledge still needs to accumulate so that project 50 benefits from projects 1 through 49.

It is the same security knowledge system as before. What changes is who contributes. A platform that handles human-to-human collaboration handles human-to-agent collaboration by the same mechanism: a shared, structured operational environment where every contribution is visible, normalized against the team's standards, and retained permanently.

The properties that matter for this layer are straightforward:

  • Self-hosted, so findings and agent context are not routed through, or leaked to, additional vendor infrastructure and third parties
  • Open-source or inspectable, so there is no trust-me in the data path between agents and your findings
  • Long track record, so the edge cases your team encounters are already solved rather than discovered in production
  • Built for multi-person teams, because the coordination problem only exists when more than one contributor (human or agent) works against the same engagement

Platforms that have been in continuous use since before AI pentesting was a conversation have a track record that new entrants cannot manufacture. They have already solved the collaboration, consistency, and knowledge-retention problems for human teams. Those are the same problems that reappear when agents join the team. The platform's relevance did not change. The team composition did.

Dradis has been that operational layer for pentest teams since 2007: open-source, self-hosted, and built for the team-level coordination problems this piece describes. If the diagnosis above matches what your team is experiencing, try the Community Edition and see whether the infrastructure you need is already built. It's free.

Practical next steps

  • Audit your current AI workflow. How many different AI configurations are active on your team right now? If the answer is "one per tester," you have the inconsistency problem described above. Map who is using what, with which prompts, and whether any of that configuration is shared.
  • Test for output consistency. Give two testers the same finding and have each run it through their AI workflow. Compare the output side by side. If severity, remediation depth, and language differ, the platform gap is real.
  • Centralize what you can. If your team has a shared findings library, approved severity definitions, or standardized remediation templates, those are your enforcement mechanisms. AI output normalized against shared standards is consistent. AI output generated from individual prompts is not.
  • Think about the bus factor. If your best tester left tomorrow, what would the team lose? If the answer includes "their prompt library" or "the way they have their AI configured," that knowledge is in the wrong place.
  • Watch the developer parallel. Engineering organizations that solved AI consistency did it through shared infrastructure, not better models. The same logic applies here. Follow what worked.
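
The consistency test from the list above can be roughed out in a few lines. This is a sketch under stated assumptions: the two write-ups are invented, the dictionary keys are hypothetical, and word count is only a crude proxy for remediation depth.

```python
from difflib import SequenceMatcher

# Hypothetical write-ups of the same finding from two testers' AI workflows.
output_a = {
    "severity": "High",
    "remediation": "Replace string concatenation in the login query with parameterized statements and add server-side input validation.",
}
output_b = {
    "severity": "Medium",
    "remediation": "Sanitize user input.",
}


def compare_outputs(a: dict, b: dict) -> dict:
    """Summarize how far two write-ups of the same finding diverge."""
    words_a = len(a["remediation"].split())
    words_b = len(b["remediation"].split())
    return {
        "severity_match": a["severity"] == b["severity"],
        # Shorter over longer remediation by word count; 1.0 means equal length.
        "depth_ratio": round(min(words_a, words_b) / max(words_a, words_b, 1), 2),
        # Rough character-level similarity between 0.0 and 1.0.
        "text_similarity": round(
            SequenceMatcher(None, a["remediation"], b["remediation"]).ratio(), 2
        ),
    }


report = compare_outputs(output_a, output_b)
print(report)
```

The numbers themselves matter less than running the comparison at all. If severity does not match and the depth ratio is low, you have the evidence you need for the conversation about shared standards.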

The operational layer:

  • Shadow AI in pentesting: what happens when your tester uses ChatGPT with client findings (the data exposure side of the same organizational problem)
  • Self-hosted AI for pentest reporting (the architectural answer to the data sovereignty question AI introduces)

Team management:

  • Making the most of your team's knowledge and experience (the knowledge-retention problem before AI entered the picture)
  • Penetration testing management (running consistent multi-person engagements)

Further reading:

  • GitHub Research: Quantifying Copilot's Impact with Accenture (the enterprise-level data on what happens when AI tools meet team coordination)

Frequently asked questions

Is the inconsistency problem specific to AI, or did it exist before?

It existed before. Two testers have always produced different output for the same finding. What AI does is amplify the variance, because each tester's AI configuration adds its own layer of "opinions" about severity, language, and remediation depth. Before AI, a style guide and peer review could keep output reasonably consistent. With AI, the variance happens faster and at higher volume, which makes it harder to catch in review and more visible to the client.

Can a shared prompt library solve the consistency problem?

It helps but does not solve it. A shared prompt library standardizes the instructions given to the AI, but it does not standardize the context the AI operates on. Two testers using the same prompt on different models, or on the same model with different context windows, will still produce different output. The prompt is one input. The findings context, the team's severity definitions, and the project-specific standards are the others. A prompt library addresses the first; an operational platform addresses all of them.
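
One way to see why the prompt alone is not enough: fingerprint every input that shapes the output and compare across testers. The sketch below is illustrative only. The function name, the model labels, and the sample definitions are hypothetical, and a real configuration would have more inputs than these three.

```python
import hashlib
import json


def config_fingerprint(model: str, system_prompt: str, severity_definitions: dict) -> str:
    """Hash every input that shapes the AI's output, not only the prompt."""
    payload = json.dumps(
        {
            "model": model,
            "system_prompt": system_prompt,
            "severity_definitions": severity_definitions,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


shared_prompt = "Write findings in the team's reporting voice."  # hypothetical shared prompt
team_definitions = {"High": "Direct compromise of data or accounts"}  # hypothetical

# Same prompt, same definitions, different model: the configurations still differ.
tester_a = config_fingerprint("model-x", shared_prompt, team_definitions)
tester_b = config_fingerprint("model-y", shared_prompt, team_definitions)

print(tester_a == tester_b)
```

Two testers sharing a prompt library but not a model or a set of definitions will produce different fingerprints, which is the gap the answer above describes.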

How does the developer parallel apply if pentest teams are much smaller than engineering teams?

The scale is different but the mechanism is the same. A 5-person pentest team has the same coordination problem as a 50-person engineering team using Copilot: each individual contributor is more productive, but without shared infrastructure enforcing team standards, the aggregate output fragments. The consistency threshold for pentest reports is arguably higher than for code, because the client reads the report directly. Code inconsistency is visible to developers. Report inconsistency is visible to the client.

What if my team is only two people? Do these problems still apply?

The context-sharing and consistency problems show up at two contributors or more, whether those contributors are humans, agents, or a mix. On a two-person engagement where both testers use AI independently, you will see the output-inconsistency problem in the combined report. The knowledge-retention problem is less acute at two people, because tacit knowledge transfers more easily in small teams. At five or more, it becomes structural.

When will AI agents replace pentesters?

The evidence from software engineering, where the transition is further along, suggests they will not. Four years after general availability, GitHub Copilot has not replaced developers. It changed what developers spend their time on. The same pattern is playing out in security testing: AI handles the generation and expansion work that humans used to do manually; humans handle the judgment, validation, and synthesis that AI cannot do reliably. The job changes. The role does not disappear. The teams that adapt their workflow and infrastructure for this shift will outperform those that do not.
