AI Can Write Code — But Who Verifies It?
AI tools now generate 30–50% of code at leading engineering organizations. Yet only 12% of those organizations apply the same security standards to AI-generated code as to human-written code. The industry is shipping faster than it is verifying — and the consequences are already showing up in production.
Key Takeaways
- AI-generated code contains 2.74x more security vulnerabilities than human-written equivalents, per Veracode's 2025 study spanning 100+ LLMs.
- Only 12% of organizations apply the same security standards to AI code as traditional code, despite 93% using it.
- The "responsibility gap" is structural: founders assume AI verified it, AI has no accountability, developers trust the output, nobody checks.
- Verification is not the same as testing. Tests check behavior on observed inputs. Verification checks compliance with the specification.
- Real incidents — Quittr (600K+ users exposed), Base44 (platform-level SSO bypass), Moltbook (1.5M API tokens leaked) — all share one root cause: AI wrote the code, no human reviewed it seriously.
- The emerging answer is human-in-the-loop review: a qualified person who reads AI-generated diffs before they ship, not after they break.
1. The Trust Gap: AI Ships Faster Than Humans Can Verify
There is a speed asymmetry at the heart of modern software development. AI writes code in seconds. Meaningful verification takes hours. And as AI tools become more capable — generating not just single functions but entire features, entire backends, entire apps in a weekend — that asymmetry grows.
GitHub Copilot now generates over 50% of code in enabled repositories. Collins English Dictionary named "vibe coding" its Word of the Year for 2025. At Google and Microsoft, 25–30% of all code is already AI-generated. The question was never whether AI could write code. The question was always what happens to that code afterward.
The trust gap is not just about security. It is about the gap between "it works in the demo" and "it behaves correctly in production, for every user, under every condition." AI models are optimized to produce plausible output quickly, not to ask the critical questions — "who should be allowed to access this endpoint?", "what happens if this field is empty?", "does this auth check actually get enforced, or just checked?" — that a thoughtful developer would ask.
"AI has made code generation nearly effortless, but it has created a critical trust gap between output and deployment." — Tariq Shaukat, CEO, Sonar
The uncomfortable truth is that the software industry has solved this problem before. We moved from waterfall sign-offs to CI pipelines. We went from manual QA to automated test suites. Each transition required new tooling and new habits. What we have not yet built — at scale — is a verification layer that matches the pace of AI generation. And we are already paying for the gap.
2. Who Is Actually Responsible? (The Answer Is Nobody, Right Now)
Ask a founder who built their MVP with Cursor or Lovable who verified the security. Most will say something like: "the AI checks for issues." Ask the AI tool. It will tell you, accurately, that it produces code suggestions and that the developer is responsible for review. Ask the developer. They will say they reviewed it — meaning they read it, found no obvious syntax errors, and merged it.
None of these answers is wrong, exactly. Together they add up to nobody actually being responsible.
| Stakeholder | Who they think is responsible | Who is actually responsible | What actually happens |
|---|---|---|---|
| Founder / product owner | "The AI tool / Cursor / Lovable" | The organization shipping the product | Assumes AI handled it, ships without review |
| AI coding tool | "The developer using the tool" | Nobody enforceable (tool vendors carry no legal liability) | Generates plausible code, disclaims responsibility |
| Developer / vibe coder | "I reviewed it" (scanned the diff) | The developer who merges the code | Shallow review — "looks fine," merge |
| Security / QA team | "Dev should have caught it" | Shared responsibility with dev | Often not consulted for AI-generated features |
| Platform (Firebase, Supabase, etc.) | "Developer configures permissions" | Platform has default-open configs | AI often generates code with default admin-level access |
The OpenSSF — the Open Source Security Foundation — has been explicit about this: "You are the developer, and AI is the assistant. You are responsible for any harm caused by the code." That principle is correct, but it has not been operationalized. Having a policy is not the same as having a process.
What this creates, at the system level, is a situation where speed is measured (cycle time, deploys per day, features shipped) and verification is not. Organizations that don't measure verification capacity will default to treating it as free — which means it doesn't happen.
3. What Verification Actually Means (Not Just Tests)
Here is the most important misunderstanding in this entire conversation: running tests is not the same as verifying code.
Tests check that code does what you observed it doing on a set of inputs you thought of. If the AI also wrote the tests — which it often does — you have circular validation. The code and the tests were generated from the same model, with the same blind spots, optimizing for the same objective: producing something that looks right.
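To make the circularity concrete, here is a minimal hypothetical sketch (the function, inputs, and test are all invented for illustration). The generated code and the generated test come from the same framing, so they encode the same assumption, and the green checkmark verifies nothing beyond that shared assumption.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical AI-generated function: turn a title into a URL slug.
export function slugify(title: string): string {
  return title.toLowerCase().trim().replace(/\s+/g, "-");
}

// Hypothetical AI-generated test: exercises exactly the case the
// generator imagined, and nothing else.
assert.equal(slugify("Hello World"), "hello-world"); // passes

// Questions neither artifact asks: what should happen for "", "   ",
// "Crème Brûlée", or a 10,000-character title? The code and its test
// share one blind spot because they share one author.
```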
Real verification has four distinct layers that tests alone cannot cover:
- Spec compliance — Does the code actually implement what was specified? Not just "does it seem to work" but "does it match the product requirements including edge cases the spec defined?"
- Security review — Are auth flows correct? Are permission scopes restricted to what is necessary? Are secrets handled safely? Are all inputs validated? Does the code follow OWASP Top 10 patterns?
- Architectural review — Does the structure align with the system's design? Are there introduced dependencies that create supply chain risk? Does the code create technical debt that will compound?
- Behavioral testing under adversarial conditions — What happens when a malicious user tries to access data they shouldn't? When required fields are missing? When the service is called with unexpected inputs?
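Of the four layers, the last is the easiest to start practicing immediately, because it can be expressed as code. Here is a minimal sketch, assuming a hypothetical REST API with bearer-token auth and two seeded test users (every name and endpoint is invented): the test takes the adversary's point of view instead of the feature's.

```typescript
// Adversarial behavioral test against a hypothetical local API instance.
// Instead of asking "does the feature work?", it asks "what does the
// system do for a caller who should be refused?"
import { strict as assert } from "node:assert";

const BASE = "http://localhost:3000"; // hypothetical local instance

async function main() {
  // Sessions for two different seeded users.
  const alice = { Authorization: "Bearer token-for-alice" };
  const mallory = { Authorization: "Bearer token-for-mallory" };

  // Alice creates an order she owns.
  const created = await fetch(`${BASE}/orders`, {
    method: "POST",
    headers: { ...alice, "Content-Type": "application/json" },
    body: JSON.stringify({ item: "book" }),
  });
  const { id } = await created.json();

  // The adversarial cases a demo never exercises:
  const asMallory = await fetch(`${BASE}/orders/${id}`, { headers: mallory });
  assert.equal(asMallory.status, 403); // other users must be refused

  const anonymous = await fetch(`${BASE}/orders/${id}`);
  assert.equal(anonymous.status, 401); // unauthenticated callers too

  const missingField = await fetch(`${BASE}/orders`, {
    method: "POST",
    headers: { ...alice, "Content-Type": "application/json" },
    body: JSON.stringify({}), // required field absent
  });
  assert.equal(missingField.status, 400); // rejected, not a 500 or silent insert
}

main();
```

This is not a substitute for the other three layers; it only makes one class of adversarial question cheap to re-ask on every push.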
Researchers such as Leonardo de Moura (Microsoft Research) have argued that the only true solution at scale is formal verification — mathematical proofs of correctness that cover every possible input rather than a sample. AWS has applied this to its Cedar authorization engine; Microsoft to its SymCrypt cryptographic library. Formal verification is real, and it is being adopted at the infrastructure level. But for the vast majority of applications shipping today with AI-generated code, formal proofs are not yet practical.
In the interim, there is a more accessible approach: a qualified human who reads the diffs before they ship.
4. Real-World Consequences When Nobody Verifies
The incidents are not hypothetical. They are already happening, and they share a recognizable pattern: fast build, no review, breach.
Incident: Quittr — 600,000+ Users Exposed (2025)
Quittr, a viral app helping users quit porn addiction, reached 1.5 million downloads and approximately $500,000/month in revenue. A security researcher discovered that a Firebase misconfiguration allowed any authenticated user to access the backend database — exposing usage metrics, behavioral data, and records for approximately 100,000 minors. The researcher notified the founders on September 10, 2025. Four months later, the vulnerability was still not patched. When 404 Media contacted co-founder Alex Slater, he wished them a good day and hung up. The vulnerability was eventually fixed only after the story was published.
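Quittr's code was never published, so the sketch below illustrates the vulnerability class rather than the app: authentication without authorization, written as a hypothetical Express + Firebase Admin endpoint. A scanner sees a verified token and moves on; a human reviewer asks who is allowed to call this.

```typescript
// Sketch of the vulnerability class (authentication without authorization).
// Endpoint, collection, and field names are hypothetical.
import express from "express";
import { initializeApp } from "firebase-admin/app";
import { getAuth } from "firebase-admin/auth";
import { getFirestore } from "firebase-admin/firestore";

initializeApp();
const app = express();

app.get("/api/users/:id/activity", async (req, res) => {
  const token = (req.headers.authorization ?? "").replace(/^Bearer /, "");

  // Step 1: authentication -- proves the caller is *some* valid user.
  const caller = await getAuth().verifyIdToken(token).catch(() => null);
  if (!caller) return res.status(401).json({ error: "unauthenticated" });

  // Step 2: authorization -- the check the vulnerable pattern omits.
  // Without it, any signed-up user can read any other user's records.
  if (caller.uid !== req.params.id) {
    return res.status(403).json({ error: "forbidden" });
  }

  const doc = await getFirestore()
    .collection("activity")
    .doc(req.params.id)
    .get();
  return res.json(doc.data() ?? {});
});

app.listen(3000);
```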
Incident: Base44 — Platform-Level SSO Bypass (2025)
The no-code platform Base44 contained a flaw that allowed unauthorized users to bypass access controls and register for private applications. This was not a per-app issue — it was a platform-level vulnerability, meaning every app built on Base44 was exposed regardless of how carefully any individual developer reviewed their own code. Base44 patched within 24 hours, but the exposure window was real and extended to every app on the platform.
Incident: Moltbook — 1.5 Million API Tokens Leaked (2025)
Moltbook, an AI-agent social network, exposed 1.5 million API authentication tokens, 35,000+ email addresses, and 4,060 private messages. The root cause: a public API key in the client-side bundle, combined with disabled Row Level Security on the database. The founder's public statement: "I didn't write a single line of code for Moltbook. I just had a vision..." This is the vibe coding trust gap expressed perfectly. Having a vision is not the same as having a secure product.
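The mechanics are worth spelling out, because "public API key" sounds harmless. Here is a sketch assuming a Supabase-style stack (the project URL, key, and table names are invented): the anon key is meant to be public, but only under the assumption that Row Level Security constrains what it can reach.

```typescript
// Why a "public" API key plus disabled Row Level Security is fatal.
import { createClient } from "@supabase/supabase-js";

// An anon key is designed to ship in the client bundle. Its safety rests
// entirely on RLS policies restricting what each request may touch.
const supabase = createClient(
  "https://example-project.supabase.co", // hypothetical project URL
  "public-anon-key-from-the-js-bundle"   // recoverable by anyone who views source
);

async function main() {
  // With RLS disabled on `messages`, this query -- runnable by any visitor
  // who extracts the key -- returns every user's private messages.
  const { data, error } = await supabase.from("messages").select("*");
  console.log(error ?? `rows exposed: ${data?.length}`);
}

main();

// The one-line fix lives in the database, not the client:
//   alter table messages enable row level security;
// plus a policy granting each user access to only their own rows.
```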
A broader study by Escape.tech reviewed 5,600 vibe-coded apps and found over 2,000 vulnerabilities, 400+ exposed secrets, and 175 instances of personally identifiable information exposed through public endpoints. These are not edge cases. They are the baseline when code ships without meaningful verification.
The pattern is consistent: AI-generated code does not fail randomly. It fails in patterns — missing input validation, credentials in client bundles, auth checks that exist in code but are never enforced at runtime. The patterns are predictable. That means they are also preventable, by someone who knows what to look for.
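The last pattern in that list defeats the intuition that "the auth code is there, so we are covered." A hypothetical Express sketch of a check that exists in the codebase but never runs for the route it was supposed to protect:

```typescript
// "Auth check exists in code but is never enforced": the middleware is
// present and looks correct, but registration order means the protected
// route is matched and answered before the check ever runs.
import express from "express";

const app = express();

// Registered first: Express matches this and sends the response immediately.
app.get("/admin/export", (_req, res) => {
  res.json({ export: "every user record" });
});

// Registered second: a scanner greps this and finds "authentication present".
// At runtime it never executes for GET /admin/export above.
app.use("/admin", (req, res, next) => {
  if (!req.headers.authorization) return res.status(401).end();
  next();
});

app.listen(3000);
```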
5. The Emerging Role: Human-in-the-Loop Reviewer
In any fast-moving system where automation can generate more output than humans can process, the critical design question is: where does the human checkpoint live, and who performs it?
For AI-generated code, the missing checkpoint is not an automated one. Automated scanners catch known patterns. They find the SQL injection that looks like the SQL injections they were trained on. They do not catch the permission model that looks correct to a static analyzer but allows a user to access another user's data at runtime. They do not notice that the spec said "users can only see their own orders" and the implementation has no such restriction. They do not flag the auth token that is stored correctly in the backend and insecurely in the client bundle.
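The orders example deserves to be shown, because it is the shape of miss that defines this role. A hypothetical Prisma-style sketch (the Order model and its userId field are invented): nothing is syntactically wrong and nothing matches a vulnerability signature; the defect is an absence.

```typescript
// What "spec says own orders only, code has no such restriction" looks
// like in a diff. Hypothetical Prisma schema assumed: an Order model
// with a userId column.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// As generated: compiles, passes the happy-path demo (the demo user sees
// their orders -- along with everyone else's, which nobody looked for).
export async function getOrders() {
  return prisma.order.findMany();
}

// As specified: the reviewer's job is noticing that this `where` clause
// was missing from the version above.
export async function getOrdersFor(userId: string) {
  return prisma.order.findMany({ where: { userId } });
}
```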
"68% of developers say they trust AI-assisted code reviews more than peer reviews for catching syntax and mechanical issues — but only 22% say they trust AI for architectural or design-level reviews." — Developer survey data, 2025
The human-in-the-loop reviewer is not a replacement for automated tools. It is the layer above them: the person who reads the diff with the spec in one hand and asks whether this code actually does what the product promised users it would do.
This role is not new. What is new is its importance. Before AI coding tools, developers were intimate with every line they wrote. They made hundreds of small security decisions consciously. With AI coding tools, those decisions are made invisibly — by a model optimizing for plausibility, not correctness. The human reviewer is the person who makes those invisible decisions visible again.
Researchers at Stanford and MIT analyzed this dynamic directly: "Traditional SDLC mechanisms — peer review, code author familiarity, incremental change management — function differently when a developer is primarily reviewing AI-generated output rather than reasoning through code they authored." The familiar checkpoints exist, but they are not doing the same work.
6. How to Close the Verification Gap
There is no single intervention that solves this. But there is a hierarchy of impact:
Move the checkpoint upstream, not downstream
The most expensive place to find a vulnerability is in production, after it has been exploited. The second most expensive is after shipping but before exploitation. Verification before merge is cheaper than both. Verification built into the specification — before any code is written — is cheapest of all, because it defines what correct behavior looks like before the AI generates anything.
Separate the writer from the verifier
If the same agent writes both the code and the tests, you have moved the problem, not solved it. The verifier needs independence from the generator. This is why QA teams traditionally do not report to engineering managers, and why human reviewers bring value that is structurally different from more AI tooling.
Define "done" as spec-compliant, not demo-passing
The "demo passes, production breaks" pattern is almost always a definition problem. The demo proved the happy path works. Production requires every path to work. Closing the gap means writing acceptance criteria before building, and verifying against them, not against impressions.
Track AI-specific quality metrics
Defect density from AI-generated code. Regression rates in AI-touched modules. Review load per reviewer. Merge confidence scores. These are the numbers that will tell you whether your verification is working. Cycle time tells you how fast you are shipping. It does not tell you what you are shipping.
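None of this requires heavyweight tooling to begin. Here is a minimal sketch of the first metric, with an invented schema; how modules get tagged as AI-generated (commit trailers, tool telemetry, PR labels) is a local implementation choice.

```typescript
// Hypothetical schema for comparing defect density across AI-touched
// modules and the rest of the codebase.
interface ModuleStats {
  module: string;
  aiGenerated: boolean;  // however your org tags provenance
  linesChanged: number;
  defectsFound: number;  // bugs traced back to this module post-merge
}

function defectDensity(stats: ModuleStats[], aiOnly: boolean): number {
  const subset = stats.filter((s) => s.aiGenerated === aiOnly);
  const defects = subset.reduce((n, s) => n + s.defectsFound, 0);
  const lines = subset.reduce((n, s) => n + s.linesChanged, 0);
  return lines === 0 ? 0 : (defects / lines) * 1000; // defects per KLOC changed
}
```

The point is the comparison, not the absolute number: if AI-touched modules run a materially higher defect density than the rest of the codebase, your verification layer is not keeping pace with your generation layer.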
Apply external review at security boundaries
Auth flows. Permission models. Data access patterns. Payment processing. Any code path where a mistake affects user data or financial outcomes. These are the paths where AI-generated code fails at the highest rate and where the consequences of failure are highest. External review at these boundaries — by someone who did not write the code and is not optimistic about it — is the highest-leverage intervention available today.
Vibers closes the verification gap
Human review for every push — spec compliance, security patterns, architectural soundness. Not bots. Not another scanner. A qualified person who reads your diffs before they ship.
Install Vibers on GitHub