AI Can Write Code — But Who Verifies It?
AI tools now generate 30–50% of code at leading engineering organizations. Yet only 12% of those organizations apply the same security standards to AI-generated code as to human-written code. The industry is shipping faster than it is verifying — and the consequences are already showing up in production.
Key Takeaways
- AI-generated code contains 2.74x more security vulnerabilities than human-written equivalents, per Veracode's 2025 study spanning 100+ LLMs.
- Only 12% of organizations apply the same security standards to AI code as traditional code, despite 93% using it.
- The "responsibility gap" is structural: founders assume AI verified it, AI has no accountability, developers trust the output, nobody checks.
- Verification is not the same as testing. Tests check behavior on observed inputs. Verification checks compliance with the specification.
- Real incidents — Quittr (600K+ users exposed), Base44 (platform-level SSO bypass), Moltbook (1.5M API tokens leaked) — all share one root cause: AI wrote the code, no human reviewed it seriously.
- The emerging answer is human-in-the-loop review: a qualified person who reads AI-generated diffs before they ship, not after they break.
1. The Trust Gap: AI Ships Faster Than Humans Can Verify
There is a speed asymmetry at the heart of modern software development. AI writes code in seconds. Meaningful verification takes hours. And as AI tools become more capable — generating not just single functions but entire features, entire backends, entire apps in a weekend — that asymmetry grows.
GitHub Copilot now generates over 50% of code in enabled repositories. Collins English Dictionary named "vibe coding" its Word of the Year for 2025. At Google and Microsoft, 25–30% of all code is already AI-generated. The question was never whether AI could write code. The question was always what happens to that code afterward.
The trust gap is not just about security. It is about the gap between "it works in the demo" and "it behaves correctly in production, for every user, under every condition." AI models are optimized to produce plausible output quickly, not to ask the critical questions — "who should be allowed to access this endpoint?", "what happens if this field is empty?", "does this auth check actually get enforced, or just checked?" — that a thoughtful developer would ask.
"AI has made code generation nearly effortless, but it has created a critical trust gap between output and deployment." — Tariq Shaukat, CEO, Sonar
The uncomfortable truth is that the software industry has solved this problem before. We moved from waterfall sign-offs to CI pipelines. We went from manual QA to automated test suites. Each transition required new tooling and new habits. What we have not yet built — at scale — is a verification layer that matches the pace of AI generation. And we are already paying for the gap.
2. Who Is Actually Responsible? (The Answer Is Nobody, Right Now)
Ask a founder who built their MVP with Cursor or Lovable who verified the security. Most will say something like: "the AI checks for issues." Ask the AI tool. It will tell you, accurately, that it produces code suggestions and that the developer is responsible for review. Ask the developer. They will say they reviewed it — meaning they read it, found no obvious syntax errors, and merged it.
None of these answers is wrong, exactly. Together they add up to nobody actually being responsible.
| Stakeholder | Who they think is responsible | Who is actually responsible | What actually happens |
|---|---|---|---|
| Founder / product owner | "The AI tool / Cursor / Lovable" | The organization shipping the product | Assumes AI handled it, ships without review |
| AI coding tool | "The developer using the tool" | Nobody enforceable (tool vendors carry no legal liability) | Generates plausible code, disclaims responsibility |
| Developer / vibe coder | "I reviewed it" (scanned the diff) | The developer who merges the code | Shallow review — "looks fine," merge |
| Security / QA team | "Dev should have caught it" | Shared responsibility with dev | Often not consulted for AI-generated features |
| Platform (Firebase, Supabase, etc.) | "Developer configures permissions" | Platform has default-open configs | AI often generates code with default admin-level access |
The OpenSSF — the Open Source Security Foundation — has been explicit about this: "You are the developer, and AI is the assistant. You are responsible for any harm caused by the code." That principle is correct, but it has not been operationalized. Having a policy is not the same as having a process.
What this creates, at the system level, is a situation where speed is measured (cycle time, deploys per day, features shipped) and verification is not. Organizations that don't measure verification capacity will default to treating it as free — which means it doesn't happen.
3. What Verification Actually Means (Not Just Tests)
Here is the most important misunderstanding in this entire conversation: running tests is not the same as verifying code.
Tests check that code does what you observed it doing on a set of inputs you thought of. If the AI also wrote the tests — which it often does — you have circular validation. The code and the tests were generated from the same model, with the same blind spots, optimizing for the same objective: producing something that looks right.
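To make the circularity concrete, here is a minimal hypothetical sketch (the function, inputs, and test are all invented for illustration). The generated code and the generated test come from the same framing, so they encode the same assumption, and the green checkmark verifies nothing beyond that shared assumption.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical AI-generated function: turn a title into a URL slug.
export function slugify(title: string): string {
  return title.toLowerCase().trim().replace(/\s+/g, "-");
}

// Hypothetical AI-generated test: exercises exactly the case the
// generator imagined, and nothing else.
assert.equal(slugify("Hello World"), "hello-world"); // passes

// Questions neither artifact asks: what should happen for "", "   ",
// "Crème Brûlée", or a 10,000-character title? The code and its test
// share one blind spot because they share one author.
```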
Real verification has four distinct layers that tests alone cannot cover:
- Spec compliance — Does the code actually implement what was specified? Not just "does it seem to work" but "does it match the product requirements including edge cases the spec defined?"
- Security review — Are auth flows correct? Are permission scopes restricted to what is necessary? Are secrets handled safely? Are all inputs validated? Does the code follow OWASP Top 10 patterns?
- Architectural review — Does the structure align with the system's design? Are there introduced dependencies that create supply chain risk? Does the code create technical debt that will compound?
- Behavioral testing under adversarial conditions — What happens when a malicious user tries to access data they shouldn't? When required fields are missing? When the service is called with unexpected inputs?
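Of the four layers, the last is the easiest to start practicing immediately, because it can be expressed as code. Here is a minimal sketch, assuming a hypothetical REST API with bearer-token auth and two seeded test users (every name and endpoint is invented): the test takes the adversary's point of view instead of the feature's.

```typescript
// Adversarial behavioral test against a hypothetical local API instance.
// Instead of asking "does the feature work?", it asks "what does the
// system do for a caller who should be refused?"
import { strict as assert } from "node:assert";

const BASE = "http://localhost:3000"; // hypothetical local instance

async function main() {
  // Sessions for two different seeded users.
  const alice = { Authorization: "Bearer token-for-alice" };
  const mallory = { Authorization: "Bearer token-for-mallory" };

  // Alice creates an order she owns.
  const created = await fetch(`${BASE}/orders`, {
    method: "POST",
    headers: { ...alice, "Content-Type": "application/json" },
    body: JSON.stringify({ item: "book" }),
  });
  const { id } = await created.json();

  // The adversarial cases a demo never exercises:
  const asMallory = await fetch(`${BASE}/orders/${id}`, { headers: mallory });
  assert.equal(asMallory.status, 403); // other users must be refused

  const anonymous = await fetch(`${BASE}/orders/${id}`);
  assert.equal(anonymous.status, 401); // unauthenticated callers too

  const missingField = await fetch(`${BASE}/orders`, {
    method: "POST",
    headers: { ...alice, "Content-Type": "application/json" },
    body: JSON.stringify({}), // required field absent
  });
  assert.equal(missingField.status, 400); // rejected, not a 500 or silent insert
}

main();
```

This is not a substitute for the other three layers; it only makes one class of adversarial question cheap to re-ask on every push.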
Researchers such as Leonardo de Moura (Microsoft Research) have argued that the only true solution at scale is formal verification — mathematical proofs of correctness that cover every possible input rather than a sample. AWS has applied this to its Cedar authorization engine; Microsoft to its SymCrypt cryptographic library. Formal verification is real, and it is being adopted at the infrastructure level. But for the vast majority of applications shipping today with AI-generated code, formal proofs are not yet practical.
In the interim, there is a more accessible approach: a qualified human who reads the diffs before they ship.
4. Real-World Consequences When Nobody Verifies
The incidents are not hypothetical. They are already happening, and they share a recognizable pattern: fast build, no review, breach.
Incident: Quittr — 600,000+ Users Exposed (2025)
Quittr, a viral app helping users quit porn addiction, reached 1.5 million downloads and approximately $500,000/month in revenue. A security researcher discovered that a Firebase misconfiguration allowed any authenticated user to access the backend database — exposing usage metrics, behavioral data, and records for approximately 100,000 minors. The researcher notified the founders on September 10, 2025. Four months later, the vulnerability was still not patched. When 404 Media contacted co-founder Alex Slater, he wished them a good day and hung up. The vulnerability was eventually fixed only after the story was published.
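Quittr's code was never published, so the sketch below illustrates the vulnerability class rather than the app: authentication without authorization, written as a hypothetical Express + Firebase Admin endpoint. A scanner sees a verified token and moves on; a human reviewer asks who is allowed to call this.

```typescript
// Sketch of the vulnerability class (authentication without authorization).
// Endpoint, collection, and field names are hypothetical.
import express from "express";
import { initializeApp } from "firebase-admin/app";
import { getAuth } from "firebase-admin/auth";
import { getFirestore } from "firebase-admin/firestore";

initializeApp();
const app = express();

app.get("/api/users/:id/activity", async (req, res) => {
  const token = (req.headers.authorization ?? "").replace(/^Bearer /, "");

  // Step 1: authentication -- proves the caller is *some* valid user.
  const caller = await getAuth().verifyIdToken(token).catch(() => null);
  if (!caller) return res.status(401).json({ error: "unauthenticated" });

  // Step 2: authorization -- the check the vulnerable pattern omits.
  // Without it, any signed-up user can read any other user's records.
  if (caller.uid !== req.params.id) {
    return res.status(403).json({ error: "forbidden" });
  }

  const doc = await getFirestore()
    .collection("activity")
    .doc(req.params.id)
    .get();
  return res.json(doc.data() ?? {});
});

app.listen(3000);
```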
Incident: Base44 — Platform-Level SSO Bypass (2025)
The no-code platform Base44 contained a flaw that allowed unauthorized users to bypass access controls and register for private applications. This was not a per-app issue — it was a platform-level vulnerability, meaning every app built on Base44 was exposed regardless of how carefully any individual developer reviewed their own code. Base44 patched within 24 hours, but the exposure window was real and extended to every app on the platform.
Incident: Moltbook — 1.5 Million API Tokens Leaked (2025)
Moltbook, an AI-agent social network, exposed 1.5 million API authentication tokens, 35,000+ email addresses, and 4,060 private messages. The root cause: a public API key in the client-side bundle, combined with disabled Row Level Security on the database. The founder's public statement: "I didn't write a single line of code for Moltbook. I just had a vision..." This is the vibe coding trust gap expressed perfectly. Having a vision is not the same as having a secure product.
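The mechanics are worth spelling out, because "public API key" sounds harmless. Here is a sketch assuming a Supabase-style stack (the project URL, key, and table names are invented): the anon key is meant to be public, but only under the assumption that Row Level Security constrains what it can reach.

```typescript
// Why a "public" API key plus disabled Row Level Security is fatal.
import { createClient } from "@supabase/supabase-js";

// An anon key is designed to ship in the client bundle. Its safety rests
// entirely on RLS policies restricting what each request may touch.
const supabase = createClient(
  "https://example-project.supabase.co", // hypothetical project URL
  "public-anon-key-from-the-js-bundle"   // recoverable by anyone who views source
);

async function main() {
  // With RLS disabled on `messages`, this query -- runnable by any visitor
  // who extracts the key -- returns every user's private messages.
  const { data, error } = await supabase.from("messages").select("*");
  console.log(error ?? `rows exposed: ${data?.length}`);
}

main();

// The one-line fix lives in the database, not the client:
//   alter table messages enable row level security;
// plus a policy granting each user access to only their own rows.
```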
A broader study by Escape.tech reviewed 5,600 vibe-coded apps and found over 2,000 vulnerabilities, 400+ exposed secrets, and 175 instances of personally identifiable information exposed through public endpoints. These are not edge cases. They are the baseline when code ships without meaningful verification.
The pattern is consistent: AI-generated code does not fail randomly. It fails in patterns — missing input validation, credentials in client bundles, auth checks that exist in code but are never enforced at runtime. The patterns are predictable. That means they are also preventable, by someone who knows what to look for.
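The last pattern in that list defeats the intuition that "the auth code is there, so we are covered." A hypothetical Express sketch of a check that exists in the codebase but never runs for the route it was supposed to protect:

```typescript
// "Auth check exists in code but is never enforced": the middleware is
// present and looks correct, but registration order means the protected
// route is matched and answered before the check ever runs.
import express from "express";

const app = express();

// Registered first: Express matches this and sends the response immediately.
app.get("/admin/export", (_req, res) => {
  res.json({ export: "every user record" });
});

// Registered second: a scanner greps this and finds "authentication present".
// At runtime it never executes for GET /admin/export above.
app.use("/admin", (req, res, next) => {
  if (!req.headers.authorization) return res.status(401).end();
  next();
});

app.listen(3000);
```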
5. The Emerging Role: Human-in-the-Loop Reviewer
In any fast-moving system where automation can generate more output than humans can process, the critical design question is: where does the human checkpoint live, and who performs it?
For AI-generated code, the missing checkpoint is not an automated one. Automated scanners catch known patterns. They find the SQL injection that looks like the SQL injections they were trained on. They do not catch the permission model that looks correct to a static analyzer but allows a user to access another user's data at runtime. They do not notice that the spec said "users can only see their own orders" and the implementation has no such restriction. They do not flag the auth token that is stored correctly in the backend and insecurely in the client bundle.
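The orders example deserves to be shown, because it is the shape of miss that defines this role. A hypothetical Prisma-style sketch (the Order model and its userId field are invented): nothing is syntactically wrong and nothing matches a vulnerability signature; the defect is an absence.

```typescript
// What "spec says own orders only, code has no such restriction" looks
// like in a diff. Hypothetical Prisma schema assumed: an Order model
// with a userId column.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// As generated: compiles, passes the happy-path demo (the demo user sees
// their orders -- along with everyone else's, which nobody looked for).
export async function getOrders() {
  return prisma.order.findMany();
}

// As specified: the reviewer's job is noticing that this `where` clause
// was missing from the version above.
export async function getOrdersFor(userId: string) {
  return prisma.order.findMany({ where: { userId } });
}
```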
"68% of developers say they trust AI-assisted code reviews more than peer reviews for catching syntax and mechanical issues — but only 22% say they trust AI for architectural or design-level reviews." — Developer survey data, 2025
The human-in-the-loop reviewer is not a replacement for automated tools. It is the layer above them: the person who reads the diff with the spec in one hand and asks whether this code actually does what the product promised users it would do.
This role is not new. What is new is its importance. Before AI coding tools, developers were intimate with every line they wrote. They made hundreds of small security decisions consciously. With AI coding tools, those decisions are made invisibly — by a model optimizing for plausibility, not correctness. The human reviewer is the person who makes those invisible decisions visible again.
Researchers at Stanford and MIT analyzed this dynamic directly: "Traditional SDLC mechanisms — peer review, code author familiarity, incremental change management — function differently when a developer is primarily reviewing AI-generated output rather than reasoning through code they authored." The familiar checkpoints exist, but they are not doing the same work.
6. How to Close the Verification Gap
There is no single intervention that solves this. But there is a hierarchy of impact:
Move the checkpoint upstream, not downstream
The most expensive place to find a vulnerability is in production, after it has been exploited. The second most expensive is after shipping but before exploitation. Verification before merge is cheaper than both. Verification built into the specification — before any code is written — is cheapest of all, because it defines what correct behavior looks like before the AI generates anything.
Separate the writer from the verifier
If the same agent writes both the code and the tests, you have moved the problem, not solved it. The verifier needs independence from the generator. This is why QA teams traditionally do not report to engineering managers, and why human reviewers bring value that is structurally different from more AI tooling.
Define "done" as spec-compliant, not demo-passing
The "demo passes, production breaks" pattern is almost always a definition problem. The demo proved the happy path works. Production requires every path to work. Closing the gap means writing acceptance criteria before building, and verifying against them, not against impressions.
Track AI-specific quality metrics
Defect density from AI-generated code. Regression rates in AI-touched modules. Review load per reviewer. Merge confidence scores. These are the numbers that will tell you whether your verification is working. Cycle time tells you how fast you are shipping. It does not tell you what you are shipping.
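None of this requires heavyweight tooling to begin. Here is a minimal sketch of the first metric, with an invented schema; how modules get tagged as AI-generated (commit trailers, tool telemetry, PR labels) is a local implementation choice.

```typescript
// Hypothetical schema for comparing defect density across AI-touched
// modules and the rest of the codebase.
interface ModuleStats {
  module: string;
  aiGenerated: boolean;  // however your org tags provenance
  linesChanged: number;
  defectsFound: number;  // bugs traced back to this module post-merge
}

function defectDensity(stats: ModuleStats[], aiOnly: boolean): number {
  const subset = stats.filter((s) => s.aiGenerated === aiOnly);
  const defects = subset.reduce((n, s) => n + s.defectsFound, 0);
  const lines = subset.reduce((n, s) => n + s.linesChanged, 0);
  return lines === 0 ? 0 : (defects / lines) * 1000; // defects per KLOC changed
}
```

The point is the comparison, not the absolute number: if AI-touched modules run a materially higher defect density than the rest of the codebase, your verification layer is not keeping pace with your generation layer.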
Apply external review at security boundaries
Auth flows. Permission models. Data access patterns. Payment processing. Any code path where a mistake affects user data or financial outcomes. These are the paths where AI-generated code fails at the highest rate and where the consequences of failure are highest. External review at these boundaries — by someone who did not write the code and is not optimistic about it — is the highest-leverage intervention available today.
Vibers closes the verification gap
Human review for every push — spec compliance, security patterns, architectural soundness. Not bots. Not another scanner. A qualified person who reads your diffs before they ship.
Install Vibers on GitHub