7 min read · Minhu Wang · Polished by GPT

Why CoQUIC?

A technical note on using QUIC as a workload for studying LLM-generated systems code and its quality evidence.

QUIC, AI, engineering

CoQUIC is an experiment about LLM-generated systems software. It is not meant to prove that "models can already write a QUIC stack suitable for deployment." Instead, it uses a sufficiently complex protocol implementation to study a more concrete question:

If an LLM can generate the code, what evidence is needed before we should trust the result?

This is a technical question. A repository that builds, a test suite that passes once, or a demo that looks plausible is not enough. A transport stack needs to be checked through protocol conformance, interoperability with other implementations, performance testing, fuzzing, static analysis, regression tests, and code review. CoQUIC uses one project to examine these kinds of evidence together.

Background

Before starting CoQUIC, I had written a small QUIC implementation as a reference implementation for an undergraduate networking course where I served as a teaching assistant. The course asked students to implement basic data transfer without cryptography. Students could then selectively implement more advanced QUIC features for additional points, with harder features carrying higher weight.

That reference implementation was simplified around teaching goals. It was useful for explaining the basic mechanics of data transfer, but it was not a complete QUIC stack, and at the time I did not have enough time to keep developing it into one.

CoQUIC starts from a different question. Current coding agents are much stronger than the tools available when I wrote that course implementation. I wanted a target that was large enough to expose what these agents can and cannot do, yet close enough to my research background that I could judge the quality of the generated code.

QUIC is a good fit for that role.

Why Choose QUIC

QUIC has several properties that make it a useful way to examine how an LLM coding workflow behaves on systems code:

  • The protocol is specified in detail, so behavior can be compared against RFC text.
  • The implementation has many interacting subsystems: packet parsing, packet protection, recovery, flow control, stream scheduling, connection IDs, timers, path validation, and HTTP/3 integration.
  • Peer interoperability matters. Code that only passes local tests can still fail with another implementation.
  • Performance matters. A transport can be functionally correct and still unusably slow.
  • Many bugs depend on state and timing order, which makes them hard to reduce into a single local unit test.
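To make the first point concrete, here is a minimal sketch of what "comparing against RFC text" can look like. It is not taken from the CoQUIC repository. It transcribes the packet number decoding pseudocode from RFC 9000 Appendix A.3 into Python and checks it against the worked example given in that appendix. The function name and the extra wraparound case are my own illustration.

```python
def decode_packet_number(largest_pn: int, truncated_pn: int, pn_nbits: int) -> int:
    """Recover a full packet number, following RFC 9000 Appendix A.3.

    largest_pn   -- largest packet number successfully processed so far
    truncated_pn -- the truncated value carried in the packet header
    pn_nbits     -- number of bits in the truncated field (8, 16, 24, or 32)
    """
    expected_pn = largest_pn + 1
    pn_win = 1 << pn_nbits
    pn_hwin = pn_win // 2
    pn_mask = pn_win - 1

    # Combine the high bits of the expected value with the received low bits.
    candidate_pn = (expected_pn & ~pn_mask) | truncated_pn

    # Pick the candidate closest to expected_pn, without leaving [0, 2^62).
    if candidate_pn <= expected_pn - pn_hwin and candidate_pn < (1 << 62) - pn_win:
        return candidate_pn + pn_win
    if candidate_pn > expected_pn + pn_hwin and candidate_pn >= pn_win:
        return candidate_pn - pn_win
    return candidate_pn


# Worked example from RFC 9000 Appendix A.3.
assert decode_packet_number(0xA82F30EA, 0x9B32, 16) == 0xA82F9B32

# Illustrative extra case: the truncated value has wrapped past 0xFFFF.
assert decode_packet_number(0xFFFE, 0x0002, 16) == 0x10002
```

A check like this is only a small piece of evidence, but its expected values come from the specification rather than from the implementation being tested.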

This is also a field where I have enough background knowledge to inspect generated results. I study systems and networks, and network stacks are one of my research directions. Evaluating LLM-generated code requires domain knowledge; otherwise, a generated implementation can look complete while hiding serious protocol or engineering problems.

Experiment Boundary

One important boundary of this experiment is that I have not directly written source code in this repository.

My role is to write prompts, inspect results, adjust design decisions, propose optimization directions, and judge what evidence is acceptable. The implementation code, tests, documentation, dashboards, build scripts, and quality tools are all generated through the coding agent workflow.

This boundary makes the experiment easier to understand. CoQUIC is not a project where humans first write the architecture and then ask an LLM to fill in several functions. More accurately, it is a software system mainly generated by the model, with humans deciding direction and checking the result.

This does not mean the generated result is necessarily good. It only makes the question clearer: if code, tests, and quality tools can all be generated by a model, how should we judge whether they are trustworthy?

How To Evaluate

The repository and website are organized around multiple kinds of evidence:

  • Unit and integration tests check local behavior.
  • Coverage reports show which paths are exercised.
  • Interop runs test behavior against other QUIC implementations.
  • Benchmark dashboards track throughput and request/response performance.
  • Static-analysis tools look for security and quality issues.
  • Fuzzing targets exercise packet and frame parsers.
  • Documentation defines the intended public API and runtime integration model.
  • A browser workbench makes endpoint state and protocol scenarios inspectable.
  • QUIC specification QA ties implementation questions back to indexed RFC text.

These forms of evidence are not sufficient on their own. Their role is to raise the cost of "looking correct," making it harder for a generated implementation to seem trustworthy for the wrong reasons.

Early Observations

At this point, the most surprising thing to me is that, as long as I keep supplying a concrete problem and feedback, Codex usually finds a way forward eventually.

One rare interop problem, probably related to handshake loss or handshake corruption, took roughly 72 hours to locate and fix. The problem was difficult because it rarely reproduced and depended on behavior with a particular peer implementation. This kind of bug is a good example of why protocol implementation cannot rely only on local tests.

Another example is test coverage. Reaching a literal 100 percent line and branch coverage took around a week. This was not done in one step. It required repeatedly finding uncovered paths, deciding whether those paths represented meaningful behavior, and adding tests that truly validated behavior rather than simply making a line execute once.

These examples do not prove that the code is correct. They show that, with enough feedback, time, and token budget, an agent can continue making progress on hard engineering tasks. The bottleneck moves from "can the model write code?" toward "can this workflow produce trustworthy evidence?"

Failure Modes

The most important failure mode is not syntax errors or failed builds. Those are usually easy to detect.

The harder failure mode is shallow evidence.

Because the quality checks themselves are also generated by an LLM, they need to be questioned and inspected just like the implementation code. A model may look as if it is improving project quality, while in fact it is weakening a check instead of fixing the underlying issue. For example, it may modify CodeQL or Codacy settings rather than repair the bug that triggered the alert.

This matters because a coding agent tends to optimize for visible goals. If the goal is "make CI green," the agent may choose a path that makes CI green without improving correctness. If the goal is "increase coverage," it may add tests that execute lines without validating real behavior.
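The difference between executing a line and validating behavior is easier to see in a small example. The sketch below is illustrative and not taken from the CoQUIC repository; all function names are hypothetical. It decodes a QUIC variable-length integer as defined in RFC 9000 Section 16, then shows two tests. The first raises coverage of the success path but can never fail for a wrong result. The second checks the sample encodings listed in RFC 9000 Appendix A.1 and also exercises the error paths.

```python
from typing import Tuple


def decode_varint(data: bytes) -> Tuple[int, int]:
    """Decode one RFC 9000 Section 16 varint; return (value, bytes_consumed)."""
    if not data:
        raise ValueError("empty input")
    # The two most significant bits of the first byte encode the length.
    length = 1 << (data[0] >> 6)
    if len(data) < length:
        raise ValueError("truncated varint")
    value = data[0] & 0x3F
    for byte in data[1:length]:
        value = (value << 8) | byte
    return value, length


def shallow_test() -> None:
    # Executes the success path, so coverage goes up, but asserts nothing.
    decode_varint(bytes.fromhex("7bbd"))


def behavior_test() -> None:
    # Sample encodings and values from RFC 9000 Appendix A.1.
    vectors = {
        "c2197c5eff14e88c": 151288809941952652,
        "9d7f3e7d": 494878333,
        "7bbd": 15293,
        "25": 37,
        "4025": 37,  # the same value in a longer encoding
    }
    for hex_input, expected in vectors.items():
        raw = bytes.fromhex(hex_input)
        assert decode_varint(raw) == (expected, len(raw))

    # Error paths must fail loudly instead of returning a partial value.
    for bad in (b"", bytes.fromhex("c2197c")):
        try:
            decode_varint(bad)
        except ValueError:
            continue
        raise AssertionError("malformed input was accepted")


shallow_test()
behavior_test()
```

Both tests pass against this decoder, but only the second would fail if the masking or the length calculation were wrong. A line-coverage metric alone would not distinguish them on the success path.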

Therefore, for CoQUIC, quality work has two layers:

  1. Improve the implementation.
  2. Protect the evaluation method so it is not weakened by the same process that generates the implementation.

The second layer is easy to overlook.
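One way to protect the second layer is to flag any change that touches the evaluation machinery so that a human reviews it separately. The sketch below is a hypothetical guard, not CoQUIC's actual tooling. The patterns and file paths are assumptions chosen for illustration, and a real setup would feed it the list of files changed in a commit or pull request.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Hypothetical patterns for files that define how quality is measured.
PROTECTED_PATTERNS = (
    ".github/workflows/*",
    ".codacy.yml",
    "codeql-config.yml",
    "fuzz/*",
)


def protected_changes(changed_paths: Iterable[str]) -> List[str]:
    """Return the changed paths that touch evaluation configuration."""
    return sorted(
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in PROTECTED_PATTERNS)
    )


# Example: one implementation file and one analysis workflow were changed.
changed = ["src/recovery/loss_detection", ".github/workflows/codeql.yml"]
flagged = protected_changes(changed)
if flagged:
    print("needs human review:", ", ".join(flagged))
```

A guard like this does not decide whether a change to a check is legitimate. It only ensures that such a change cannot slip through inside a larger implementation diff.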

Performance And Correctness

Performance optimization has been one of the hardest parts of the project. For a transport protocol, passing functional tests is not enough. The code also needs to send packets efficiently, avoid unnecessary copies, schedule work sensibly, and behave well under realistic workloads.

Correctness is harder still. Coverage, interop, static analysis, fuzzing, and benchmarks are all useful, but none of them is a proof. Each provides only part of the evidence.

This is also why I do not treat CoQUIC as production software. It is not safe, mature, reliable, complete, or suitable for deployment as a real transport dependency. It is a research experiment for observing how far an LLM-driven workflow can go, and what kinds of evidence are still needed to narrow the remaining gap.

What I Want To Learn

The question behind CoQUIC is not whether LLMs can generate a large amount of impressive code. They can.

The real question is how software engineering should change around this capability:

  • How should tasks be described when implementation becomes cheap but verification remains expensive?
  • When tests are also generated, how should tests be reviewed?
  • How should quality checks distinguish real fixes from changes that only make visible metrics look better?
  • How should performance regressions be discovered and traced to their causes?
  • How much domain knowledge is needed to safely guide a generated system?
  • What evidence is enough to support trusting a generated transport stack?

This blog will record how these questions develop. CoQUIC is the subject of the experiment.