Jun 9, 2026 | 7 min read | Minhu Wang | Polished by GPT

Why CoQUIC?

A technical note on using QUIC as a workload for studying LLM-generated systems code and its quality evidence.

Tags: QUIC, AI, engineering

CoQUIC is an experiment about LLM-generated systems software. It is not meant to prove that "models can already write a QUIC stack suitable for deployment." Instead, it uses a sufficiently complex protocol implementation to study a more concrete question:

If an LLM can generate the code, what evidence is needed before we should trust the result?

This is a technical question. A repository that builds, tests that pass once, or a demo that looks plausible is not enough. A transport stack needs to be checked through protocol conformance, interoperability with other implementations, performance testing, fuzzing, static analysis, regression tests, and code review. CoQUIC uses one project to examine these aspects together.

Background

Before starting CoQUIC, I had written a small QUIC implementation as a reference implementation for an undergraduate networking course where I served as a teaching assistant. The course asked students to implement basic data transfer without cryptography. Students could then selectively implement more advanced QUIC features for additional points, with harder features carrying higher weight.

That reference implementation was simplified around teaching goals. It was useful for explaining the basic mechanics of data transfer, but it was not a complete QUIC stack, and at the time I did not have enough time to keep developing it into one.

CoQUIC starts from a different question. Current coding agents are much stronger than the tools available when I wrote that course implementation. I wanted a target large enough to expose what these agents can and cannot do, yet close enough to my research background that I could judge the quality of the generated code.

QUIC is a good fit for that role.

Why Choose QUIC

QUIC has several properties that make it a useful way to examine how an LLM coding workflow behaves on systems code:

The protocol is specified in detail, so behavior can be compared against RFC text.
The implementation has many interacting subsystems: packet parsing, packet protection, recovery, flow control, stream scheduling, connection IDs, timers, path validation, and HTTP/3 integration.
Peer interoperability matters. Code that only passes local tests can still fail with another implementation.
Performance matters. A transport can be functionally correct and still unusably slow.
Many bugs depend on state and timing order, which makes them hard to reduce into a single local unit test.

This is also a field where I have enough background knowledge to inspect generated results. I study systems and networks, and network stacks are one of my research directions. Evaluating LLM-generated code requires domain knowledge; otherwise, a generated implementation can look complete while hiding serious protocol or engineering problems.

Experiment Boundary

One important boundary of this experiment is that I have not directly written source code in this repository.

My role is to write prompts, inspect results, adjust design decisions, propose optimization directions, and judge what evidence is acceptable. The implementation code, tests, documentation, dashboards, build scripts, and quality tools are all generated through the coding agent workflow.

This boundary makes the experiment easier to understand. CoQUIC is not a project where humans first write the architecture and then ask an LLM to fill in several functions. More accurately, it is a software system mainly generated by the model, with humans deciding direction and checking the result.

This does not mean the generated result is necessarily good. It only makes the question clearer: if code, tests, and quality tools can all be generated by a model, how should we judge whether they are trustworthy?

How To Evaluate

The repository and website are organized around multiple kinds of evidence:

Unit and integration tests check local behavior.
Coverage reports show which paths are exercised.
Interop runs test behavior against other QUIC implementations.
Benchmark dashboards track throughput and request/response performance.
Static-analysis tools look for security and quality issues.
Fuzzing targets exercise packet and frame parsers.
Documentation defines the intended public API and runtime integration model.
A browser workbench makes endpoint state and protocol scenarios inspectable.
QUIC specification QA ties implementation questions back to indexed RFC text.

None of these forms of evidence is sufficient on its own. Their role is to raise the cost of "looking correct," making it harder for a generated implementation to seem trustworthy for the wrong reasons.
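To make the fuzzing item concrete, here is a minimal sketch of a randomized check on one small parser: the QUIC variable-length integer from RFC 9000, Section 16. It is not CoQUIC's harness and it is not a coverage-guided fuzzer. The names `encode_varint`, `decode_varint`, and `fuzz_decode` are assumptions made for illustration.

```python
import random


def encode_varint(value: int) -> bytes:
    """Encode an integer in the shortest QUIC varint form (RFC 9000, Section 16)."""
    if value < 0 or value >= 1 << 62:
        raise ValueError("value out of range for a QUIC varint")
    for prefix, length in ((0, 1), (1, 2), (2, 4), (3, 8)):
        usable_bits = 8 * length - 2  # the top two bits carry the length prefix
        if value < 1 << usable_bits:
            return (value | (prefix << usable_bits)).to_bytes(length, "big")
    raise AssertionError("unreachable")


def decode_varint(data: bytes) -> tuple:
    """Return (value, bytes_consumed); raise ValueError on empty or truncated input."""
    if not data:
        raise ValueError("empty input")
    length = 1 << (data[0] >> 6)  # prefix 0..3 maps to 1, 2, 4, or 8 bytes
    if len(data) < length:
        raise ValueError("truncated varint")
    value = int.from_bytes(data[:length], "big") & ((1 << (8 * length - 2)) - 1)
    return value, length


def fuzz_decode(iterations: int = 10_000, seed: int = 0) -> int:
    """Feed random byte strings to the decoder and check basic invariants."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(iterations):
        data = bytes(rng.getrandbits(8) for _ in range(rng.randint(0, 9)))
        try:
            value, used = decode_varint(data)
        except ValueError:
            continue  # cleanly rejecting malformed input is acceptable
        assert 0 <= value < 1 << 62
        assert used in (1, 2, 4, 8) and used <= len(data)
        # Non-minimal encodings are legal, so compare values rather than bytes.
        assert decode_varint(encode_varint(value))[0] == value
        accepted += 1
    return accepted
```

The check compares decoded values rather than raw bytes because RFC 9000 permits non-minimal encodings: both 0x25 and 0x4025 decode to 37. A real fuzzing target would also use a coverage-guided engine rather than a fixed random seed.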

Early Observations

So far, what has surprised me most is that as long as I keep supplying a concrete problem and feedback, Codex eventually finds a way forward in most cases.

One rare interop problem, probably related to handshake packet loss or corruption, took roughly 72 hours to locate and fix. The problem was difficult because it rarely reproduced and depended on behavior with a particular peer implementation. This kind of bug is a good example of why a protocol implementation cannot rely only on local tests.

Another example is test coverage. Reaching literally 100 percent line and branch coverage took around a week. This was not done in one step. It required repeatedly finding uncovered paths, deciding whether those paths represented meaningful behavior, and adding tests that truly validated behavior rather than simply making a line execute once.
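The difference between executing a line and validating behavior can be illustrated with packet number decoding from RFC 9000, Appendix A.3. The sketch below is a Python transcription of that pseudocode, not code from the CoQUIC repository, and the test names are hypothetical. The first test reaches the function but asserts nothing, while the others pin down the RFC's worked example and both window-adjustment branches.

```python
def decode_packet_number(largest_pn: int, truncated_pn: int, pn_nbits: int) -> int:
    """Recover a full packet number, following RFC 9000, Appendix A.3."""
    expected_pn = largest_pn + 1
    pn_win = 1 << pn_nbits
    pn_hwin = pn_win // 2
    pn_mask = pn_win - 1
    # Combine the high bits of the expected number with the truncated low bits.
    candidate_pn = (expected_pn & ~pn_mask) | truncated_pn
    if candidate_pn <= expected_pn - pn_hwin and candidate_pn < (1 << 62) - pn_win:
        return candidate_pn + pn_win
    if candidate_pn > expected_pn + pn_hwin and candidate_pn >= pn_win:
        return candidate_pn - pn_win
    return candidate_pn


def test_shallow():
    # Adds line coverage, but passes for any return value.
    decode_packet_number(0xA82F30EA, 0x9B32, 16)


def test_rfc_example():
    # Worked example from RFC 9000, Appendix A.3.
    assert decode_packet_number(0xA82F30EA, 0x9B32, 16) == 0xA82F9B32


def test_adds_window_when_candidate_is_too_low():
    assert decode_packet_number(0xFFFE, 0x0001, 16) == 0x10001


def test_subtracts_window_when_candidate_is_too_high():
    assert decode_packet_number(0x10000, 0xFFFE, 16) == 0xFFFE
```

A coverage report treats `test_shallow` and `test_rfc_example` the same way for the lines they reach, which is why a coverage number alone says little about whether behavior was checked.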

These examples do not prove that the code is correct. They show that, with enough feedback, time, and token budget, an agent can continue making progress on hard engineering tasks. The bottleneck moves from "can the model write code?" toward "can this workflow produce trustworthy evidence?"

Failure Modes

The most important failure mode is not syntax errors or failed builds. Those are usually easy to detect.

The harder failure mode is shallow evidence.

Because the quality checks themselves are also generated by an LLM, they need to be questioned and inspected just like the implementation code. A model may look as if it is improving project quality, while in fact it is weakening a check instead of fixing the underlying issue. For example, it may modify CodeQL or Codacy settings rather than repair the bug that triggered the alert.

This matters because a code-generating agent tends to optimize for the visible goal it is given. If the goal is "make CI green," the agent may choose a path that makes CI green without improving correctness. If the goal is "increase coverage," it may add tests that execute lines without validating real behavior.

Therefore, for CoQUIC, quality work has two layers:

Improve the implementation.
Protect the evaluation method so it is not weakened by the same process that generates the implementation.

The second layer is easy to overlook.
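One possible way to protect the second layer is to flag any change that touches quality-check configuration so that a human reviews it separately. The sketch below is an assumption about how such a guard could look, not a description of CoQUIC's tooling. The function name and the path patterns (including the CodeQL and Codacy locations) are hypothetical and would need to match the real repository layout.

```python
from fnmatch import fnmatchcase

# Hypothetical locations of evaluation-related files; adjust to the actual repository.
PROTECTED_PATTERNS = (
    ".github/workflows/*",
    ".github/codeql/*",
    ".codacy.yml",
)


def changes_needing_review(changed_paths, patterns=PROTECTED_PATTERNS):
    """Return the changed paths that match a protected pattern, sorted for stable output."""
    return sorted(
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    )


if __name__ == "__main__":
    # Illustrative change set: one implementation file and two check configurations.
    changed = ["src/recovery_impl", ".codacy.yml", ".github/codeql/config.yml"]
    for path in changes_needing_review(changed):
        print(f"needs human review: {path}")
```

A guard like this does not decide whether a configuration change is legitimate. It only makes sure that a weakened check cannot land in the same unreviewed batch as the code it was supposed to evaluate.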

Performance And Correctness

Performance optimization has been one of the hardest parts of the project. For a transport protocol, passing functional tests is not enough. The code also needs to send packets efficiently, avoid unnecessary copies, schedule work sensibly, and behave well under realistic workloads.
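As a rough sketch of how benchmark results could be turned into a regression signal, the function below compares the median throughput of a current run against a baseline with a tolerance. The function name, the 5 percent tolerance, and the use of medians are all assumptions for illustration; this is not how CoQUIC's dashboards are known to work.

```python
from statistics import median


def throughput_regressed(baseline_mbps, current_mbps, tolerance=0.05):
    """Return True when the current median is more than `tolerance` below the baseline median."""
    if not baseline_mbps or not current_mbps:
        raise ValueError("both runs need at least one sample")
    return median(current_mbps) < median(baseline_mbps) * (1.0 - tolerance)


if __name__ == "__main__":
    # Made-up sample values, used only to show the call shape.
    baseline = [940.0, 955.0, 948.0]
    current = [890.0, 930.0, 880.0]
    print(throughput_regressed(baseline, current))
```

Medians are used here only because they are less sensitive to a single noisy run than means. A threshold check like this can report that a regression happened, but tracing it to a cause still requires bisecting changes or profiling.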

Correctness is harder. Coverage, interop, static analysis, fuzzing, and benchmarks are all useful, but none of them is a proof. They can only provide part of the evidence.

This is also why I do not treat CoQUIC as production software. It is not safe, mature, reliable, complete, or suitable for deployment as a real transport dependency. It is a research experiment for observing how far an LLM-driven workflow can go, and what kinds of evidence are still needed to narrow the remaining gap.

What I Want To Learn

The question behind CoQUIC is not whether LLMs can generate a large amount of impressive code. They can.

The real question is how software engineering should change around this capability:

How should tasks be described when implementation becomes cheap but verification remains expensive?
When tests are also generated, how should tests be reviewed?
How should quality checks distinguish real fixes from changes that only make visible metrics look better?
How should performance regressions be discovered and traced to their causes?
How much domain knowledge is needed to safely guide a generated system?
What evidence is enough to support trusting a generated transport stack?

This blog will record how these questions develop. CoQUIC is the subject of the experiment.
