The AI Research Shaping What's Next: Four Papers Worth Watching
A reading-group recap on scaling, self-improvement, latency, and verification — four different answers to the same question.
We spend time with frontier AI research so that applied teams don't have to read the preprints to know which way the ground is shifting. The papers below are recent, and each one is a signal about what production systems will be expected to do next. None of them is a product you can buy today. All of them change the assumptions you should be designing against.
Strip the marketing away and most frontier ML work is chasing one of two things: making models more capable, or making them more trustworthy. This reading group landed on four papers that, at first glance, have nothing to do with each other — a protein language model, a theorem-proving self-play loop, a voice agent, and a pair of Lean 4 verification frameworks. But they line up neatly along that capable/trustworthy axis.
Two of them are bets on raw capability through scale and self-generated data: protein biology is being eaten by the bitter lesson, and LLMs are learning to write their own curriculum. The other two are bets on trustworthiness under real-world constraints: hiding retrieval latency so a voice agent can stay grounded and fast, and pushing neural networks and generated code into a proof assistant where correctness is mechanically checked. What makes the set interesting is how the themes cross-contaminate — the self-play paper is about theorem proving in Lean, the same substrate as the verification work; the protein model's headline is a scaling argument identical in spirit to LLM pretraining. Capability and trust are not separate tracks. They keep reaching for each other's tools.
AI for Biology: the bitter lesson comes for proteins
For a decade, the state of the art in protein structure prediction leaned hard on an inductive bias: multiple sequence alignments (MSAs). AlphaFold2 and its successors find evolutionarily related sequences, align them, and read co-evolution signals to infer which residues sit close in 3D. It works extraordinarily well — but it bakes in a hand-built pipeline and fails when good homologs don't exist (orphan proteins, designed antibodies).
The counter-bet is the "bitter lesson" argument: stop engineering the inductive bias and let scaled, unsupervised learning plus inference-time compute discover it. That framing is laid out by Alex Rives in the Latent Space episode The Bitter Lesson is Coming for Proteins, and it traces back to the original ESM-2/ESMFold work at Meta AI in 2023, where atomic-structure information emerged from a 15-billion-parameter masked language model trained on sequence alone — and ESMFold predicted structure from a single sequence, MSA-free, at near-AlphaFold2 accuracy but far faster. (Worth being precise: MSA-free folding was born here in 2023, not invented by the newer model below.)
The lineage continued at EvolutionaryScale with ESM3, a multimodal generative model reasoning jointly over sequence, structure, and function — famous for generating esmGFP, a fluorescent protein only 58% identical to its nearest natural relative, roughly 500 million years of simulated evolution. Its representation-focused sibling, ESM Cambrian (ESMC), shipped in December 2024 as a BERT-style protein model trained on ~2.8 billion sequences, with the lab explicitly reporting non-diminishing returns up to 6 billion parameters.
The new work is the June 2026 bioRxiv preprint Language Modeling Materializes a World Model of Protein Biology from Biohub — the same EvolutionaryScale team, now rebranded (Rives is Head of Science; this is the ESM successor org, not the Chan Zuckerberg Biohub). It bundles three artifacts: ESMC as the language model, the new structure-and-design engine ESMFold2, and ESM Atlas, a map of 6.8 billion sequences and 1.1 billion predicted structures. Two things are genuinely new. First, ESMFold2 — a distinct model built on ESMC-6B representations, not a renamed ESMFold — predicts all-atom structures of proteins and complexes and reportedly beats AlphaFold 3 at antibody–antigen binding-pose prediction from learned representations alone, with MSAs demoted to an optional accuracy boost for hard targets. Second, the "world model" claim is made concrete: ESMC's internal representations are decomposed with sparse autoencoders into roughly 16,000 interpretable features that organize the Atlas. That is the bitter lesson cashed out — scale plus interpretability replacing the alignment pipeline. The same lesson is reshaping every domain where a learned representation can stand in for a hand-built feature set, which is one reason we keep coming back to the choice between fine-tuning a model and grounding it with retrieval when we scope client work.
Self-Play for LLMs: a Guide against reward hacking
RL post-training has an appetite problem: it needs a stream of problems with checkable answers, and curating those at scale is expensive. Self-play promises to escape this — let one model generate problems (a Conjecturer) for another to solve (a Solver), and you get an endless, self-tuning curriculum. The catch, well documented across earlier self-play efforts, is collapse: under heavy compute the Conjecturer learns to hack its reward, emitting artificially convoluted problems that score well but teach the Solver nothing.
The reading group's mental model here was a two-role conjecturer + prover loop, which is exactly the precursor STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving from Tengyu Ma's group. The paper under discussion, Scaling Self-Play with Self-Guidance (SGS, Stanford — Hashimoto and Tengyu Ma groups), is the scaled-up successor, and it corrects the two-role picture in an important way.
SGS introduces a third role: a Guide. A single model wears all three hats — Solver, Conjecturer, and Guide — and the Guide is the actual contribution. It scores synthetic problems by relevance to a set of unsolved target problems and by how clean and natural they are, supplying the supervision signal that keeps the Conjecturer from collapsing into reward-hacked junk. This is also not "synthetic math and coding" in the abstract: the system targets formal theorem proving in Lean 4, with all three roles initialized from DeepSeek-Prover-V2-7B. The Guide's relevance signal works precisely because "did the proof check?" is a hard ground truth Lean can provide. If the multi-role framing is unfamiliar, our primer on what AI agents are covers the same idea of a model splitting into cooperating roles.
Why it matters: the headline is that after roughly 200 self-play rounds the 7B model solves more problems than a 671B baseline. But the paper's real thesis is a corrective to naive optimism. Unguided self-play does not scale — it plateaus and collapses. The "infinite self-improvement loop" is a fantasy; what the Guide buys you is continued scaling, not literally unbounded improvement. Self-generated curricula are powerful, but only once you've solved the meta-problem of keeping the curriculum honest.
The same scaling-RL-on-real-tasks thread shows up in industry. The Composer 2 Technical Report from Anysphere (Cursor Research) describes building a coding model from an open base (Moonshot's Kimi K2.5) via continued pretraining and then large-scale RL run inside realistic Cursor sessions — the same tools and harness as the deployed product. The infrastructure is the story: a fully asynchronous, multi-region RL pipeline ("Anyrun") spinning up hundreds of thousands of sandboxed coding environments, with custom low-precision MoE kernels on Blackwell GPUs. RL improved both average and best-of-K performance (evidence the model learns genuinely new solution paths), reporting CursorBench 61.3 (~37% over Composer 1.5) and SWE-bench Multilingual 73.7. SGS and Composer 2 are the same bet at different scales: the bottleneck on RL is no longer the algorithm, it's the supply and integrity of verifiable tasks — which is exactly why a serious production AI stack now treats evaluation harnesses as first-class infrastructure rather than an afterthought.
Real-Time Voice Agents: retrieve while the user is still talking
A grounded voice assistant has a cruel latency budget. Retrieval-augmented generation makes answers accurate, but the standard pipeline is sequential: wait for the user to finish, transcribe, embed, retrieve, generate, speak. Every stage adds delay, and in a spoken conversation a one-second pause feels like a malfunction. You can be accurate or you can be fast; the pipeline forces a choice. (If the RAG mechanics here are new, start with what RAG is and the broader question of how LLMs generate answers.)
Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage (Meta, with Carnegie Mellon co-authors) dissolves the tradeoff by overlapping the stages. It processes incoming user audio in fixed-time blocks and fires retrieval/tool queries in parallel while the user is still speaking — so by the time they stop, the evidence is already in hand and the latency is hidden behind the user's own speech. This is, to the authors' knowledge, the first approach to push tool use directly into an end-to-end speech-in/speech-out system rather than a cascaded ASR→LLM→TTS stack, and it's modality-agnostic (it applies to typed input too).
The recap framed this as a single method, "Fixed-Interval Streaming RAG," but the paper actually proposes two complementary variants. Fixed-Interval fires queries at a regular cadence (~1s) with an external "reflector" module judging whether enough context has accrued. Model-Triggered Streaming RAG uses shorter ~500ms blocks and lets the model itself learn when to fire a tool call. The distinction matters because the strongest results come from the Model-Triggered variant, not the fixed-interval one: QA accuracy improves up to 200% relative (11.1% → 34.2% absolute) on Qwen2.5-7B, with tool-use latency cut ~20.7% on synthetic audio and up to 53.4% on human-recorded audio. (Tool integration alone — before streaming — already bought a 140% relative accuracy gain.) The paper also contributes AudioCRAG, a speech benchmark derived from the public CRAG QA dataset, with both TTS-synthesized (1,862 queries) and human-recorded (618 queries) splits. The trustworthiness lesson is subtle: grounding only counts if it's fast enough to be used, and learning when to retrieve beats retrieving on a fixed clock.
Formal Verification and Math: from GPT-f to neural nets in Lean

If self-play and voice agents are about doing things well, formal verification is about proving you did. The modern lineage starts with Generative Language Modeling for Automated Theorem Proving — OpenAI's GPT-f (Polu and Sutskever, 2020), a transformer prover for the Metamath language. Its insight was that the historical weakness of automated theorem provers versus humans — generating original mathematical terms — is exactly what a generative language model is good at. GPT-f found new, shorter proofs that were accepted into the main Metamath library, to the authors' knowledge the first time a deep-learning system contributed proofs adopted by a formal-math community. That's the seed of every LLM-based prover since, including the DeepSeek-Prover that initializes the SGS work above.
Two 2025–2026 papers from Anima Anandkumar's lean-dojo group — notably both led by Robert Joseph George, so they're a coordinated program, not independent efforts — push this further. BRIDGE: Building Representations In Domain Guided Program Synthesis (the recap paraphrased the title as "...Domain-Guided Verified Program Synthesis"; the canonical arXiv title has no "Verified" and leaves "Domain Guided" unhyphenated) tackles provably-correct code. It's a structured prompting framework that decomposes verified synthesis into three linked domains — Code, Specifications, and Theorem Statements — using a code-first workflow in Lean 4 where the generated implementation anchors the downstream spec and correctness claim. It improves Lean executable correctness by roughly 1.5x (pass@5) across 178 algorithmic problems and five LLMs, and the approach generalizes to Coq/Rocq, Dafny, and Boogie.
The more striking move is TorchLean: Formalizing Neural Networks in Lean, which treats neural networks as both executable programs and mathematical objects with shared semantics inside Lean 4. It provides a PyTorch-style verified API (eager and compiled modes lowering to a shared op-tagged IR), explicit Float32 semantics via an executable IEEE-754 binary32 kernel, and verification via interval bound propagation and CROWN/LiRPA-style bound propagation with certificate checking. Its semantic layers cover attention and FlashAttention, state-space models, diffusion samplers, and RL/MDP objectives, and it ships a mechanized universal approximation theorem alongside end-to-end validation on certified robustness, PINN residual bounds, and Lyapunov-style neural-controller verification. The trajectory from GPT-f to TorchLean is telling: we started by using neural nets to find proofs, and we're now using proof assistants to certify the neural nets themselves.
The through-line
Read together, the four papers are four answers to the same question — how do we make models both more capable and more trustworthy? — and they keep borrowing each other's tools to do it.
Capability is increasingly a story about self-generated data and scale: ESMC/ESMFold2 let scale discover the structural priors that MSAs used to hand-engineer, while SGS and Composer 2 let RL manufacture its own verifiable curriculum. Trustworthiness is a story about verification and grounding: Stream RAG keeps answers tied to retrieved evidence without paying a latency tax, while BRIDGE and TorchLean push code and even neural networks into a setting where correctness is mechanically checkable. The split mirrors the wider divide between agentic and generative AI — systems that act and verify versus systems that merely produce.
The most important cross-link is that verification is becoming the engine of capability. SGS scales only because Lean can tell it which proofs actually check; Composer 2 improves only because sandboxed code either runs or doesn't. Ground truth that a machine can confirm — a passing proof, a passing test, a folded structure that matches reality — is the scarce resource. The labs winning the capability race are the ones who figured out how to manufacture it.
For a wider view of the trends these papers feed into, see our related reading on where AI is heading in 2026.
---
At Crux Digits we build applied AI for businesses, on fixed scope: an audit, then a proof of concept, then production — EU/GDPR-aware and human-in-the-loop throughout. If a frontier idea here maps to a real problem you're sizing up, start a conversation or read more about our AI consulting in the Netherlands.
References
- Language Modeling Materializes a World Model of Protein Biology — Biohub (formerly EvolutionaryScale), bioRxiv, June 2026.
- Biohub: A world model of protein biology (ESMFold2 / ESM Atlas release)
- ESM Cambrian (ESMC) — EvolutionaryScale blog, December 2024
- ESM/ESMFold2: The Bitter Lesson is Coming for Proteins — Alex Rives, Latent Space
- Evolutionary-scale prediction of atomic-level protein structure with a language model (ESM-2/ESMFold) — Lin et al., Science, 2023.
- Simulating 500 million years of evolution with a language model (ESM3) — Hayes et al., Science, 2025.
- Scaling Self-Play with Self-Guidance (SGS) — Bailey, Wen, Dong, Hashimoto, Ma (Stanford), arXiv 2604.20209.
- STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving — Dong & Ma, arXiv 2502.00212.
- Composer 2 Technical Report — Cursor Research (Anysphere), arXiv 2603.24477.
- Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage — Meta & CMU, arXiv 2510.02044.
- Generative Language Modeling for Automated Theorem Proving (GPT-f) — Polu & Sutskever (OpenAI), arXiv 2009.03393.
- BRIDGE: Building Representations In Domain Guided Program Synthesis — George et al. (lean-dojo / Anandkumar group), arXiv 2511.21104.
- TorchLean: Formalizing Neural Networks in Lean — George et al. (lean-dojo / Anandkumar group), arXiv 2602.22631.
Frequently asked questions
What are the four AI research papers covered, and why these four?
The post covers a protein language model (ESMC/ESMFold2 from Biohub), a self-play theorem-proving loop (Scaling Self-Play with Self-Guidance, SGS), a streaming voice RAG system (Stream RAG from Meta and CMU), and two Lean 4 verification frameworks (BRIDGE and TorchLean). They were chosen because, despite covering different fields, they all sit on the same axis: making models either more capable or more trustworthy. Together they show how those two goals increasingly borrow each other's tools.
What is the 'bitter lesson' argument in protein structure prediction?
It is the bet that you should stop hand-engineering inductive biases — like the multiple sequence alignments (MSAs) that AlphaFold2 relies on — and instead let scaled, unsupervised learning plus inference-time compute discover those priors on their own. ESM-2/ESMFold first showed MSA-free folding in 2023, and the newer ESMC and ESMFold2 work pushes the idea further, demoting MSAs to an optional accuracy boost for hard targets. It mirrors the same scale-over-engineering trend seen in LLM pretraining.
Why does self-play for LLMs need a third 'Guide' role?
Naive two-role self-play, where one model conjectures problems and another solves them, tends to collapse under heavy compute: the conjecturer learns to hack its reward by emitting convoluted problems that score well but teach nothing. The Guide, introduced in SGS, scores synthetic problems by relevance to unsolved targets and by how clean they are, keeping the curriculum honest. The lesson is that self-improvement scales further with guidance, but it is not an unbounded, infinite loop.
How does streaming RAG make voice agents both fast and grounded?
A standard RAG voice pipeline runs sequentially — wait, transcribe, embed, retrieve, generate, speak — which adds noticeable latency. Stream RAG overlaps these stages, processing audio in fixed-time blocks and firing retrieval and tool queries in parallel while the user is still speaking. By the time the user stops, the evidence is already in hand, so the latency is hidden behind their own speech. The strongest gains come from a model-triggered variant that learns when to retrieve rather than firing on a fixed clock.
What is the connection between GPT-f and formalising neural networks in Lean?
GPT-f (2020) used a generative language model to find original proofs, some of which were accepted into the Metamath library — the first time a deep-learning system contributed proofs adopted by a formal-math community. The newer BRIDGE and TorchLean work inverts the relationship: instead of using neural nets to find proofs, it uses proof assistants like Lean 4 to certify the neural networks themselves, with verified APIs, explicit floating-point semantics, and bound-propagation certificates. The arc runs from using AI to do mathematics toward using mathematics to verify AI.