SPICE: Self-Play In Corpus Environments Improves Reasoning

An overview of our paper, SPICE (Self-Play In Corpus Environments). We introduce SPICE, a reinforcement learning framework where a single model plays two roles: a Challenger that mines a large document corpus to pose diverse, document-grounded reasoning tasks, and a Reasoner that solves them without document access. Corpus grounding supplies the external signal that ungrounded self-play lacks, fixing the hallucination amplification and information symmetry that cause closed-loop self-play to plateau, and yielding consistent gains across mathematical (+8.9%) and general (+9.8%) reasoning. More than a result, it is a first concrete step toward agents that generate their own problems by interacting with the world rather than introspecting.


Motivation: The Ceiling of Closed-Loop Self-Play

Reinforcement learning with verifiable rewards (RLVR) gave language models a dramatic boost in reasoning, from OpenAI o1 to DeepSeek-R1. But RLVR still depends on human-curated problem sets and domain-specific reward engineering. Self-play promises an escape from that bottleneck: a model improves by generating its own problems and solving them, with feedback that needs no human in the loop. This is the recipe that took backgammon and Go past human level.

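Yet self-play for language models keeps hitting a wall. Methods that let a model generate its own questions from scratch achieve a few initial gains and then plateau or collapse. We trace this to two failure modes: hallucination amplification, where errors in self-generated questions and answers compound from one iteration to the next, and information symmetry, where the proposer and the solver share the same knowledge, so the problems stop being genuinely challenging.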

More fundamentally, a closed-loop proposer never observes anything outside its own weights, so the set of problems it can pose is bounded by its own distribution, no matter how cleverly you decode. Even methods that work hard to keep synthetic data diverse stay bounded by their starting coverage, which is ultimately just a compressed snapshot of the original pretraining data. R-Zero is a good example of the symptom: it improves for a few iterations, then its self-generated pseudo-labels drift and accuracy degrades. Absolute Zero sidesteps hallucination by grounding in a Python executor (domain-specific grounding), but that confines it to code.

SPICE outperforms state-of-the-art self-play methods for LLMs on Qwen3-4B-Base (right). Training a single model to be both a Challenger and a Reasoner, creating and solving challenging corpus-grounded tasks for itself via self-play RL, is what makes the difference (left).

The lesson we took away: a self-improving system needs to interact with something outside itself. A closed loop can only recombine what it already holds. The question is what that “something” should be for a language agent.

Core Insight: A Corpus Is an Environment

SPICE's Core Insight: Treat a large document corpus as the external environment a language agent interacts with. A single model plays a Challenger that mines raw documents to pose grounded reasoning tasks, and a Reasoner that solves them without seeing the document. Grounding answers in real text removes the fabricated-gold-answer failure mode; hiding the document from the Reasoner creates genuine information asymmetry; and the sheer diversity of the corpus supplies a far larger external signal than a closed loop can generate from its own weights.

The question I actually care about is bigger than this paper: can an agent generate its own problems by interacting with the world, in a lifelong, goal-conditioned loop where it observes something external and turns it into a goal it then tries to achieve? A corpus is the most tractable stand-in for that world I can train on today. Web documents are a compressed image of human knowledge and the digital world, the most accessible microcosm of it we have. By letting the model act on that corpus (sampling a document, extracting a verifiable answer, posing a question), SPICE turns a static dataset into an interactive environment. Crucially, the Reasoner never sees the source document, so the Challenger can ground a question and gold answer in content the Reasoner must genuinely reason to recover. Information flows from the world, through the Challenger, into challenges the Reasoner cannot trivially shortcut.

This single move addresses both failure modes at once: document grounding anchors every question and answer in real content (no hallucination), and the asymmetry plus corpus diversity keeps the challenge alive (no symmetry collapse).

Research Questions

RQ1: Does corpus-grounded self-play beat ungrounded self-play and domain-specific grounding?
RQ2: Do the Challenger and Reasoner actually co-evolve, or does one outrun the other?
RQ3: Which ingredients (grounding, Challenger learning, reward shape, task format) actually matter?
RQ4: What does the emergent curriculum look like qualitatively?

The SPICE Framework

SPICE is a self-play framework where a single LLM acts in two roles. The Challenger reads a raw document (no pre-existing questions or labels) and produces a (question, answer) pair, choosing either a multiple-choice or a free-form format depending on the content. The Reasoner answers without document access. The Challenger is rewarded for questions that maximize variance in the Reasoner's correctness; the Reasoner is rewarded for getting them right.

Challenger: Document-Grounded Task Generation

The Challenger uniformly samples a passage from the corpus (up to ~6K tokens), then takes multiple attempts to extract a verifiable task. It picks one of two formats based on the document: multiple-choice (four options, one document-grounded correct answer) or free-form with a typed answer (integer, expression, or string) extracted directly from the text. These typed formats act as universal verifiers, which is what frees SPICE from the executors and rule-based validators that confined prior self-play to math and code. The prompt walks the Challenger through multi-step information extraction, difficulty enhancement, and self-testing so questions are hard but still answerable without the source document.
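The two task formats described above can be pictured as a small record plus a validity check. This is only a sketch: the `Task` class, its field names, and the `is_valid` rules are hypothetical illustrations of "four options, one correct answer" and "typed free-form answer", not the schema the paper uses.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Free-form answer types named in the text.
ANSWER_TYPES = ("integer", "expression", "string")


@dataclass
class Task:
    """Hypothetical container for one Challenger-generated task."""
    question: str
    answer: str                              # option letter (MCQ) or typed value (free-form)
    fmt: str                                 # "mcq" or "free_form"
    options: Optional[Dict[str, str]] = None  # MCQ only: letter -> option text
    answer_type: Optional[str] = None         # free-form only: one of ANSWER_TYPES


def is_valid(task: Task) -> bool:
    """Cheap structural checks; an assumption about what 'valid' could mean."""
    if not task.question.strip() or not task.answer.strip():
        return False
    if task.fmt == "mcq":
        # Exactly four options labelled A-D, and the answer is one of the labels.
        return (
            task.options is not None
            and sorted(task.options) == ["A", "B", "C", "D"]
            and task.answer in task.options
        )
    if task.fmt == "free_form":
        if task.answer_type not in ANSWER_TYPES:
            return False
        if task.answer_type == "integer":
            try:
                int(task.answer.replace(",", ""))
            except ValueError:
                return False
        return True
    return False
```

A task that fails a check like this would be the kind of candidate that falls into the "otherwise" branch of the Challenger reward.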

Reasoner: Solving Without the Document

Given only the question, the Reasoner reasons step by step and boxes a final answer, relying purely on internalized knowledge. Its reward is binary correctness against the document-extracted gold answer, checked by a rule-based verifier (Math-Verify for expressions, exact match otherwise).
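As a rough illustration of the binary reward, the sketch below extracts the last boxed answer from a response and compares it to the gold answer by normalized exact match. The helper names (`extract_boxed`, `normalize`, `reasoner_reward`) and the normalization rule are assumptions made for this example; expression answers would go through Math-Verify in the actual setup, which is deliberately left out here to keep the block dependency-free.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1          # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None        # braces never closed


def normalize(answer: str) -> str:
    """Assumed normalization: collapse whitespace and ignore case."""
    return " ".join(answer.split()).casefold()


def reasoner_reward(response: str, gold: str) -> float:
    """1.0 if the boxed answer matches the gold answer exactly, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(gold) else 0.0
```

A response with no boxed answer scores 0.0 under this sketch, the same as a wrong answer.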

Variance-Based Curriculum Reward

The heart of the automatic curriculum is the Challenger’s reward. For each candidate question the Reasoner samples K answers (in our runs K = 8), and l_i = 1[â_i = a*] is the binary correctness of the i-th sample. The Challenger is rewarded by a Gaussian-shaped function of the variance of those correctness labels:

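$$
r_C(q, a^*) =
\begin{cases}
\exp\!\left(-\dfrac{\left(\mathrm{Var}(l_1, \ldots, l_K) - 0.25\right)^2}{2 \cdot 0.01}\right) & \text{if } q \text{ is valid} \\
\rho & \text{otherwise (penalty)}
\end{cases}
$$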

For binary correctness, Var = p(1 − p) is maximized at p = 0.5, so the reward peaks exactly at a 50% pass rate (variance = 0.25). Questions that are too easy or too hard get exponentially less reward. As the Reasoner improves, the only way for the Challenger to stay rewarded is to pose harder questions, which is precisely the automatic curriculum. Both roles share weights and are trained jointly with DrGRPO using role-specific advantages (Â_C = r_C − mean(r_C) and Â_R = r_R − mean(r_R)), centering each role on its own expectation so gradients reflect genuine learning signal rather than difficulty-induced noise.
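The reward and the role-specific centering can be written out in a few lines. This is a minimal sketch under stated assumptions: the variance is taken as the population variance p(1 − p) of the K sampled labels, the penalty ρ is passed in by the caller because the post does not give its value, and the function names are made up for this example. It shows only the reward and advantage arithmetic, not the DrGRPO update itself.

```python
import math
from typing import List, Sequence

TARGET_VARIANCE = 0.25  # variance of a binary outcome at a 50% pass rate
SIGMA_SQUARED = 0.01    # taken from the 2 * 0.01 denominator in the reward


def correctness_variance(labels: Sequence[int]) -> float:
    """Population variance of 0/1 correctness labels, i.e. p * (1 - p)."""
    p = sum(labels) / len(labels)
    return p * (1.0 - p)


def challenger_reward(labels: Sequence[int], valid: bool, penalty: float) -> float:
    """Gaussian-shaped reward on the variance; `penalty` stands in for rho."""
    if not valid:
        return penalty
    deviation = correctness_variance(labels) - TARGET_VARIANCE
    return math.exp(-(deviation ** 2) / (2 * SIGMA_SQUARED))


def centered_advantages(rewards: Sequence[float]) -> List[float]:
    """Role-specific advantage: each reward minus the mean reward of its role."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

With K = 8, a question the Reasoner answers correctly 4 times out of 8 earns the maximum reward of 1.0, while one it answers correctly 8 times out of 8 earns exp(−3.125), roughly 0.044.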

RQ1: Does corpus grounding beat ungrounded self-play?

We train on 20,000 documents (a 50/50 mix of Nemotron-CC-Math and NaturalReasoning) and evaluate on seven math and four general-reasoning benchmarks, across four base models. SPICE wins on every model family, beating ungrounded self-play (R-Zero), domain-specific grounding (Absolute Zero), and even a fixed stronger question generator (Strong Challenger, a frozen Qwen3-32B-Instruct posing the questions).

| Base Model | Base | Strong Chal. | R-Zero | Absolute Zero | SPICE | Δ vs Base |
|---|---|---|---|---|---|---|
| Qwen3-4B-Base | 35.8 | 43.0 | 39.5 | 40.7 | 44.9 | +9.1 |
| Qwen3-8B-Base | 43.0 | 45.6 | 46.3 | 46.5 | 48.7 | +5.7 |
| OctoThinker-3B-Hybrid | 14.7 | 21.0 | 20.3 | 21.7 | 25.2 | +10.5 |
| OctoThinker-8B-Hybrid | 20.5 | 28.2 | 29.9 | 29.4 | 32.4 | +11.9 |

Overall accuracy (average over 11 benchmarks). SPICE is the best method on all four base models. Gains span both mathematical reasoning (average +8.9%) and general reasoning (+9.8% across MMLU-Pro, GPQA-Diamond, SuperGPQA, and BBEH), confirming that corpus grounding develops broadly transferable capability rather than a single narrow skill.

The pattern is robust: SPICE delivers the best overall accuracy on all four base models. Ungrounded self-play (R-Zero) gives the smallest gains of the three baselines on the Qwen3-4B and OctoThinker-3B bases, and a fixed strong generator helps but cannot adapt to the Reasoner. SPICE’s learned, grounded Challenger wins everywhere, with the largest headroom and the largest gap over baselines on the weaker OctoThinker bases.

RQ2: Do Challenger and Reasoner actually co-evolve?

A healthy self-play system needs both roles to climb together. We probe this by freezing one role at a step-200 checkpoint and sweeping the other across later checkpoints (steps 200-640) on a pool of 128 documents.

Co-evolution of the two roles. (a) Against a fixed step-200 Reasoner, later Challengers drive the pass rate down from 55% to 35%, i.e. they are posing genuinely harder questions. (b) Against a fixed step-200 Challenger, later Reasoners drive the pass rate up from 55% to 85%, i.e. they are genuinely getting better at solving. Neither role is running away from the other.

This is the signature of a working curriculum: each role improves monotonically against a fixed opponent, and neither saturates the other. The ablation in the next section confirms that this joint co-evolution beats training the Reasoner against a fixed Challenger.
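The probe above amounts to measuring a pass rate while one role is frozen and the other is swept over checkpoints. The sketch below shows that bookkeeping only; the callable types, function names, and the choice of 8 samples per question are assumptions for illustration (the post states K = 8 for training, not for this probe), and checkpoint loading is left to the caller.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interfaces: a Challenger maps a document to (question, gold answer),
# a Reasoner maps a question to a final answer string.
ChallengerFn = Callable[[str], Tuple[str, str]]
ReasonerFn = Callable[[str], str]


def pass_rate(
    challenger: ChallengerFn,
    reasoner: ReasonerFn,
    documents: Sequence[str],
    samples_per_question: int = 8,
) -> float:
    """Fraction of sampled Reasoner answers that match the gold answer."""
    if not documents:
        raise ValueError("need at least one document")
    correct = 0
    total = 0
    for doc in documents:
        question, gold = challenger(doc)
        for _ in range(samples_per_question):
            correct += int(reasoner(question) == gold)
            total += 1
    return correct / total


def sweep_challengers(
    challengers_by_step: Dict[int, ChallengerFn],
    fixed_reasoner: ReasonerFn,
    documents: Sequence[str],
) -> Dict[int, float]:
    """Pass rate of one frozen Reasoner against Challengers from each step."""
    return {
        step: pass_rate(challenger, fixed_reasoner, documents)
        for step, challenger in sorted(challengers_by_step.items())
    }


def sweep_reasoners(
    fixed_challenger: ChallengerFn,
    reasoners_by_step: Dict[int, ReasonerFn],
    documents: Sequence[str],
) -> Dict[int, float]:
    """Pass rate of Reasoners from each step against one frozen Challenger."""
    return {
        step: pass_rate(fixed_challenger, reasoner, documents)
        for step, reasoner in sorted(reasoners_by_step.items())
    }
```

Under this framing, a falling curve from `sweep_challengers` and a rising curve from `sweep_reasoners` correspond to panels (a) and (b) of the figure.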

RQ3: Which ingredients matter?

(a) Challenger learning: training the Challenger alongside the Reasoner clearly beats a fixed Challenger, because the Reasoner is challenged harder and improves faster. (b) Corpus grounding: with grounding, performance climbs steadily to 43.9%; without it, the model stalls at 40.7%. External grounding is the decisive factor. (The 43.9% figure comes from controlled ablation runs; the full SPICE configuration reaches 44.9% by also adding mixed task formats and the variance reward.)

Three ablations pin down what is doing the work:

| Design choice | Variant | Overall |
|---|---|---|
| Corpus grounding | Without grounding | 40.7 |
| | With grounding | 43.9 |
| Challenger reward | Absolute Zero (1 − pass rate) | 40.7 |
| | R-Zero (uncertainty) | 43.6 |
| | Variance (SPICE) | 44.9 |
| Task format | MCQ only | 42.0 |
| | Free-form only | 43.7 |
| | MCQ + free-form | 44.9 |

Ablations on Qwen3-4B-Base. Corpus grounding, a learned Challenger, the variance-based reward, and mixed task formats each contribute, and combine for the best result. MCQ gives reliable verification; free-form encourages flexible reasoning; the corpus composition mirrors this (Nemotron-CC-Math helps math most, NaturalReasoning helps general most, both together win overall).

The reward result is the subtle one. Absolute Zero’s “harder is better” signal conflates difficulty with learning value; R-Zero’s uncertainty signal keys on agreement with a single mode; SPICE’s variance reward captures the full spread of the Reasoner’s answer distribution and peaks exactly where learning is richest, a balanced ~50% pass rate.
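To see why "harder is better" and the variance reward pull in different directions, it helps to evaluate both as a function of the Reasoner's pass rate p. The sketch below uses the "1 − pass rate" form exactly as the table labels it and the variance reward with the constants from the formula; the function names are invented for this example, and the R-Zero signal is omitted because the post does not give its formula.

```python
import math


def difficulty_reward(pass_rate: float) -> float:
    """'1 - pass rate' signal, as labelled for Absolute Zero in the ablation table."""
    return 1.0 - pass_rate


def variance_reward(pass_rate: float) -> float:
    """SPICE-style reward: Gaussian in the variance p(1 - p), centred at 0.25."""
    variance = pass_rate * (1.0 - pass_rate)
    return math.exp(-((variance - 0.25) ** 2) / (2 * 0.01))


if __name__ == "__main__":
    # Pass rates reachable with K = 8 samples per question.
    for correct in range(9):
        p = correct / 8
        print(f"p={p:.3f}  difficulty={difficulty_reward(p):.3f}  "
              f"variance={variance_reward(p):.3f}")
```

At p = 0 the difficulty signal is at its maximum even though the Reasoner gets no correct sample to learn from, whereas the variance reward is near its minimum there and symmetric around p = 0.5.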

RQ4: What does the curriculum look like?

Given the same document at different training steps, the Challenger visibly escalates, from extracting an explicit fact to demanding multi-step proportional reasoning that still resolves to a document-stated value.

Early (step 50), surface fact:
"What is the diameter of the Moon?" → B) 3,475 km

Late (step 480), multi-step reasoning:
"An alien moon of diameter 3,475 km creates perfect solar eclipses; its star has the Sun's diameter and the moon orbits at 374,000 km. Maintaining the same angular-size ratio, what is the planet-star distance?" → requires setting up Moon/Star angular-size equality, cross-multiplying, and matching to the document's value.

And the Reasoner escalates in lockstep, from an intuitive guess (“stars are ~1000× farther, so 374,000,000 km”) to a structured derivation: identify givens, write the angular-size equation, solve for the distance, then verify both angular sizes match. The curriculum produces authentic problem decomposition and self-correction, not memorization.
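The structured derivation the late-stage Reasoner produces can be checked numerically. The sketch below sets the two angular sizes equal, solves for the planet-star distance, and verifies that they match. The star diameter is an assumption: the question only says it equals the Sun's, so an approximate solar diameter of 1,391,400 km is supplied here and is not a value taken from the post or the source document.

```python
import math

MOON_DIAMETER_KM = 3_475       # from the question
MOON_ORBIT_KM = 374_000        # from the question
STAR_DIAMETER_KM = 1_391_400   # assumption: approximate diameter of the Sun


def angular_size(diameter_km: float, distance_km: float) -> float:
    """Angular diameter in radians of a sphere seen from a given distance."""
    return 2.0 * math.atan(diameter_km / (2.0 * distance_km))


def matching_distance(body_diameter_km: float, body_distance_km: float,
                      star_diameter_km: float) -> float:
    """Distance at which the star has the same angular size as the body.

    Equal angular sizes mean equal diameter/distance ratios, so
    star_distance = star_diameter * body_distance / body_diameter.
    """
    return star_diameter_km * body_distance_km / body_diameter_km


if __name__ == "__main__":
    distance = matching_distance(MOON_DIAMETER_KM, MOON_ORBIT_KM, STAR_DIAMETER_KM)
    print(f"planet-star distance: {distance:,.0f} km")
    print("angular sizes match:",
          math.isclose(angular_size(MOON_DIAMETER_KM, MOON_ORBIT_KM),
                       angular_size(STAR_DIAMETER_KM, distance)))
```

With the assumed solar diameter, the distance comes out near 1.5 × 10^8 km, far from the "1000× farther" guess of 374,000,000 km.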

How to Read SPICE: Interaction, RAG, and Borrowed Experience

A candid note on what SPICE is and isn’t, because the honest version is the more useful one.

The motivation I started from was bigger than the paper: a language agent should interact with the world to generate its own problems to train on. This is a lifelong, goal-conditioned RL setting where “learn from experience” means the agent’s interaction with the internet, because the internet is the most complete microcosm of the real (and digital) world we have. Goals generated that way can be continuous and open-ended, unlike essentially all current self-play, where the proposer samples problems from the model’s internal knowledge. This is what Absolute Zero and R-Zero do: with no externally observed input, the reachable problem distribution is bounded by the model itself.

SPICE is the first concrete step toward that picture: I used the pretraining corpus as a stand-in for the internet, and let the model interact with it to manufacture its own goals. The “interaction” is admittedly thin: in the end it amounts to sampling a document and conditioning the Challenger on it. Many readers will call this RAG, and the surface mechanics are similar. The difference is what the document is for: RAG retrieves context to answer a query; SPICE uses the document to manufacture a question and a checkable answer, then hides it from the solver. The retrieval serves problem generation, not problem solving, and that asymmetry is the whole point. The mechanism also does not depend on the corpus being stale: point the same Challenger at a search API and a 2025-cutoff model could pull in 2026 documents and pose questions that genuinely stump its own weights. SPICE uses a fixed corpus, but nothing in the method requires that.

The deeper limitation is the real one: sampling documents is still building tasks out of human experience. The corpus is a microcosm of the internet, but the goals are still drawn from text people wrote. The fuller version of the vision is to let the model explore that environment with its own experience, generating its own problems, building its own worlds, rather than reading ours.

Where This Is Heading: From Corpus Grounding to Self-Generated Worlds

Where we're going next. SPIRAL showed self-play on fixed zero-sum games incentivizes transferable reasoning; SPICE showed that grounding self-play in an external corpus fixes hallucination amplification and information symmetry. The next constraint to cut is the environment itself: make it a learnable component that co-evolves with the agent, so the curriculum scales with the learner instead of being bounded by a fixed corpus or a fixed game. This is the direction I am now pursuing with Natasha Jaques, and it connects directly to long-standing ideas in open-endedness and unsupervised environment design, and to the broader move toward agents that learn from their own experience. SPICE grounds the goals in the world's text; the next step lets the agent build the worlds it learns in.

There is a clean arc here: fixed games (SPIRAL) → grounded tasks mined from a corpus (SPICE) → environments the agent designs and grows for itself. Each step removes a different constraint on open-ended self-improvement.

Conclusion

SPICE reframes self-improvement as interaction with an external corpus rather than closed-loop introspection. By splitting a single model into a document-grounded Challenger and a no-document Reasoner, and rewarding the Challenger for questions at the Reasoner’s frontier, SPICE:

  1. Beats ungrounded and domain-specific self-play across four base models, +9.1 / +5.7 / +10.5 / +11.9 overall, with broad transfer to both math (+8.9%) and general (+9.8%) reasoning.
  2. Sustains a genuine curriculum: Challenger and Reasoner co-evolve (fixed-Reasoner pass rate 55%→35%; fixed-Challenger 55%→85%) instead of collapsing.
  3. Isolates grounding as the load-bearing ingredient: removing the corpus drops performance from 43.9% to 40.7%, and the variance-based curriculum reward outperforms prior proposer rewards.

The takeaway is simple: closed-loop self-play runs out of road because a model cannot, by itself, be its own sufficient source of novelty and truth. We trained on only 20K documents and still saw no saturation, so the ceiling is the corpus, not the loop. Point the model at the world, even just the world’s text, and the loop opens back up.


In the spirit of the era of experience: the next round of progress will come from agents that learn from their own stream of experience, not from distilling what we already know.




Last updated: June 19, 2026.