An overview of our paper, SPICE (Self-Play In Corpus Environments). We introduce SPICE, a reinforcement learning framework where a single model plays two roles: a Challenger that mines a large document corpus to pose diverse, document-grounded reasoning tasks, and a Reasoner that solves them without document access. Corpus grounding supplies the external signal that ungrounded self-play lacks, fixing the hallucination amplification and information symmetry that cause closed-loop self-play to plateau, and yielding consistent gains across mathematical (+8.9%) and general (+9.8%) reasoning. More than a result, it is a first concrete step toward agents that generate their own problems by interacting with the world rather than introspecting.
Reinforcement learning with verifiable rewards (RLVR) gave language models a dramatic boost in reasoning, from OpenAI o1
Yet self-play for language models keeps hitting a wall. Methods that let a model generate its own questions from scratch achieve a few initial gains and then plateau or collapse. We trace this to two failure modes: hallucination amplification and information symmetry.
More fundamentally, a closed-loop proposer never observes anything outside its own weights, so the set of problems it can pose is bounded by its own distribution, no matter how cleverly you decode. Even methods that work hard to keep synthetic data diverse stay bounded by their starting coverage, which is ultimately just a compressed snapshot of the original pretraining data
The lesson we took away: a self-improving system needs to interact with something outside itself. A closed loop can only recombine what it already holds. The question is what that “something” should be for a language agent.
The question I actually care about is bigger than this paper: can an agent generate its own problems by interacting with the world, in a lifelong, goal-conditioned loop where it observes something external and turns it into a goal it then tries to achieve? A corpus is the most tractable stand-in for that world I can train on today. Web documents are a compressed image of human knowledge and the digital world, the most accessible microcosm of it we have. By letting the model act on that corpus (sampling a document, extracting a verifiable answer, posing a question), SPICE turns a static dataset into an interactive environment. Crucially, the Reasoner never sees the source document, so the Challenger can ground a question and gold answer in content the Reasoner must genuinely reason to recover. Information flows from the world, through the Challenger, into challenges the Reasoner cannot trivially shortcut.
This single move addresses both failure modes at once: document grounding anchors every question and answer in real content (no hallucination), and the asymmetry plus corpus diversity keeps the challenge alive (no symmetry collapse).
The Challenger uniformly samples a passage from the corpus (up to ~6K tokens), then takes multiple attempts to extract a verifiable task. It picks one of two formats based on the document: multiple-choice (four options, one document-grounded correct answer) or free-form with a typed answer (integer, expression, or string) extracted directly from the text. These typed formats act as universal verifiers, which is what frees SPICE from the executors and rule-based validators that confined prior self-play to math and code. The prompt walks the Challenger through multi-step information extraction, difficulty enhancement, and self-testing so questions are hard but still answerable without the source document.
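To make the Challenger loop concrete, here is a minimal sketch of "sample a document uniformly, then retry until a structurally valid task comes out". It is not our training code: the `Task` fields, the `propose` callable (a stand-in for the prompted model), the `max_attempts` default, and the letter-keyed MCQ answers are all assumptions made for illustration, and the structural check is far weaker than the self-testing the prompt asks for.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Typed free-form answers named in the text.
ANSWER_TYPES = {"integer", "expression", "string"}


@dataclass
class Task:
    question: str
    answer: str
    fmt: str                      # "mcq" or "free_form"
    options: Sequence[str] = ()   # four options for MCQ, empty otherwise
    answer_type: str = "string"   # only meaningful for free-form tasks


def is_valid(task: Task) -> bool:
    """Cheap structural checks only; answerability is not tested here."""
    if not task.question.strip() or not task.answer.strip():
        return False
    if task.fmt == "mcq":
        # Assumption: options are keyed A-D and the answer is one letter.
        return len(task.options) == 4 and task.answer in {"A", "B", "C", "D"}
    if task.fmt == "free_form":
        if task.answer_type not in ANSWER_TYPES:
            return False
        if task.answer_type == "integer":
            return re.fullmatch(r"-?\d+", task.answer.strip()) is not None
        return True
    return False


def challenge(
    corpus: Sequence[str],
    propose: Callable[[str], Optional[Task]],  # hypothetical model call
    max_attempts: int = 4,                     # hypothetical attempt budget
    rng: Optional[random.Random] = None,
) -> Optional[Task]:
    """Sample one document uniformly, then retry until a valid task appears."""
    rng = rng or random.Random()
    doc = rng.choice(corpus)
    for _ in range(max_attempts):
        task = propose(doc)
        if task is not None and is_valid(task):
            return task
    return None  # every attempt failed; the caller decides what to do
```

Returning `None` on failure keeps the invalid case explicit, which is where a penalty for unusable questions would attach.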
Given only the question, the Reasoner reasons step by step and boxes a final answer, relying purely on internalized knowledge. Its reward is binary correctness against the document-extracted gold answer, checked by a rule-based verifier (Math-Verify for expressions, exact match otherwise).
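The exact-match branch of that verifier is simple enough to sketch. The version below pulls the last `\boxed{...}` out of a response and compares it to the gold answer after light normalization. The helper names and the normalization (lowercasing, collapsing whitespace) are my assumptions, and the expression branch that we hand to Math-Verify is deliberately left out rather than guessed at.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth, out = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None  # braces never closed


def _normalize(s: str) -> str:
    # Assumed normalization: case-insensitive, whitespace-insensitive.
    return " ".join(s.lower().split())


def reasoner_reward(response: str, gold: str) -> float:
    """Binary reward: 1.0 if the boxed answer exactly matches the gold answer."""
    pred = extract_boxed(response)
    if pred is None:
        return 0.0
    return 1.0 if _normalize(pred) == _normalize(gold) else 0.0
```

A response with no boxed answer scores zero, so a malformed answer is treated the same as a wrong one.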
The heart of the automatic curriculum is the Challenger’s reward. For each candidate question the Reasoner samples K answers (in our runs K = 8), and l_i = 1[â_i = a*] is the binary correctness of the i-th sample. The Challenger is rewarded by a Gaussian-shaped function of the variance of those correctness labels:
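$$
r_C(q, a^*) =
\begin{cases}
\exp\!\left(-\dfrac{\left(\mathrm{Var}(l_1, \ldots, l_K) - 0.25\right)^2}{2 \cdot 0.01}\right) & \text{if } q \text{ is valid} \\
\rho & \text{otherwise (penalty)}
\end{cases}
$$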
For binary correctness, Var = p(1 − p) is maximized at p = 0.5, so the reward peaks exactly at a 50% pass rate (variance = 0.25). Questions that are too easy or too hard get exponentially less reward. As the Reasoner improves, the only way for the Challenger to stay rewarded is to pose harder questions, which is precisely the automatic curriculum. Both roles share weights and are trained jointly with DrGRPO (Â_C = r_C − mean(r_C) and Â_R = r_R − mean(r_R)), centering each role on its own expectation so gradients reflect genuine learning signal rather than difficulty-induced noise.
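A small sketch of the two computations described above: the variance-shaped Challenger reward and the mean-centered advantages. The constants 0.25 and 0.01 come from the formula; the function names are mine, and the penalty ρ is passed in by the caller because this post does not state its value (the -0.1 in the usage lines is a placeholder, not our setting).

```python
import math
from typing import List, Sequence


def challenger_reward(
    labels: Sequence[int],   # K binary correctness labels from the Reasoner
    penalty: float,          # rho; value not given in the text, caller supplies it
    valid: bool = True,
    target: float = 0.25,    # variance at a 50% pass rate
    sigma2: float = 0.01,
) -> float:
    """Gaussian-shaped reward on the variance of binary correctness labels."""
    if not valid:
        return penalty
    p = sum(labels) / len(labels)   # empirical pass rate
    var = p * (1.0 - p)             # population variance of 0/1 labels
    return math.exp(-((var - target) ** 2) / (2.0 * sigma2))


def centered_advantages(rewards: Sequence[float]) -> List[float]:
    """Subtract the group mean, with no division by the standard deviation."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Usage with K = 8 samples per question (penalty is a placeholder value).
balanced = challenger_reward([1, 1, 1, 1, 0, 0, 0, 0], penalty=-0.1)
too_easy = challenger_reward([1] * 8, penalty=-0.1)
print(balanced, too_easy)
```

With these constants a question the Reasoner always solves has variance 0, so its reward is exp(−0.0625 / 0.02) = exp(−3.125), roughly 0.04, against 1.0 for a question solved exactly half the time.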
We train on 20,000 documents (a 50/50 mix of Nemotron-CC-Math
| Base Model | Base | Strong Chal. | R-Zero | Absolute Zero | SPICE | Δ vs Base |
|---|---|---|---|---|---|---|
| Qwen3-4B-Base | 35.8 | 43.0 | 39.5 | 40.7 | 44.9 | +9.1 |
| Qwen3-8B-Base | 43.0 | 45.6 | 46.3 | 46.5 | 48.7 | +5.7 |
| OctoThinker-3B-Hybrid | 14.7 | 21.0 | 20.3 | 21.7 | 25.2 | +10.5 |
| OctoThinker-8B-Hybrid | 20.5 | 28.2 | 29.9 | 29.4 | 32.4 | +11.9 |
The pattern is robust: SPICE delivers the best overall accuracy on all four base models. Ungrounded self-play (R-Zero) gives among the smallest gains on the Qwen3-4B and OctoThinker-3B bases, and a fixed strong generator helps but cannot adapt to the Reasoner, while SPICE’s learned, grounded Challenger wins everywhere, with the largest headroom and the largest gap over baselines on the weaker OctoThinker bases.
A healthy self-play system needs both roles to climb together. We probe this by freezing one role at a step-200 checkpoint and sweeping the other across later checkpoints (steps 200-640) on a pool of 128 documents.
This is the signature of a working curriculum: each role improves monotonically against a fixed opponent, and neither saturates the other. The ablation in the next section confirms that this joint co-evolution actually beats training the Reasoner against a fixed Challenger.
Three ablations pin down what is doing the work:
| Design choice | Variant | Overall |
|---|---|---|
| Corpus grounding | Without grounding | 40.7 |
| Corpus grounding | With grounding | 43.9 |
| Challenger reward | Absolute Zero (1 − pass rate) | 40.7 |
| Challenger reward | R-Zero (uncertainty) | 43.6 |
| Challenger reward | Variance (SPICE) | 44.9 |
| Task format | MCQ only | 42.0 |
| Task format | Free-form only | 43.7 |
| Task format | MCQ + free-form | 44.9 |
The reward result is the subtle one. Absolute Zero’s “harder is better” signal conflates difficulty with learning value; R-Zero’s uncertainty signal keys on agreement with a single mode; SPICE’s variance reward captures the full spread of the Reasoner’s answer distribution and peaks exactly where learning is richest, a balanced ~50% pass rate.
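To see the contrast numerically, the sketch below tabulates two of the three reward shapes over the pass rates reachable with K = 8 samples. I take the table's "1 − pass rate" label literally for the Absolute Zero variant, which may simplify what that method actually does, and I leave the R-Zero reward out because its formula is not given in this post.

```python
import math


def variance_reward(p: float) -> float:
    """Gaussian of the Bernoulli variance p(1 - p), centered at 0.25, sigma^2 = 0.01."""
    return math.exp(-((p * (1.0 - p) - 0.25) ** 2) / (2.0 * 0.01))


def one_minus_pass_rate(p: float) -> float:
    """'Harder is better' shape, read directly off the ablation table's label."""
    return 1.0 - p


K = 8
pass_rates = [c / K for c in range(K + 1)]

# Which pass rate does each shape push the Challenger toward?
best_variance = max(pass_rates, key=variance_reward)
best_harder = max(pass_rates, key=one_minus_pass_rate)

print("pass_rate  1-p    variance")
for p in pass_rates:
    print(f"{p:9.3f}  {one_minus_pass_rate(p):5.3f}  {variance_reward(p):8.3f}")
print("argmax:", best_harder, best_variance)
```

The "harder is better" shape is maximized by questions nobody solves, while the variance shape is maximized at the balanced pass rate and falls off symmetrically on both sides.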
Given the same document at different training steps, the Challenger visibly escalates, from extracting an explicit fact to demanding multi-step proportional reasoning that still resolves to a document-stated value.
And the Reasoner escalates in lockstep, from an intuitive guess (“stars are ~1000× farther, so 374,000,000 km”) to a structured derivation: identify givens, write the angular-size equation, solve for the distance, then verify both angular sizes match. The curriculum produces authentic problem decomposition and self-correction, not memorization.
A candid note on what SPICE is and isn’t, because the honest version is the more useful one.
The motivation I started from was bigger than the paper: a language agent should interact with the world to generate its own problems to train on, a lifelong, goal-conditioned RL setting where “learn from experience” means the agent’s interaction with the internet, because the internet is the most complete microcosm of the real (and digital) world we have. Goals generated that way can be continuous and open-ended, unlike essentially all current self-play, where the proposer samples problems from the model’s internal knowledge (this is what AZR
SPICE is the first concrete step toward that picture: I used the pretraining corpus as a stand-in for the internet, and let the model interact with it to manufacture its own goals. The “interaction” is admittedly thin: in the end it amounts to sampling a document and conditioning the Challenger on it. Many readers will call this RAG, and the surface mechanics are similar. The difference is what the document is for: RAG retrieves context to answer a query; SPICE uses the document to manufacture a question and a checkable answer, then hides it from the solver. The retrieval serves problem generation, not problem solving, and that asymmetry is the whole point. The mechanism also does not depend on the corpus being stale: point the same Challenger at a search API and a 2025-cutoff model could pull in 2026 documents and pose questions that genuinely stump its own weights. SPICE uses a fixed corpus, but nothing in the method requires that.
The deeper limitation is the real one: sampling documents is still building tasks out of human experience. The corpus is a microcosm of the internet, but the goals are still drawn from text people wrote. The fuller version of the vision is to let the model explore that environment with its own experience, generating its own problems, building its own worlds, rather than reading ours.
There is a clean arc here: fixed games (SPIRAL) → grounded tasks mined from a corpus (SPICE) → environments the agent designs and grows for itself. Each step removes a different constraint on open-ended self-improvement.
SPICE reframes self-improvement as interaction with an external corpus rather than closed-loop introspection. By splitting a single model into a document-grounded Challenger and a no-document Reasoner, and rewarding the Challenger for questions at the Reasoner’s frontier, SPICE:
The takeaway is simple: closed-loop self-play runs out of road because a model cannot, by itself, be its own sufficient source of novelty and truth. We trained on only 20K documents and still saw no saturation, so the ceiling is the corpus, not the loop. Point the model at the world, even just the world’s text, and the loop opens back up.
In the spirit of the era of experience