Templates and Samplers: Stage 2 Execution Configuration¶
Prerequisites: Before reading this document, familiarize yourself with:

- architecture.md - The five-stage methodology
- config.md - Experiment configuration (Stage 1)
Overview¶
Templates and samplers are Stage 2: Execution components that control how models are prompted and how they generate responses. While experiment configs (Stage 1) define WHAT to test, templates and samplers define HOW to execute those tests.
Location:
- Templates: templates/*.json
- Samplers: samplers/*.json
Used by: runner.py during evaluation execution
Templates: Prompting Strategies¶
Templates define how test cases are transformed into conversations — the logical system/user/assistant turn structure that the runner submits to the model. This is distinct from the model's chat template (a tokenizer-level concern), which is the subsequent stage that converts the conversation structure into raw text with special delimiter tokens before tokenization.
Templates fall into two categories: universal templates that work with any model, and model-specific templates that inject proprietary system prompts required to engage a particular model's reasoning mode.
Universal Templates¶
| Template | System Prompt | Format | Use Case |
|---|---|---|---|
| zeroshot | Task description | Direct answer | Baseline; instruction-tuned models |
| zeroshot-nosys | None | Direct answer | System-prompt ablation |
| zerocot-nosys | None | Step-by-step | Default for reasoning models |
zerocot-nosys is the standard template for reason-tuned models. It makes no assumptions about how the model should reason — the model performs its own expansion and reduction.
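Template files are plain JSON. As a rough sketch of what a universal template like zerocot-nosys could contain — the actual schema is defined by runner.py, and the field names here are illustrative assumptions, not the shipped format:

```json
{
  "system": null,
  "user": "{question}\n\nThink step by step, then give your final answer.",
  "format": "step-by-step"
}
```

The key properties are visible even in this sketch: no system prompt, and no constraints on how the model structures its reasoning.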
Model-Specific Templates¶
These templates inject a system prompt that activates a vendor-specific reasoning mode. Use only with the indicated model family.
Note: These are a kludge. Reasoning activation is a generation concern and belongs in the sampler — exposed as reasoning_effort, with the mechanics handled inside the model's chat template. Well-behaved models (gpt-oss, solar, mistral4) do exactly this. The model-specific templates here exist only for models whose serving stacks do not expose reasoning_effort, requiring reasoning activation to be smuggled in via the system prompt at the conversation level instead.
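In the well-behaved case, reasoning activation stays in the sampler file rather than the template. A sketch of such a sampler (the values are illustrative; reasoning_effort follows the OpenAI-compatible parameter convention described under Sampler Parameters below):

```json
{
  "temperature": 0.0,
  "max_tokens": 16384,
  "reasoning_effort": "high"
}
```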
| Template | Model Family | System Prompt | Format |
|---|---|---|---|
| llama-nemotron | Llama Nemotron (70B+) | "detailed thinking on" | Direct answer |
| llama-nemotron-8b | Llama Nemotron 8B | "detailed thinking on" | Step-by-step |
| magistral | Magistral | [THINK]/[/THINK] format | Direct answer |
magistral is a double-kludge: it requires both this template (reasoning activation, since reasoning_effort is unsupported) and samplers/magistral-max (which disables vLLM's special token suppression). Magistral marks thinking tokens as special tokens, and vLLM strips them by default — removing the delimiters and making the reasoning trace unparseable. Both files are required together.
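A sketch of what the samplers/magistral-max half of the pair needs to contain (temperature and token limit are illustrative assumptions; the essential line is skip_special_tokens, which stops vLLM from stripping the [THINK]/[/THINK] delimiters):

```json
{
  "temperature": 0.0,
  "max_tokens": 32768,
  "skip_special_tokens": false
}
```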
Usage¶
```
python runner.py \
  --config configs/r12.yaml \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json \
  --model your-model \
  --apibase http://localhost:3333
```
Samplers: Generation Parameter Control¶
Samplers define the generation parameters for LLM inference, controlling how models produce responses.
Available Samplers¶
Samplers are JSON files in samplers/ with OpenAI-compatible parameters:
```json
{
  "temperature": 0.0,
  "top_p": 1.0,
  "max_tokens": 16384
}
```
Common Sampler Families¶
| Sampler Class | Examples | Purpose |
|---|---|---|
| Greedy | greedy-2k, greedy-4k, greedy-8k, greedy-16k, greedy-max | Deterministic reproducible evaluation (greedy-16k is r12's context baseline) |
| Mistral | magistral-2k, magistral-6k, magistral-8k, magistral-max | Mistral model optimization |
| OpenAI O1 | o1-low, o1-medium, o1-high, o1-none | Reasoning effort control |
| Qwen3 | qwen3-think-2k, qwen3-think-4k, qwen3-think-max | Qwen reasoning modes |
| Seed-OSS | seedoss-0k, seedoss-2k, seedoss-4k, seedoss-6k, seedoss-unltd | Open-source thinking budget |
Sampler Parameters¶
Standard parameters:
- temperature: Sampling temperature (0.0 = greedy, 1.0 = diverse)
- top_p: Nucleus sampling probability cutoff
- max_tokens: Maximum output length
Model-specific parameters:
- reasoning_effort: For reasoning-capable models (low/medium/high)
- min_p: Minimum probability threshold for token sampling
- repetition_penalty: Penalize repeated tokens
- chat_template_kwargs: Special chat template features
- thinking: Enable/disable explicit reasoning mode
- keep_cots: Preserve Chain-of-Thought in output
- skip_special_tokens: Whether to strip special tokens from the decoded output (must be disabled for Magistral; see above)
- stop_token_ids: Token IDs that stop generation
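Putting the model-specific parameters together, a Qwen3-style thinking sampler might look like the following — the values are illustrative assumptions based on common Qwen3 sampling guidance, not the contents of the shipped qwen3-think-* files:

```json
{
  "temperature": 0.6,
  "top_p": 0.95,
  "min_p": 0.0,
  "max_tokens": 4096,
  "chat_template_kwargs": {"enable_thinking": true}
}
```

Note how chat_template_kwargs carries flags through to the model's chat template, keeping reasoning activation inside the sampler rather than the conversation-level template.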
Sampler Selection¶
Choose based on your evaluation goals:
| Goal | Recommended Samplers |
|---|---|
| Reproducible benchmarking | greedy-4k, greedy-max |
| Mistral models | magistral-6k, magistral-8k |
| OpenAI O1 models | o1-medium, o1-high |
| Qwen models | qwen3-think-4k, qwen3-think-max |
| Resource constraints | *-2k variants |
| Complex tasks | *-8k, *-max variants |
Integration with runner.py¶
Templates and samplers are specified at runtime:
```
# --config   → Stage 1: What to test
# --template → Stage 2: How to prompt
# --sampler  → Stage 2: How to generate
python runner.py \
  --config configs/r12.yaml \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json \
  --model your-model \
  --apibase http://localhost:3333
```
The combination of (model, template, sampler) creates a unique evaluation identity (eval_id).
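Conceptually, the identity is just that triple — the actual eval_id string format is defined by runner.py, so this JSON rendering is purely illustrative:

```json
{
  "model": "your-model",
  "template": "zerocot-nosys",
  "sampler": "greedy-4k"
}
```

Two runs share an eval_id only if all three components match, which is what makes results comparable across runs.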
See Also¶
- config.md - Experiment configuration (Stage 1)
- datasets.md - Result organization (Stage 2/3)
- architecture.md - Five-stage methodology