Templates and Samplers: Stage 2 Execution Configuration¶
Prerequisites: Before reading this document, familiarize yourself with:

- architecture.md - The five-stage methodology
- config.md - Experiment configuration (Stage 1)
Overview¶
Templates and samplers are Stage 2: Execution components that control how models are prompted and how they generate responses. While experiment configs (Stage 1) define WHAT to test, templates and samplers define HOW to execute those tests.
Location:
- Templates: templates/*.json
- Samplers: samplers/*.json
Used by: runner.py during evaluation execution
Templates: Prompting Strategies¶
Templates define how test cases are transformed into conversations — the logical system/user/assistant turn structure that the runner submits to the model. This is distinct from the model's chat template (a tokenizer-level concern), which is the subsequent stage that converts the conversation structure into raw text with special delimiter tokens before tokenization.
Templates fall into two categories: universal templates that work with any model, and model-specific templates that inject proprietary system prompts required to engage a particular model's reasoning mode.
Universal Templates¶
| Template | System Prompt | Format | Use Case |
|---|---|---|---|
| zeroshot | Task description | Direct answer | Baseline; instruction-tuned models |
| zeroshot-nosys | None | Direct answer | System-prompt ablation |
| zerocot-nosys | None | Step-by-step | Default for reasoning models |
zerocot-nosys is the standard template for reason-tuned models. It makes no assumptions about how the model should reason — the model performs its own expansion and reduction.
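Template files are plain JSON. As a rough sketch of what a universal template like zerocot-nosys could contain — the actual schema is defined by runner.py, and the field names here are illustrative assumptions, not the shipped format:

```json
{
  "system": null,
  "user": "{question}\n\nThink step by step, then give your final answer.",
  "format": "step-by-step"
}
```

The key properties are visible even in this sketch: no system prompt, and no constraints on how the model structures its reasoning.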
Model-Specific Templates¶
These templates inject a system prompt that activates a vendor-specific reasoning mode. Use only with the indicated model family.
Note: These are a kludge. Reasoning activation is a generation concern and belongs in the sampler — exposed as reasoning_effort, with the mechanics handled inside the model's chat template. Well-behaved models (gpt-oss, solar, mistral4) do exactly this. The model-specific templates here exist only for models whose serving stacks do not expose reasoning_effort, requiring reasoning activation to be smuggled in via the system prompt at the conversation level instead.
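In the well-behaved case, reasoning activation stays in the sampler file rather than the template. A sketch of such a sampler (the values are illustrative; reasoning_effort follows the OpenAI-compatible parameter convention described under Sampler Parameters below):

```json
{
  "temperature": 0.0,
  "max_tokens": 16384,
  "reasoning_effort": "high"
}
```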
| Template | Model Family | System Prompt | Format |
|---|---|---|---|
| llama-nemotron | Llama Nemotron (70B+) | "detailed thinking on" | Direct answer |
| llama-nemotron-8b | Llama Nemotron 8B | "detailed thinking on" | Step-by-step |
| magistral | Magistral | [THINK]/[/THINK] format | Direct answer |
magistral is a double-kludge: it requires both this template (reasoning activation, since reasoning_effort is unsupported) and samplers/magistral-max (which disables vLLM's special token suppression). Magistral marks thinking tokens as special tokens, and vLLM strips them by default — removing the delimiters and making the reasoning trace unparseable. Both files are required together.
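A sketch of what the samplers/magistral-max half of the pair needs to contain (temperature and token limit are illustrative assumptions; the essential line is skip_special_tokens, which stops vLLM from stripping the [THINK]/[/THINK] delimiters):

```json
{
  "temperature": 0.0,
  "max_tokens": 32768,
  "skip_special_tokens": false
}
```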
Usage¶
```
python runner.py \
  --config configs/r12.yaml \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json \
  --model your-model \
  --apibase http://localhost:3333
```
Samplers: Generation Parameter Control¶
Samplers define the generation parameters for LLM inference, controlling how models produce responses.
Available Samplers¶
Samplers are JSON files in samplers/ with OpenAI-compatible parameters:
```json
{
  "temperature": 0.0,
  "top_p": 1.0,
  "max_tokens": 16384
}
```
Common Sampler Families¶
| Sampler Class | Examples | Purpose |
|---|---|---|
| Greedy | greedy-2k, greedy-4k, greedy-8k, greedy-16k, greedy-max | Deterministic reproducible evaluation (greedy-16k is r12's context baseline) |
| Mistral | magistral-2k, magistral-6k, magistral-8k, magistral-max | Mistral model optimization |
| OpenAI O1 | o1-low, o1-medium, o1-high, o1-none | Reasoning effort control |
| Qwen3 | qwen3-think-2k, qwen3-think-4k, qwen3-think-max | Qwen reasoning modes |
| Seed-OSS | seedoss-0k, seedoss-2k, seedoss-4k, seedoss-6k, seedoss-unltd | Open-source thinking budget |
Sampler Parameters¶
Standard parameters:
- temperature: Sampling temperature (0.0 = greedy, 1.0 = diverse)
- top_p: Nucleus sampling probability cutoff
- max_tokens: Maximum output length
Model-specific parameters:
- reasoning_effort: For reasoning-capable models (low/medium/high)
- min_p: Minimum probability threshold for token sampling
- repetition_penalty: Penalize repeated tokens
- chat_template_kwargs: Special chat template features
- thinking: Enable/disable explicit reasoning mode
- keep_cots: Preserve Chain-of-Thought in output
- skip_special_tokens: Whether to strip special tokens from the decoded output (must be disabled for Magistral; see above)
- stop_token_ids: Token IDs that stop generation
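Putting the model-specific parameters together, a Qwen3-style thinking sampler might look like the following — the values are illustrative assumptions based on common Qwen3 sampling guidance, not the contents of the shipped qwen3-think-* files:

```json
{
  "temperature": 0.6,
  "top_p": 0.95,
  "min_p": 0.0,
  "max_tokens": 4096,
  "chat_template_kwargs": {"enable_thinking": true}
}
```

Note how chat_template_kwargs carries flags through to the model's chat template, keeping reasoning activation inside the sampler rather than the conversation-level template.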
Sampler Selection¶
Choose based on your evaluation goals:
| Goal | Recommended Samplers |
|---|---|
| Reproducible benchmarking | greedy-4k, greedy-max |
| Mistral models | magistral-6k, magistral-8k |
| OpenAI O1 models | o1-medium, o1-high |
| Qwen models | qwen3-think-4k, qwen3-think-max |
| Resource constraints | *-2k variants |
| Complex tasks | *-8k, *-max variants |
Integration with runner.py¶
Templates and samplers are specified at runtime:
```
# --config   → Stage 1: What to test
# --template → Stage 2: How to prompt
# --sampler  → Stage 2: How to generate
python runner.py \
  --config configs/r12.yaml \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json \
  --model your-model \
  --apibase http://localhost:3333
```
The combination of (model, template, sampler) creates a unique evaluation identity (eval_id).
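Conceptually, the identity is just that triple — the actual eval_id string format is defined by runner.py, so this JSON rendering is purely illustrative:

```json
{
  "model": "your-model",
  "template": "zerocot-nosys",
  "sampler": "greedy-4k"
}
```

Two runs share an eval_id only if all three components match, which is what makes results comparable across runs.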
See Also¶
- config.md - Experiment configuration (Stage 1)
- datasets.md - Result organization (Stage 2/3)
- architecture.md - Five-stage methodology