Templates and Samplers: Stage 2 Execution Configuration

Prerequisites: Before reading this document, familiarize yourself with:

- architecture.md - The five-stage methodology
- config.md - Experiment configuration (Stage 1)

Overview

Templates and samplers are Stage 2: Execution components that control how models are prompted and how they generate responses. While experiment configs (Stage 1) define WHAT to test, templates and samplers define HOW to execute those tests.

Location:

- Templates: templates/*.json
- Samplers: samplers/*.json

Used by: runner.py during evaluation execution


Templates: Prompting Strategies

Templates define how test cases are transformed into conversations — the logical system/user/assistant turn structure that the runner submits to the model. This is distinct from the model's chat template (a tokenizer-level concern), which is the subsequent stage that converts the conversation structure into raw text with special delimiter tokens before tokenization.

Templates fall into two categories: universal templates that work with any model, and model-specific templates that inject proprietary system prompts required to engage a particular model's reasoning mode.

Universal Templates

| Template | System Prompt | Format | Use Case |
| --- | --- | --- | --- |
| zeroshot | Task description | Direct answer | Baseline; instruction-tuned models |
| zeroshot-nosys | None | Direct answer | System-prompt ablation |
| zerocot-nosys | None | Step-by-step | Default for reasoning models |

zerocot-nosys is the standard template for reason-tuned models. It makes no assumptions about how the model should reason — the model performs its own expansion and reduction.
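For illustration, a zerocot-nosys-style template might look like the sketch below. The field names are assumptions — this document does not show the actual schema of templates/*.json:

```json
{
  "system": null,
  "user": "{question}\n\nThink step by step, then state your final answer."
}
```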

Model-Specific Templates

These templates inject a system prompt that activates a vendor-specific reasoning mode. Use only with the indicated model family.

Note: These are a kludge. Reasoning activation is a generation concern and belongs in the sampler — exposed as reasoning_effort, with the mechanics handled inside the model's chat template. Well-behaved models (gpt-oss, solar, mistral4) do exactly this. The model-specific templates here exist only for models whose serving stacks do not expose reasoning_effort, requiring reasoning activation to be smuggled in via system prompt at the conversation level instead.
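By contrast, for a well-behaved model the same activation would live entirely in the sampler. A hedged sketch of what such a sampler file could contain (illustrative values, not an actual file from samplers/):

```json
{
  "temperature": 0.0,
  "top_p": 1.0,
  "max_tokens": 16384,
  "reasoning_effort": "high"
}
```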

| Template | Model Family | System Prompt | Format |
| --- | --- | --- | --- |
| llama-nemotron | Llama Nemotron (70B+) | "detailed thinking on" | Direct answer |
| llama-nemotron-8b | Llama Nemotron 8B | "detailed thinking on" | Step-by-step |
| magistral | Magistral | [THINK]/[/THINK] format | Direct answer |

magistral is a double-kludge: it requires both this template (reasoning activation, since reasoning_effort is unsupported) and samplers/magistral-max (which disables vLLM's special token suppression). Magistral marks thinking tokens as special tokens, and vLLM strips them by default — removing the delimiters and making the reasoning trace unparseable. Both files are required together.
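A hedged sketch of what samplers/magistral-max could contain — the key point is disabling special-token suppression so the [THINK]/[/THINK] delimiters survive into the output (values are illustrative, not the real file):

```json
{
  "temperature": 0.0,
  "max_tokens": 32768,
  "skip_special_tokens": false
}
```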

Usage

python runner.py \
    --config configs/r12.yaml \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json \
    --model your-model \
    --apibase http://localhost:3333

Samplers: Generation Parameter Control

Samplers define the generation parameters for LLM inference, controlling how models produce responses.

Available Samplers

Samplers are JSON files in samplers/ with OpenAI-compatible parameters:

{
  "temperature": 0.0,
  "top_p": 1.0,
  "max_tokens": 16384
}
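The sampler file's contents are forwarded as generation parameters in the chat-completions request. A minimal sketch of how the merge might work — the function names and request shape here are illustrative, not the actual runner.py code:

```python
import json

def load_sampler(path):
    """Read a sampler JSON file and return its parameters as a dict."""
    with open(path) as f:
        return json.load(f)

def build_request(model, messages, sampler_params):
    """Merge sampler parameters into an OpenAI-compatible request body."""
    body = {"model": model, "messages": messages}
    body.update(sampler_params)
    return body

# Example: a greedy sampler merged into a request
sampler = {"temperature": 0.0, "top_p": 1.0, "max_tokens": 16384}
request = build_request("your-model",
                        [{"role": "user", "content": "2+2?"}],
                        sampler)
```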

Common Sampler Families

| Sampler Class | Examples | Purpose |
| --- | --- | --- |
| Greedy | greedy-2k, greedy-4k, greedy-8k, greedy-16k, greedy-max | Deterministic reproducible evaluation (greedy-16k is r12's context baseline) |
| Mistral | magistral-2k, magistral-6k, magistral-8k, magistral-max | Mistral model optimization |
| OpenAI O1 | o1-low, o1-medium, o1-high, o1-none | Reasoning effort control |
| Qwen3 | qwen3-think-2k, qwen3-think-4k, qwen3-think-max | Qwen reasoning modes |
| Seed-OSS | seedoss-0k, seedoss-2k, seedoss-4k, seedoss-6k, seedoss-unltd | Open-source thinking budget |

Sampler Parameters

Standard parameters:

- temperature: Sampling temperature (0.0 = greedy, 1.0 = diverse)
- top_p: Nucleus sampling probability cutoff
- max_tokens: Maximum output length

Model-specific parameters:

- reasoning_effort: For reasoning-capable models (low/medium/high)
- min_p: Minimum probability threshold for token sampling
- repetition_penalty: Penalize repeated tokens
- chat_template_kwargs: Special chat template features
- thinking: Enable/disable explicit reasoning mode
- keep_cots: Preserve Chain-of-Thought in output
- skip_special_tokens: Whether to suppress special tokens
- stop_token_ids: Token IDs that stop generation
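OpenAI-compatible clients typically accept only the standard parameters as direct keyword arguments; vendor-specific keys usually have to travel in an extra_body payload for the serving stack (e.g. vLLM). A hedged sketch of that split — the set of "standard" keys is an assumption about the client, not something this document specifies:

```python
# Keys the OpenAI chat-completions API accepts natively (assumed set);
# everything else is routed to the serving stack via extra_body.
STANDARD_KEYS = {"temperature", "top_p", "max_tokens", "stop", "reasoning_effort"}

def split_params(sampler_params):
    """Split sampler parameters into (standard kwargs, extra_body)."""
    standard = {k: v for k, v in sampler_params.items() if k in STANDARD_KEYS}
    extra = {k: v for k, v in sampler_params.items() if k not in STANDARD_KEYS}
    return standard, extra

std, extra = split_params({
    "temperature": 0.0,
    "max_tokens": 8192,
    "skip_special_tokens": False,
    "chat_template_kwargs": {"thinking": True},
})
# std carries the portable parameters; extra carries the vLLM-specific ones.
```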

Sampler Selection

Choose based on your evaluation goals:

| Goal | Recommended Samplers |
| --- | --- |
| Reproducible benchmarking | greedy-4k, greedy-max |
| Mistral models | magistral-6k, magistral-8k |
| OpenAI O1 models | o1-medium, o1-high |
| Qwen models | qwen3-think-4k, qwen3-think-max |
| Resource constraints | *-2k variants |
| Complex tasks | *-8k, *-max variants |


Integration with runner.py

Templates and samplers are specified at runtime:

python runner.py \
  --config configs/r12.yaml \      # Stage 1: What to test
  --template templates/zerocot-nosys.json \  # Stage 2: How to prompt
  --sampler samplers/greedy-4k.json \        # Stage 2: How to generate
  --model your-model \
  --apibase http://localhost:3333

The combination of (model, template, sampler) creates a unique evaluation identity (eval_id).
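One plausible way to derive such an identity from the three inputs — hypothetical sketch; runner.py's actual eval_id scheme may differ:

```python
from pathlib import Path

def make_eval_id(model, template_path, sampler_path):
    """Combine model, template, and sampler names into one identifier.

    Hypothetical sketch: joins the model name with the template and
    sampler file stems. The real runner.py may use another scheme.
    """
    template = Path(template_path).stem   # e.g. "zerocot-nosys"
    sampler = Path(sampler_path).stem     # e.g. "greedy-4k"
    return f"{model}__{template}__{sampler}"

eval_id = make_eval_id("your-model",
                       "templates/zerocot-nosys.json",
                       "samplers/greedy-4k.json")
# eval_id == "your-model__zerocot-nosys__greedy-4k"
```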

See Also