The Core Insight: LLMs as Information Processors

ReasonScape is built on the idea that the evaluation regime and the LLM form a single system, and that it is this combined system we observe. Such systems are far more than text-in, text-out. Applying this insight allows us to map capabilities and understand failure modes.

graph LR
    A[Task Definitions] --> C{Test Generator}
    B[Difficulty Parameters] --> C
    C -->|Test:text| D{Template}
    C -->|Target:text| K
    D -->|Test:prompt| E[[LLM Tokenizer]]
    E -->|Input:tokens| F((LLM Inference))
    F -->|Output:logprobs| G{Sampling}
    G -->|Output:tokens| H[[LLM Tokenizer]]
    H -->|Output:text| I{Parse}
    I -->|Thought:text| J[Analysis]
    I -->|Answer:text| K{Compare}
    K -->|Success:bool| J

    style A fill:#e1f5fe
    style B fill:#e1f5fe
    style J fill:#f3e5f5
    style K fill:#fff3e0
    style D fill:#fff3e0
    style I fill:#fff3e0
    style G fill:#fff3e0
    style C fill:#e8f5e8

The diagram corresponds to eight pipeline stages:

  1. Task Generator → Creates parametric test instances (text)
  2. Template Engine → Applies model-specific chat formatting (text → text)
  3. Tokenizer → Converts to token sequences (text → tokens)
  4. LLM Inference → Processes through learned parameters (tokens → logprobs)
  5. Sampler → Collapses distributions to concrete outputs (logprobs → tokens)
  6. Detokenizer → Converts back to text (tokens → text)
  7. Parser → Extracts reasoning and answers (text → structured data)
  8. Evaluator → Compares with targets (structured data → statistics)
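
The eight stages above can be sketched as a chain of small functions. This is a toy illustration and not ReasonScape's implementation: every name here (`Instance`, `generate`, `apply_template`, `infer`, `run_once`, and so on) is hypothetical, the "model" is a stand-in that simply computes the answer, the tokenizer maps one character to one token, and the chat template is made up. The point is only to show the type of data flowing between stages.

```python
import math
import random
import re
from dataclasses import dataclass


@dataclass
class Instance:
    text: str    # question shown to the model
    target: str  # expected answer, routed directly to the comparison step


def generate(num_terms: int, rng: random.Random) -> Instance:
    """1. Generator: num_terms is a hypothetical difficulty parameter."""
    terms = [rng.randint(1, 9) for _ in range(num_terms)]
    return Instance(" + ".join(map(str, terms)) + " = ?", str(sum(terms)))


def apply_template(text: str) -> str:
    """2. Template: wrap the test in a made-up chat format (text -> text)."""
    return f"<user>{text}</user><assistant>"


def tokenize(text: str) -> list[int]:
    """3. Tokenizer: toy scheme, one token per character (text -> tokens)."""
    return [ord(ch) for ch in text]


def detokenize(tokens: list[int]) -> str:
    """6. Detokenizer: inverse of the toy tokenizer (tokens -> text)."""
    return "".join(chr(t) for t in tokens)


def infer(tokens: list[int]) -> list[dict[int, float]]:
    """4. Inference: stand-in model, one logprob table per output position."""
    prompt = detokenize(tokens)
    total = sum(int(n) for n in re.findall(r"\d+", prompt))
    reply = f"<think>add the terms</think>{total}"
    # Each position favors the intended character over a distractor token.
    return [{ord(ch): math.log(0.9), ord("#"): math.log(0.1)} for ch in reply]


def sample(logprobs: list[dict[int, float]]) -> list[int]:
    """5. Sampler: greedy decoding, pick the most likely token per position."""
    return [max(dist, key=dist.get) for dist in logprobs]


def parse(text: str) -> tuple[str, str]:
    """7. Parser: split the output into (thought, answer)."""
    match = re.fullmatch(r"<think>(.*?)</think>(.*)", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1), match.group(2).strip()


def evaluate(answer: str, target: str) -> bool:
    """8. Evaluator: exact-match comparison against the target."""
    return answer == target


def run_once(num_terms: int, seed: int) -> dict:
    """Run one test instance through all eight stages."""
    inst = generate(num_terms, random.Random(seed))
    prompt = apply_template(inst.text)
    output = detokenize(sample(infer(tokenize(prompt))))
    thought, answer = parse(output)
    return {
        "thought": thought,
        "answer": answer,
        "target": inst.target,
        "success": evaluate(answer, inst.target),
    }


print(run_once(num_terms=4, seed=0))
```

Because each stage is a separate function, any one of them can be swapped out (a different template, a stochastic sampler, a stricter parser) while the rest of the chain stays fixed, which is what makes it possible to attribute a failure to a specific stage.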

We decompose this system with a flexible implementation that provides full control over the entire pipeline from input generation to final evaluation, and offers forensic analysis tools to help understand and isolate failure sources:

  • Spectral Analysis (FFT): Understand how tokenization affects problem representation
  • Compression Analysis: Measure information quality in reasoning traces
  • Hazard Analysis: Track when and how reasoning degrades over time
  • Surface Analysis: Map capability boundaries in parameter space
  • Capacity Analysis: Map sensitivity to context limits and attention decay
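
As one illustration of the kind of measurement these tools make, the sketch below estimates how redundant a reasoning trace is by compressing it. It is an assumption-laden stand-in, not ReasonScape's actual compression metric: the helper name `compression_ratio` is hypothetical, and the choice of `zlib` as the compressor is ours. A trace that loops over the same phrases compresses to a small fraction of its size, while varied text compresses far less.

```python
import zlib


def compression_ratio(trace: str) -> float:
    """Compressed size divided by raw size, in bytes (lower = more redundant).

    Returns 1.0 for an empty trace by convention. Very short inputs can
    exceed 1.0 because of zlib's fixed header overhead.
    """
    raw = trace.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical traces: one stuck in a loop, one making varied progress.
looping = "Wait, let me check that again. " * 40
varied = (
    "First add 3 and 4 to get 7. Then multiply by 2, giving 14. "
    "Subtracting 5 leaves 9, which is odd, so the parity check fails "
    "and the final answer must be rejected in favor of option B."
)

print(f"looping: {compression_ratio(looping):.3f}")
print(f"varied:  {compression_ratio(varied):.3f}")
```

A ratio like this is only a rough proxy: it detects literal repetition but says nothing about whether the non-repetitive content is correct or useful.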

See Also