ReasonScape Evaluator (evaluate.py)

evaluate.py groups the data in results/ into point and aggregate buckets, performs statistical processing on each bucket, and writes the output to a bucket.json file.

usage: evaluate.py [-h] --interview INTERVIEW [--output OUTPUT] [--offline] [--histogram SIZE COUNT] [--tokenizer TOKENIZER] [--fftsamples FFTSAMPLES] [--precision PRECISION]

Evaluate LLM interview results

options:
  -h, --help            show this help message and exit
  --interview INTERVIEW
                        Path, glob pattern, or comma-separated list of NDJSON files
  --output OUTPUT       Write evaluation results to JSON file
  --offline             Enable answer re-processing
  --histogram SIZE COUNT
                        Create token histograms with bucket SIZE and COUNT
  --tokenizer TOKENIZER
                        Tokenizer name for FFT analysis
  --fftsamples FFTSAMPLES
                        Number of samples to use for FFT calculation (default: 128)
  --precision PRECISION
                        Set precision field for samples that do not have it

Use --output to specify the bucket.json filename.
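The --interview argument accepts a single path, a glob pattern, or a comma-separated list of either. A minimal sketch of how such a specification could be expanded (the helper name and exact behavior are illustrative assumptions, not the tool's actual code):

```python
import glob

def expand_interview(spec):
    """Expand an --interview value into a list of NDJSON file paths.

    Accepts a single path, a glob pattern, or a comma-separated
    list of either (hypothetical helper for illustration).
    """
    paths = []
    for part in spec.split(","):
        part = part.strip()
        if any(ch in part for ch in "*?["):
            paths.extend(sorted(glob.glob(part)))  # expand glob patterns
        else:
            paths.append(part)                     # keep literal paths
    return paths
```

For example, `expand_interview("run1.ndjson,results/*.ndjson")` would keep the literal path and expand the glob.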

Bucket Types

The evaluator creates several types of buckets for different levels of aggregation:

  1. Point Buckets (btype: "point") - The most granular level, keyed on a specific combination of:
     • Model
     • Template
     • Parameter name (param_name)
     • Density
     • Precision
     • Degree
     • Base task
     • Task

  2. Scenario Buckets (btype: "scenario") - Aggregates performance across all tasks for a specific scenario:
     • Model
     • Template
     • Parameter name
     • Density
     • Precision
     • Degree
     • Base task = *
     • Task = *

  3. Scenario Base Task Buckets (btype: "scenario_base_task") - Aggregates performance for a specific base task within a scenario:
     • Model
     • Template
     • Parameter name
     • Density
     • Precision
     • Degree
     • Base task (specific)
     • Task = *
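Judging from the example keys later in this document, aggregate bucket keys appear to be these eight dimensions joined with `+`, with `*` marking an aggregated dimension and the literal string `null` marking an unset one. A sketch of that composition (an inference from the examples, not confirmed source code):

```python
def bucket_key(model, template, param_name, density="null",
               precision="null", degree="null", base_task="*", task="*"):
    """Compose an aggregate bucket key by joining the eight dimensions
    with '+' (key format inferred from the documented examples)."""
    return "+".join([model, template, param_name, density,
                     precision, degree, base_task, task])
```

With defaults this yields a scenario key; passing a specific base_task yields a scenario_base_task key.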

Key Output Fields

  • btype: Type of bucket (point, scenario, scenario_base_task)
  • bcount: Number of constituent splits (for aggregated buckets)
  • correct: Number of correct answers
  • invalid: Number of invalid responses
  • invalid_ratio: Ratio of invalid responses to total
  • total: Total number of non-truncated samples
  • truncated: Number of truncated responses
  • truncated_ratio: Ratio of truncated responses to all samples (truncated + non-truncated)
  • hard_terminated: Number of hard-terminated responses
  • adjusted_accuracy: Accuracy adjusted for guessing
  • adjusted_successes: Correct answers minus expected guessing successes
  • adjusted_trials: Total trials minus expected guessing trials
  • adjusted_center: Center of Wilson confidence interval
  • adjusted_margin: Margin of Wilson confidence interval
  • completion_tokens_mean: Average completion tokens per response
  • completion_tokens_correct_mean: Average completion tokens for correct responses
  • completion_tokens_incorrect_mean: Average completion tokens for incorrect responses
  • prompt_tokens_mean: Average prompt tokens per request
  • total_tokens: Total token usage
  • params: Task-specific parameters
  • histogram: Token usage histograms (when enabled)
  • fft: Frequency analysis results (when enabled)
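Several of these fields are simple derivations of the raw counts. Note that total counts only non-truncated samples, so the truncated ratio is taken over all samples. A sketch of the relationships, which can be checked against the point bucket example below (the function itself is illustrative):

```python
def derived_fields(bucket):
    """Recompute ratio/accuracy fields from a bucket's raw counts.

    'total' excludes truncated samples, so truncated_ratio is taken
    over total + truncated (illustrative reconstruction).
    """
    all_samples = bucket["total"] + bucket["truncated"]
    return {
        "invalid_ratio": bucket["invalid"] / bucket["total"],
        "truncated_ratio": bucket["truncated"] / all_samples,
        "adjusted_accuracy": bucket["adjusted_successes"] / bucket["adjusted_trials"],
    }
```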

Bucket Type Examples

Point Bucket Example

{
  "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k+movies+003_movies_choice_count-12_reference_count-3": {
      "model": "Phi-4-mini-instruct-fp16",
      "template": "zerocot-nosys",
      "param_name": "greedy-4k",
      "scenario": "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k",
      "base_task": "movies",
      "task": "003_movies_choice_count-12_reference_count-3",
      "btype": "point",
      "correct": 337,
      "invalid": 6,
      "invalid_ratio": 0.006756756756756757,
      "total": 888,
      "truncated": 8,
      "truncated_ratio": 0.008928571428571428,
      "hard_terminated": 0,
      "params": {
        "reference_count": 3,
        "choice_count": 12,
        "count": 128
      },
      "adjusted_accuracy": 0.3230958230958231,
      "adjusted_successes": 263.0,
      "adjusted_trials": 814.0,
      "adjusted_center": 0.3239267848444002,
      "adjusted_margin": 0.03206244563179326,
      "completion_tokens_mean": 388.35247747747746,
      "completion_tokens_correct_mean": 377.8991097922849,
      "completion_tokens_incorrect_mean": 394.7459165154265,
      "prompt_tokens_mean": 132.36936936936937,
      "total_tokens": 376857,
      "histogram": {
        "correct": {
          "0": 0.0,
          "50": 0.0,
          "100": 0.0,
          "150": 1.1869436201780414,
          "200": 2.373887240356083,
          "250": 10.385756676557863,
          ...
          "1250": 0.0,
          "1300": 0.0,
          "1350": 0.0,
          "1400": 0.0,
          "1450": 0.0
        },
        "incorrect": {
          "0": 0.0,
          "50": 0.17889087656529518,
          "100": 1.4311270125223614,
          "150": 1.9677996422182469,
          "200": 1.9677996422182469,
          "250": 6.797853309481217,
          ...
          "1250": 0.0,
          "1300": 0.0,
          "1350": 0.0,
          "1400": 0.0,
          "1450": 1.6100178890876566
        }
      },
      "fft": {
        "avg_spectrum": [
          29.47362002266355,
          27.00455019270718,
          ...
          25.047122306471827,
          25.047146221892223
        ],
        "std_spectrum": [
          0.2159618862893598,
          0.6655491629616215,
          ...
          1.1467863230041124,
          1.1467736389641212
        ],
        "frequencies": [
          0.0,
          0.00641025641025641,
          ...
          0.49358974358974356,
          0.5
        ]
      }
  }
}

Scenario Bucket Example

{
  "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k+null+null+null+*+*": {
    "model": "Phi-4-mini-instruct-fp16",
    "template": "zerocot-nosys",
    "param_name": "greedy-4k",
    "density": "null",
    "precision": "null",
    "degree": "null",
    "scenario": "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k/null+null+null",
    "base_task": "*",
    "task": "*",
    "btype": "scenario",
    "bcount": 12,
    "correct": 2456,
    "invalid": 42,
    "invalid_ratio": 0.004123,
    "total": 6245,
    "truncated": 67,
    "truncated_ratio": 0.0106,
    "hard_terminated": 0,
    "params": {},
    "adjusted_accuracy": 0.4123456789,
    "adjusted_successes": 1876.0,
    "adjusted_trials": 5890.0,
    "adjusted_center": 0.4156789012,
    "adjusted_margin": 0.0234567890,
    "completion_tokens_mean": 412.567890,
    "completion_tokens_correct_mean": 398.123456,
    "completion_tokens_incorrect_mean": 428.987654,
    "prompt_tokens_mean": 145.678901,
    "total_tokens": 2876543
  }
}

Scenario Base Task Bucket Example

{
  "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k+null+null+null+movies+*": {
    "model": "Phi-4-mini-instruct-fp16",
    "template": "zerocot-nosys",
    "param_name": "greedy-4k",
    "density": "null",
    "precision": "null",
    "degree": "null",
    "scenario": "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k/null+null+null",
    "base_task": "movies",
    "task": "*",
    "btype": "scenario_base_task",
    "bcount": 5,
    "correct": 1234,
    "invalid": 18,
    "invalid_ratio": 0.005678,
    "total": 2876,
    "truncated": 23,
    "truncated_ratio": 0.008765,
    "hard_terminated": 0,
    "params": {},
    "adjusted_accuracy": 0.4567890123,
    "adjusted_successes": 987.0,
    "adjusted_trials": 2567.0,
    "adjusted_center": 0.4598765432,
    "adjusted_margin": 0.0187654321,
    "completion_tokens_mean": 398.765432,
    "completion_tokens_correct_mean": 387.654321,
    "completion_tokens_incorrect_mean": 412.345678,
    "prompt_tokens_mean": 134.567890,
    "total_tokens": 1234567
  }
}
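Downstream consumers can load bucket.json and filter on btype. A minimal sketch (field names are taken from the examples above; the path is whatever was passed to --output):

```python
import json

def load_buckets(path, btype=None):
    """Load a bucket.json file, optionally keeping only buckets
    of a single btype (point, scenario, scenario_base_task)."""
    with open(path) as f:
        buckets = json.load(f)
    if btype is None:
        return buckets
    return {key: b for key, b in buckets.items() if b.get("btype") == btype}
```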

Statistical Processing

The evaluator performs several statistical computations on the results:

  • Guessing Adjustment: Subtracts the expected number of chance successes from the raw counts, yielding adjusted_successes and adjusted_trials
  • Wilson Confidence Interval: Calculates 95% confidence intervals for adjusted accuracy scores
  • Token Statistics: Computes mean completion tokens for correct and incorrect responses
  • Validity Checking: Validates extracted answers against task-specific response enums
  • Truncation Analysis: Identifies and reports truncated responses
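The adjusted_center and adjusted_margin values in the examples are consistent with a standard 95% Wilson score interval (z = 1.96) computed over adjusted_successes and adjusted_trials. A sketch of that computation (reconstructed from the output values, not taken from the tool's source):

```python
def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval: returns (center, margin) such that the
    confidence interval is center +/- margin."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    margin = z * ((p * (1 - p) / trials
                   + z * z / (4 * trials * trials)) ** 0.5) / denom
    return center, margin
```

Applied to the point bucket example (263 adjusted successes over 814 adjusted trials), this reproduces the documented center and margin.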

Histogram Analysis

When the --histogram SIZE COUNT option is provided, the evaluator generates token usage histograms:

  • Bucketing: Divides completion tokens into buckets of the specified size
  • Separate Tracking: Maintains separate histograms for correct and incorrect responses
  • Percentage Output: Reports histogram values as percentages
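A sketch of the bucketing step, assuming fixed-width buckets keyed by their lower edge (as in the "0", "50", "100" keys of the point bucket example) and values reported as a percentage of the group's responses (illustrative, not the tool's actual code):

```python
def token_histogram(token_counts, size, count):
    """Bucket completion-token counts into `count` buckets of width
    `size`, keyed by lower edge, with values as percentages."""
    hist = {i * size: 0 for i in range(count)}
    for tokens in token_counts:
        # clamp overflow into the last bucket (an assumption)
        edge = min((tokens // size) * size, (count - 1) * size)
        hist[edge] += 1
    n = len(token_counts) or 1
    return {edge: 100.0 * c / n for edge, c in hist.items()}
```

In the bucket.json output, this would be run separately over the correct and incorrect response groups.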

FFT Analysis

When the --tokenizer TOKENIZER option is provided, the evaluator performs frequency analysis:

  • Tokenization: Uses the specified tokenizer to convert responses to token sequences
  • Frequency Domain Conversion: Applies FFT to analyze token sequence patterns
  • Spectrum Analysis: Computes the average and standard deviation of the frequency spectra
  • Sampling Control: Uses --fftsamples to limit the number of samples analyzed (default: 128)
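A sketch of the spectrum step using NumPy, assuming each response's token-ID sequence is truncated or zero-padded to a common length before a real FFT, and that magnitude spectra are averaged across responses (the padding scheme and magnitude units are assumptions):

```python
import numpy as np

def token_spectra(token_sequences, length=128):
    """Compute magnitude spectra of token-ID sequences, then the
    average and standard deviation spectrum across responses."""
    spectra = []
    for seq in token_sequences:
        x = np.zeros(length)
        trimmed = np.asarray(seq[:length], dtype=float)
        x[:len(trimmed)] = trimmed           # zero-pad / truncate (assumed)
        spectra.append(np.abs(np.fft.rfft(x)))
    spectra = np.stack(spectra)
    freqs = np.fft.rfftfreq(length)          # 0.0 .. 0.5 cycles per token
    return spectra.mean(axis=0), spectra.std(axis=0), freqs
```

The frequencies array spans 0.0 to 0.5, matching the "frequencies" field in the point bucket example.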