# ReasonScape Evaluator (evaluate.py)
`evaluate.py` groups the data in `results/` into point and aggregation buckets, performs statistical processing on each bucket, and writes the results to a `bucket.json` file.
```
usage: evaluate.py [-h] --interview INTERVIEW [--output OUTPUT] [--offline]
                   [--histogram SIZE COUNT] [--tokenizer TOKENIZER]
                   [--fftsamples FFTSAMPLES] [--precision PRECISION]

Evaluate LLM interview results

options:
  -h, --help            show this help message and exit
  --interview INTERVIEW
                        Path, glob pattern, or comma-separated list of NDJSON files
  --output OUTPUT       Write evaluation results to JSON file
  --offline             Enable answer re-processing
  --histogram SIZE COUNT
                        Create token histograms with bucket SIZE and COUNT
  --tokenizer TOKENIZER
                        Tokenizer name for FFT analysis
  --fftsamples FFTSAMPLES
                        Number of samples to use for FFT calculation (default: 128)
  --precision PRECISION
                        Set precision field for samples that do not have it
```
Use `--output` to specify the `bucket.json` filename.
## Bucket Types
The evaluator creates several types of buckets at different levels of aggregation:

- **Point Buckets** (`btype: "point"`): The most granular level, keyed by a specific combination of:
    - Model
    - Template
    - Parameter name (`param_name`)
    - Density
    - Precision
    - Degree
    - Base task
    - Task
- **Scenario Buckets** (`btype: "scenario"`): Aggregate performance across all tasks for a specific scenario:
    - Model
    - Template
    - Parameter name
    - Density
    - Precision
    - Degree
    - Base task = `*`
    - Task = `*`
- **Scenario Base Task Buckets** (`btype: "scenario_base_task"`): Aggregate performance for a specific base task within a scenario:
    - Model
    - Template
    - Parameter name
    - Density
    - Precision
    - Degree
    - Base task (specific)
    - Task = `*`
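Scenario-style bucket keys join these dimensions with `+`, using `*` for aggregated dimensions and `null` for unset ones. As a sketch, a hypothetical helper `scenario_key` (not part of `evaluate.py`) reproduces the key format seen in the aggregated examples below; note that the point-bucket example uses a shorter `model+template+param_name+base_task+task` form:

```python
def scenario_key(model, template, param_name, density="null", precision="null",
                 degree="null", base_task="*", task="*"):
    # Join dimensions in the order listed above; "*" marks an aggregated
    # dimension, "null" an unset one (format inferred from the examples).
    return "+".join([model, template, param_name, density, precision,
                     degree, base_task, task])

# e.g. the scenario_base_task key for the "movies" base task:
key = scenario_key("Phi-4-mini-instruct-fp16", "zerocot-nosys", "greedy-4k",
                   base_task="movies")
```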
## Key Output Fields
- `btype`: Type of bucket (`point`, `scenario`, `scenario_base_task`)
- `bcount`: Number of constituent splits (for aggregated buckets)
- `correct`: Number of correct answers
- `invalid`: Number of invalid responses
- `invalid_ratio`: Ratio of invalid responses to total
- `total`: Total number of non-truncated samples
- `truncated`: Number of truncated responses
- `truncated_ratio`: Ratio of truncated responses to total
- `hard_terminated`: Number of hard-terminated responses
- `adjusted_accuracy`: Accuracy adjusted for guessing
- `adjusted_successes`: Correct answers minus expected guessing successes
- `adjusted_trials`: Total trials minus expected guessing trials
- `adjusted_center`: Center of the Wilson confidence interval
- `adjusted_margin`: Margin of the Wilson confidence interval
- `completion_tokens_mean`: Average completion tokens per response
- `completion_tokens_correct_mean`: Average completion tokens for correct responses
- `completion_tokens_incorrect_mean`: Average completion tokens for incorrect responses
- `prompt_tokens_mean`: Average prompt tokens per request
- `total_tokens`: Total token usage
- `params`: Task-specific parameters
- `histogram`: Token usage histograms (when enabled)
- `fft`: Frequency analysis results (when enabled)
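The example values below suggest, as an inference rather than documented behavior, that `total` excludes truncated samples, so `invalid_ratio = invalid / total` while `truncated_ratio = truncated / (total + truncated)`. In the point-bucket example, 6 / 888 ≈ 0.006757 and 8 / (888 + 8) ≈ 0.008929:

```python
def derived_ratios(invalid, truncated, total):
    # Ratio denominators as inferred from the point-bucket example:
    # `total` counts only non-truncated samples.
    return {
        "invalid_ratio": invalid / total,
        "truncated_ratio": truncated / (total + truncated),
    }

# Counts from the point-bucket example below.
ratios = derived_ratios(invalid=6, truncated=8, total=888)
```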
## Bucket Type Examples
### Point Bucket Example
```json
{
  "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k+movies+003_movies_choice_count-12_reference_count-3": {
    "model": "Phi-4-mini-instruct-fp16",
    "template": "zerocot-nosys",
    "param_name": "greedy-4k",
    "scenario": "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k",
    "base_task": "movies",
    "task": "003_movies_choice_count-12_reference_count-3",
    "btype": "point",
    "correct": 337,
    "invalid": 6,
    "invalid_ratio": 0.006756756756756757,
    "total": 888,
    "truncated": 8,
    "truncated_ratio": 0.008928571428571428,
    "hard_terminated": 0,
    "params": {
      "reference_count": 3,
      "choice_count": 12,
      "count": 128
    },
    "adjusted_accuracy": 0.3230958230958231,
    "adjusted_successes": 263.0,
    "adjusted_trials": 814.0,
    "adjusted_center": 0.3239267848444002,
    "adjusted_margin": 0.03206244563179326,
    "completion_tokens_mean": 388.35247747747746,
    "completion_tokens_correct_mean": 377.8991097922849,
    "completion_tokens_incorrect_mean": 394.7459165154265,
    "prompt_tokens_mean": 132.36936936936937,
    "total_tokens": 376857,
    "histogram": {
      "correct": {
        "0": 0.0,
        "50": 0.0,
        "100": 0.0,
        "150": 1.1869436201780414,
        "200": 2.373887240356083,
        "250": 10.385756676557863,
        ...
        "1250": 0.0,
        "1300": 0.0,
        "1350": 0.0,
        "1400": 0.0,
        "1450": 0.0
      },
      "incorrect": {
        "0": 0.0,
        "50": 0.17889087656529518,
        "100": 1.4311270125223614,
        "150": 1.9677996422182469,
        "200": 1.9677996422182469,
        "250": 6.797853309481217,
        ...
        "1250": 0.0,
        "1300": 0.0,
        "1350": 0.0,
        "1400": 0.0,
        "1450": 1.6100178890876566
      }
    },
    "fft": {
      "avg_spectrum": [
        29.47362002266355,
        27.00455019270718,
        ...
        25.047122306471827,
        25.047146221892223
      ],
      "std_spectrum": [
        0.2159618862893598,
        0.6655491629616215,
        ...
        1.1467863230041124,
        1.1467736389641212
      ],
      "frequencies": [
        0.0,
        0.00641025641025641,
        ...
        0.49358974358974356,
        0.5
      ]
    }
  }
}
```
### Scenario Bucket Example
```json
{
  "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k+null+null+null+*+*": {
    "model": "Phi-4-mini-instruct-fp16",
    "template": "zerocot-nosys",
    "param_name": "greedy-4k",
    "density": "null",
    "precision": "null",
    "degree": "null",
    "scenario": "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k/null+null+null",
    "base_task": "*",
    "task": "*",
    "btype": "scenario",
    "bcount": 12,
    "correct": 2456,
    "invalid": 42,
    "invalid_ratio": 0.004123,
    "total": 6245,
    "truncated": 67,
    "truncated_ratio": 0.0106,
    "hard_terminated": 0,
    "params": {},
    "adjusted_accuracy": 0.4123456789,
    "adjusted_successes": 1876.0,
    "adjusted_trials": 5890.0,
    "adjusted_center": 0.4156789012,
    "adjusted_margin": 0.0234567890,
    "completion_tokens_mean": 412.567890,
    "completion_tokens_correct_mean": 398.123456,
    "completion_tokens_incorrect_mean": 428.987654,
    "prompt_tokens_mean": 145.678901,
    "total_tokens": 2876543
  }
}
```
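A sketch of how the count fields of an aggregated bucket relate to its constituent point buckets (inferred from the fields above; `evaluate.py` also recomputes the adjusted statistics and token means, which are not shown here):

```python
COUNT_FIELDS = ("correct", "invalid", "total", "truncated", "hard_terminated")

def aggregate(point_buckets):
    # Sum the raw counts of the constituent buckets and record how many
    # splits contributed (`bcount`); ratio denominators follow the
    # convention inferred from the examples (`total` excludes truncated).
    agg = {"btype": "scenario", "bcount": len(point_buckets)}
    for field in COUNT_FIELDS:
        agg[field] = sum(b[field] for b in point_buckets)
    agg["invalid_ratio"] = agg["invalid"] / agg["total"]
    agg["truncated_ratio"] = agg["truncated"] / (agg["total"] + agg["truncated"])
    return agg
```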
### Scenario Base Task Bucket Example
```json
{
  "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k+null+null+null+movies+*": {
    "model": "Phi-4-mini-instruct-fp16",
    "template": "zerocot-nosys",
    "param_name": "greedy-4k",
    "density": "null",
    "precision": "null",
    "degree": "null",
    "scenario": "Phi-4-mini-instruct-fp16+zerocot-nosys+greedy-4k/null+null+null",
    "base_task": "movies",
    "task": "*",
    "btype": "scenario_base_task",
    "bcount": 5,
    "correct": 1234,
    "invalid": 18,
    "invalid_ratio": 0.005678,
    "total": 2876,
    "truncated": 23,
    "truncated_ratio": 0.008765,
    "hard_terminated": 0,
    "params": {},
    "adjusted_accuracy": 0.4567890123,
    "adjusted_successes": 987.0,
    "adjusted_trials": 2567.0,
    "adjusted_center": 0.4598765432,
    "adjusted_margin": 0.0187654321,
    "completion_tokens_mean": 398.765432,
    "completion_tokens_correct_mean": 387.654321,
    "completion_tokens_incorrect_mean": 412.345678,
    "prompt_tokens_mean": 134.567890,
    "total_tokens": 1234567
  }
}
```
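Downstream tools can load the `bucket.json` written with `--output` and select buckets by `btype`. A small sketch with illustrative data (the keys and fields here are abbreviated stand-ins for real bucket entries):

```python
import json
import os
import tempfile

# A minimal stand-in for a bucket.json produced with --output.
buckets = {
    "m+t+p+task-a": {"btype": "point", "correct": 10, "total": 20},
    "m+t+p+null+null+null+*+*": {"btype": "scenario", "correct": 10, "total": 20},
}

path = os.path.join(tempfile.mkdtemp(), "bucket.json")
with open(path, "w") as f:
    json.dump(buckets, f)

# Load the evaluation output and keep only the most granular (point) buckets.
with open(path) as f:
    data = json.load(f)
points = {k: v for k, v in data.items() if v["btype"] == "point"}
```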
## Statistical Processing
The evaluator performs several statistical computations on the results:
- Guessing Adjustment: Adjusts for random guessing by subtracting the expected number of guessing successes from both the success and trial counts (`adjusted_successes`, `adjusted_trials`)
- Wilson Confidence Interval: Calculates a 95% confidence interval (`adjusted_center`, `adjusted_margin`) for the adjusted accuracy scores
- Token Statistics: Computes mean completion tokens for correct and incorrect responses
- Validity Checking: Validates extracted answers against task-specific response enums
- Truncation Analysis: Identifies and reports truncated responses
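As a concrete check, the adjusted fields of the point-bucket example above can be reproduced with the following sketch, assuming a guessing rate of `1/choice_count` and `z = 1.96`; the exact formulas in `evaluate.py` may differ in detail:

```python
import math

def adjusted_metrics(correct, total, choice_count, z=1.96):
    # Expected number of successes from pure random guessing
    # (assumption: guess rate = 1 / choice_count).
    expected_guesses = total / choice_count
    adj_successes = correct - expected_guesses
    adj_trials = total - expected_guesses
    p = adj_successes / adj_trials
    n = adj_trials
    # Wilson score interval for the adjusted proportion.
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return {
        "adjusted_successes": adj_successes,
        "adjusted_trials": adj_trials,
        "adjusted_accuracy": p,
        "adjusted_center": center,
        "adjusted_margin": margin,
    }

# Values from the point-bucket example: correct=337, total=888, choice_count=12.
m = adjusted_metrics(337, 888, 12)
```

With these inputs the sketch yields `adjusted_successes = 263.0` and `adjusted_trials = 814.0`, matching the example output.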
## Histogram Analysis

When the `--histogram SIZE COUNT` option is provided, the evaluator generates token usage histograms:
- Bucketing: Divides completion tokens into buckets of specified size
- Separate Tracking: Maintains separate histograms for correct and incorrect responses
- Percentage Output: Reports histogram values as percentages
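A minimal sketch of the bucketing step (the function name and input shape are illustrative, not `evaluate.py`'s API): completion-token counts fall into `COUNT` buckets of width `SIZE`, keyed by the bucket's lower bound, and each bucket is reported as a percentage of its group; correct and incorrect responses would each get their own histogram:

```python
def token_histogram(token_counts, size, count):
    # One bucket per multiple of `size`, keyed by its lower bound;
    # values above the last bucket are clamped into it.
    buckets = {i * size: 0 for i in range(count)}
    for tokens in token_counts:
        key = min(tokens // size, count - 1) * size
        buckets[key] += 1
    n = len(token_counts) or 1
    # Report each bucket as a percentage of the group.
    return {k: 100.0 * v / n for k, v in buckets.items()}
```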
## FFT Analysis

When the `--tokenizer TOKENIZER` option is provided, the evaluator performs frequency analysis:
- Tokenization: Uses the specified tokenizer to convert responses to token sequences
- Frequency Domain Conversion: Applies FFT to analyze token sequence patterns
- Spectrum Analysis: Computes average and standard deviation of frequency spectra
- Sampling Control: Uses `--fftsamples` to limit the number of samples analyzed (default: 128)
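A dependency-free sketch of the spectrum step, assuming equal-length token-ID sequences (a production version would use `numpy.fft.rfft`; the exact windowing and normalization in `evaluate.py` are not documented here): compute a magnitude spectrum per sequence and average across samples, with a normalized frequency axis running from 0 to the Nyquist frequency of 0.5, as in the `frequencies` array of the point-bucket example:

```python
import cmath

def magnitude_spectrum(seq):
    # Naive real-input DFT magnitude spectrum (bins 0 .. N//2),
    # a stand-in for numpy.fft.rfft.
    n = len(seq)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(seq))
        spectrum.append(abs(s))
    return spectrum

def average_spectra(token_sequences):
    # Average the per-sample magnitude spectra bin by bin.
    spectra = [magnitude_spectrum(s) for s in token_sequences]
    n_bins = len(spectra[0])
    avg = [sum(s[k] for s in spectra) / len(spectra) for k in range(n_bins)]
    # Normalized frequency axis: k / N, ending at 0.5 (Nyquist).
    freqs = [k / len(token_sequences[0]) for k in range(n_bins)]
    return avg, freqs
```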