M12X: Comprehensive 12-Domain Reasoning Evaluation

The M12X evaluation suite is ReasonScape's most comprehensive assessment of large-language-model reasoning capabilities: 12 cognitive domains with progressive difficulty scaling and flexible resource utilization.

Overview

M12X (Multi-domain 12-task eXtended) is ReasonScape's flagship evaluation methodology, designed to provide thorough assessment across diverse reasoning capabilities while maintaining statistical rigor and computational efficiency.

Key Features

  • 12 Cognitive Domains: Comprehensive coverage of reasoning capabilities
  • Progressive Difficulty: 3-degree scaling from easy to hard
  • Flexible Precision: Independent resource utilization control
  • Statistical Rigor: Confidence intervals, excess accuracy correction
  • Hierarchical Sampling: Perfect subset scaling for efficient evaluation

See Methodology for more details.
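A minimal sketch of the two statistical mechanisms named above, under one common reading: "excess accuracy correction" is taken here to mean rescaling raw accuracy so that chance-level guessing maps to zero, and the confidence interval is a standard Wilson score interval. Both helpers are illustrative assumptions, not the ReasonScape implementation:

import math

def excess_accuracy(observed: float, chance: float) -> float:
    # Map chance-level accuracy to 0.0 and perfect accuracy to 1.0,
    # so a model that merely guesses earns no credit.
    return (observed - chance) / (1.0 - chance)

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score interval for a binomial proportion; better behaved
    # than the normal approximation at small n or extreme p.
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (centre - half, centre + half)

# A 4-way multiple-choice task answered correctly 75% of the time is only
# 66.7% above chance; 150/200 correct gives roughly a (0.686, 0.805) interval.
print(excess_accuracy(0.75, 0.25))
print(wilson_interval(150, 200))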

Cognitive Domains

M12X evaluates twelve distinct reasoning domains, together covering:

  • Mathematical and logical reasoning
  • Complex instruction following
  • Spatial and temporal processing
  • Pattern recognition and prediction
  • Structural parsing and syntax
  • Planning and algorithmic thinking

See Tasks for additional details.
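These domains are exercised through twelve concrete tasks, which appear as the per-task columns of the Resource Usage table below. A minimal reference list, with names taken verbatim from that table:

M12X_TASKS = [
    "Arithmetic", "Boolean", "Brackets", "Cars", "Dates", "Letters",
    "Movies", "Objects", "Sequence", "Shapes", "Shuffle", "Sort",
]
assert len(M12X_TASKS) == 12  # twelve tasks, matching the table columns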

Resource Usage

Each row lists one model at one difficulty. Columns: Model, Total Tokens, Avg Tokens/Completion, Total Tests, then per-task test counts for Arithmetic, Boolean, Brackets, Cars, Dates, Letters, Movies, Objects, Sequence, Shapes, Shuffle, and Sort.
GPT-OSS-120B (MX4) (easy) 8,074,603 762 11,451 1,088 574 831 2,560 543 544 448 1,216 544 1,439 1,056 608
GPT-OSS-120B (MX4) (medium) 20,362,466 1053 19,175 1,792 2,323 1,625 2,912 830 736 1,408 1,632 864 1,597 2,304 1,152
GPT-OSS-120B (MX4) (hard) 28,578,417 1324 21,613 2,015 2,131 1,925 2,816 1,181 864 1,917 2,240 1,024 1,629 2,528 1,343
Qwen3-Next-80B-A3B Instruct (AWQ) (easy) 15,880,500 1232 12,098 1,630 607 916 2,589 511 703 384 1,184 858 1,184 608 924
Qwen3-Next-80B-A3B Instruct (AWQ) (medium) 38,713,942 1612 21,696 3,054 2,846 2,244 2,918 798 893 1,408 1,759 1,076 1,216 2,207 1,277
Qwen3-Next-80B-A3B Instruct (AWQ) (hard) 52,309,582 1937 24,870 3,521 2,736 2,657 2,948 1,275 1,081 1,888 2,451 1,143 1,248 2,586 1,336
Qwen3-32B (AWQ) (easy) 23,563,982 1678 12,707 1,481 652 1,485 2,094 511 957 512 1,536 572 1,312 768 827
Qwen3-32B (AWQ) (medium) 57,049,799 2173 22,494 2,729 2,892 2,262 2,509 828 1,420 1,536 2,335 814 1,344 2,619 1,206
Qwen3-32B (AWQ) (hard) 70,129,352 2544 23,973 2,828 2,677 1,753 2,821 1,242 1,501 2,048 2,608 820 1,344 3,328 1,003
Seed-OSS 36B (AWQ) (easy) 25,350,999 2510 9,150 1,138 653 601 1,523 508 589 416 1,119 233 1,324 598 448
Seed-OSS 36B (AWQ) (medium) 51,840,440 3305 13,774 2,107 2,030 700 1,814 793 463 1,277 1,331 134 1,321 1,084 720
Seed-OSS 36B (AWQ) (hard) 61,035,328 3805 14,017 2,275 1,841 320 1,801 1,235 228 1,721 1,714 113 1,303 710 756
GPT-OSS-20B (MX4) (easy) 16,131,603 1168 13,647 1,256 633 978 2,601 509 1,023 444 1,919 574 1,630 1,152 928
GPT-OSS-20B (MX4) (medium) 38,321,588 1593 21,597 2,303 2,257 1,454 2,757 796 1,436 1,462 2,518 937 1,661 2,586 1,430
GPT-OSS-20B (MX4) (hard) 51,976,561 2069 23,217 2,455 2,089 1,327 2,705 1,147 1,502 1,932 2,597 1,067 1,787 3,360 1,249
Qwen3-14B (AWQ) (easy) 27,150,343 1869 12,993 1,683 981 1,079 2,232 540 860 480 1,696 531 1,216 800 895
Qwen3-14B (AWQ) (medium) 60,443,268 2362 22,226 3,356 2,963 1,488 2,605 827 1,082 1,472 2,589 793 1,248 2,526 1,277
Qwen3-14B (AWQ) (hard) 72,235,743 2758 23,448 3,747 3,002 827 2,902 1,273 1,409 2,016 2,537 855 1,280 2,566 1,034
Ring Flash 2.0 (AWQ) (easy) 36,859,607 3183 9,917 1,212 438 514 2,124 473 748 416 1,275 309 980 766 662
Ring Flash 2.0 (AWQ) (medium) 71,545,143 3836 15,516 1,789 1,845 570 2,279 726 938 1,375 1,542 401 951 2,174 926
Ring Flash 2.0 (AWQ) (hard) 77,024,066 4229 14,924 1,470 1,699 365 2,180 1,105 584 1,854 1,741 394 839 1,929 764
Qwen3-30B-A3B Original (AWQ) (easy) 32,880,717 2283 13,303 1,448 894 1,214 2,416 479 1,216 512 1,600 558 1,216 861 889
Qwen3-30B-A3B Original (AWQ) (medium) 71,858,396 2870 22,115 2,650 3,155 1,805 2,809 766 1,482 1,536 2,358 608 1,311 2,480 1,155
Qwen3-30B-A3B Original (AWQ) (hard) 81,544,242 3251 22,219 2,508 3,013 1,306 2,755 1,148 1,136 2,048 2,615 598 1,375 2,838 879
Phi-4 Reasoning (FP16) (easy) 32,481,996 2057 12,834 1,923 566 1,439 2,094 491 1,468 438 1,299 461 1,097 790 768
Phi-4 Reasoning (FP16) (medium) 60,383,587 2600 19,619 3,505 2,086 1,405 1,887 800 1,492 1,414 1,725 671 1,118 2,321 1,195
Phi-4 Reasoning (FP16) (hard) 70,094,680 3062 19,748 3,580 2,030 549 1,571 1,201 1,354 1,890 2,327 715 966 2,477 1,088
Qwen3-30B-A3B DeepSeek v3.1 Distill (FP8) (easy) 36,762,408 3102 10,274 1,127 551 953 1,806 445 556 480 1,312 561 955 795 733
Qwen3-30B-A3B DeepSeek v3.1 Distill (FP8) (medium) 68,031,485 3803 15,498 2,074 2,468 576 1,657 699 458 1,376 1,457 573 955 1,967 1,238
Qwen3-30B-A3B DeepSeek v3.1 Distill (FP8) (hard) 71,725,328 4208 14,888 2,265 2,178 206 1,330 1,078 202 1,855 1,454 554 859 1,739 1,168
Apriel-1.5-15b-Thinker (FP16) (easy) 25,069,379 2100 9,969 685 233 644 2,160 569 813 444 1,168 533 1,092 817 811
Apriel-1.5-15b-Thinker (FP16) (medium) 50,003,151 2821 13,049 630 1,023 584 2,293 823 720 1,327 1,228 371 1,142 1,934 974
Apriel-1.5-15b-Thinker (FP16) (hard) 55,723,294 3374 11,467 338 992 273 2,177 1,232 359 1,756 970 228 1,140 1,382 620
QwQ 32B (AWQ) (easy) 36,511,724 2886 11,449 1,125 362 881 2,321 477 820 603 1,656 470 1,075 895 764
QwQ 32B (AWQ) (medium) 73,382,414 3672 17,020 1,546 1,390 643 2,128 732 995 1,651 2,346 505 1,072 2,926 1,086
QwQ 32B (AWQ) (hard) 77,741,526 4107 16,034 1,211 1,361 322 2,108 1,047 604 2,224 2,544 654 956 2,157 846
Apriel-Nemotron-1.5-15b-Thinker (FP16) (easy) 25,678,966 1494 14,915 2,410 871 1,980 2,238 542 949 448 1,753 764 1,233 959 768
Apriel-Nemotron-1.5-15b-Thinker (FP16) (medium) 54,062,782 1950 24,046 4,166 2,338 2,578 2,715 829 1,401 1,440 2,571 891 1,295 2,897 925
Apriel-Nemotron-1.5-15b-Thinker (FP16) (hard) 64,123,746 2255 24,958 4,455 2,443 1,926 2,932 1,274 1,475 1,919 2,488 956 1,295 3,082 713
aquif-3.5 8B (FP16) (easy) 24,717,514 1395 15,476 2,569 854 1,544 2,612 575 1,191 512 1,888 670 1,365 896 800
aquif-3.5 8B (FP16) (medium) 50,795,435 1775 25,363 4,480 2,972 2,133 2,912 894 1,480 1,408 2,613 923 1,427 3,226 895
aquif-3.5 8B (FP16) (hard) 59,288,272 2100 25,260 4,553 2,845 1,343 2,980 1,371 1,378 1,920 2,345 957 1,432 3,404 732
Qwen3-Next-80B-A3B Thinking (AWQ) (easy) 31,840,100 3453 7,954 918 318 476 1,604 413 541 480 987 152 949 701 415
Qwen3-Next-80B-A3B Thinking (AWQ) (medium) 59,204,594 4197 11,779 1,360 1,337 485 1,443 699 681 1,466 1,135 69 1,004 1,472 628
Qwen3-Next-80B-A3B Thinking (AWQ) (hard) 66,296,992 4698 11,285 1,099 1,245 274 1,442 1,046 513 1,941 964 48 887 1,139 687
Magistral Small 2509 (FP8) (easy) 29,523,753 1654 15,213 2,943 1,012 1,327 2,337 603 1,303 512 1,280 798 1,405 768 925
Magistral Small 2509 (FP8) (medium) 56,711,835 2066 24,662 4,609 3,185 1,343 2,811 952 1,515 1,504 2,206 1,084 1,437 2,747 1,269
Magistral Small 2509 (FP8) (hard) 63,443,463 2343 25,503 4,305 2,930 771 2,902 1,331 1,498 2,016 2,582 1,087 1,375 3,659 1,047
Llama-Nemotron-Super 49B v1.5 (INT8) (easy) 31,021,098 2606 10,043 1,553 356 639 1,609 511 860 543 1,213 276 1,029 765 689
Llama-Nemotron-Super 49B v1.5 (INT8) (medium) 61,560,487 3205 15,991 2,061 1,435 341 1,937 764 1,329 1,375 1,866 492 1,118 2,172 1,101
Llama-Nemotron-Super 49B v1.5 (INT8) (hard) 71,234,100 3640 16,576 1,546 1,338 122 2,176 1,143 1,452 1,854 2,180 550 1,264 2,026 925
GLM-4.5 Air (AWQ) (easy) 36,747,440 2636 11,721 1,791 699 943 2,110 478 604 448 1,632 685 1,001 730 600
GLM-4.5 Air (AWQ) (medium) 68,265,410 3153 18,440 2,589 2,695 875 2,369 733 382 1,375 1,433 1,046 982 2,920 1,041
GLM-4.5 Air (AWQ) (hard) 78,140,372 3548 19,174 2,188 2,502 477 2,674 1,082 216 1,854 2,013 1,106 941 2,977 1,144
Hunyuan A13B-Instruct (GPTQ) (easy) 24,537,772 1431 16,067 2,190 880 1,547 2,239 571 1,363 512 2,304 827 1,594 1,184 856
Hunyuan A13B-Instruct (GPTQ) (medium) 53,254,183 1884 26,138 4,027 2,918 2,282 2,748 857 1,400 1,632 2,816 1,057 1,528 3,838 1,035
Hunyuan A13B-Instruct (GPTQ) (hard) 62,584,085 2249 24,903 4,040 2,517 1,627 2,895 1,364 871 2,208 2,102 1,016 1,370 4,169 724
Qwen3-8B Original (FP16) (easy) 35,111,857 2590 12,258 1,677 756 474 2,235 603 909 512 1,950 498 1,050 831 763
Qwen3-8B Original (FP16) (medium) 76,998,713 3260 20,779 3,014 2,954 419 2,584 890 1,417 1,568 2,557 573 1,050 2,543 1,210
Qwen3-8B Original (FP16) (hard) 95,954,003 3621 23,081 3,449 3,462 284 2,538 1,366 979 2,432 2,268 540 1,148 3,603 1,012
Qwen3-30B-A3B Instruct-2507 (AWQ) (easy) 20,178,612 1353 12,385 1,593 673 1,091 1,920 605 1,182 544 1,664 729 881 831 672
Qwen3-30B-A3B Instruct-2507 (AWQ) (medium) 40,127,418 1754 19,707 3,050 2,274 762 2,103 987 1,498 1,504 2,399 942 879 2,639 670
Qwen3-30B-A3B Instruct-2507 (AWQ) (hard) 50,186,835 2116 21,435 3,734 2,337 338 2,094 1,462 1,498 2,016 2,841 970 815 2,765 565
Qwen3-4B Thinking-2507 (FP16) (easy) 49,757,627 4384 10,065 1,840 433 427 1,695 500 626 511 1,557 412 828 671 565
Qwen3-4B Thinking-2507 (FP16) (medium) 82,503,525 5073 14,051 2,681 1,827 263 1,567 751 372 1,405 1,656 163 823 1,740 803
Qwen3-4B Thinking-2507 (FP16) (hard) 83,141,988 5415 12,500 2,091 1,514 80 1,430 1,116 174 1,913 1,527 144 719 1,187 605
Qwen3-4B Original (FP16) (easy) 39,463,143 2472 14,124 2,181 1,068 778 2,523 634 1,202 544 1,599 550 1,213 1,010 822
Qwen3-4B Original (FP16) (medium) 78,796,516 3151 22,477 3,555 3,031 686 2,808 947 1,475 1,536 2,411 396 1,458 3,142 1,032
Qwen3-4B Original (FP16) (hard) 89,893,549 3569 22,641 3,324 2,995 396 2,841 1,451 1,117 2,080 2,312 414 1,395 3,532 784
Qwen3-4B Instruct-2507 (FP16) (easy) 25,086,642 1456 15,716 2,797 1,037 853 2,157 633 1,213 512 1,888 895 1,624 1,119 988
Qwen3-4B Instruct-2507 (FP16) (medium) 49,710,158 1897 24,892 4,658 3,248 627 2,380 1,013 1,530 1,503 2,559 1,117 1,682 3,407 1,168
Qwen3-4B Instruct-2507 (FP16) (hard) 58,408,997 2331 25,285 4,592 3,085 197 2,329 1,521 1,353 2,015 2,783 1,149 1,660 3,636 965
Qwen3-30B-A3B Thinking-2507 (AWQ) (easy) 43,260,618 3568 10,331 1,267 436 683 1,508 476 919 415 1,437 719 971 744 756
Qwen3-30B-A3B Thinking-2507 (AWQ) (medium) 74,154,340 4301 14,124 2,048 1,510 184 1,457 729 704 1,370 1,585 390 1,032 2,001 1,114
Qwen3-30B-A3B Thinking-2507 (AWQ) (hard) 76,194,307 4667 13,035 1,887 1,396 61 1,257 1,077 230 1,912 1,297 254 985 1,756 923
Nemotron Nano 9B v2 (FP16) (easy) 26,431,618 1504 15,925 2,858 1,060 1,023 2,413 604 1,244 480 2,015 535 1,472 1,342 879
Nemotron Nano 9B v2 (FP16) (medium) 54,470,408 1978 25,088 4,325 3,089 985 2,846 986 1,491 1,408 2,486 784 1,787 3,967 934
Nemotron Nano 9B v2 (FP16) (hard) 61,480,100 2339 24,116 3,991 2,779 423 3,006 1,462 1,176 1,920 1,930 847 1,723 4,254 605
Hermes-4 14B (FP8) (easy) 31,012,323 1870 13,529 2,525 582 504 1,659 533 1,196 575 1,887 830 1,428 1,036 774
Hermes-4 14B (FP8) (medium) 61,401,273 2304 21,928 4,293 1,985 351 2,005 911 1,437 1,657 2,743 1,037 1,413 3,331 765
Hermes-4 14B (FP8) (hard) 68,721,936 2612 21,213 4,287 1,802 240 1,890 1,384 1,184 2,262 2,456 1,040 1,283 2,879 506
R1-0528-Qwen3-8B (FP16) (easy) 49,778,494 3087 13,892 2,024 642 1,003 2,032 555 942 480 2,164 512 1,686 991 861
R1-0528-Qwen3-8B (FP16) (medium) 83,511,661 3615 19,983 2,790 2,300 630 2,211 929 264 1,504 2,302 731 1,968 3,538 816
R1-0528-Qwen3-8B (FP16) (hard) 83,385,639 3887 18,647 2,065 2,069 201 2,251 1,358 221 2,047 1,429 816 1,758 3,803 629
Hermes-4 70B (AWQ) (easy) 30,019,066 1784 10,623 2,788 544 114 953 526 1,135 471 1,203 367 1,023 1,052 447
Hermes-4 70B (AWQ) (medium) 56,247,853 2273 16,716 3,811 1,775 82 1,174 834 1,165 1,448 1,658 501 998 2,712 558
Hermes-4 70B (AWQ) (hard) 63,248,929 2641 16,660 3,169 1,723 36 1,188 1,171 811 1,949 1,786 624 973 2,640 590
Ring Mini 2.0 (FP16) (easy) 44,576,866 3616 10,421 1,005 645 556 1,761 580 839 544 1,568 309 849 1,245 520
Ring Mini 2.0 (FP16) (medium) 85,774,200 4359 16,565 883 2,076 488 2,369 1,169 195 1,663 1,939 307 866 3,899 711
Ring Mini 2.0 (FP16) (hard) 83,837,586 4669 14,318 469 1,800 201 2,434 1,321 89 2,239 1,119 323 665 3,157 501
aquif-3.5 A4B (FP16) (easy) 38,124,382 3009 10,701 1,168 293 271 1,952 536 1,279 576 1,875 137 1,068 987 559
aquif-3.5 A4B (FP16) (medium) 68,910,321 3783 13,960 1,211 943 103 1,560 852 699 1,597 2,399 59 1,055 2,953 529
aquif-3.5 A4B (FP16) (hard) 68,980,479 4161 11,774 771 909 29 1,113 1,349 231 2,141 1,933 45 1,005 1,846 402
Gemma3-27B-It (FP16) (easy) 5,951,843 355 15,232 2,616 1,150 1,112 2,553 602 704 416 2,176 1,023 1,024 1,280 576
Gemma3-27B-It (FP16) (medium) 12,422,760 447 22,572 3,394 2,311 1,341 2,872 981 508 1,472 2,784 895 1,184 4,224 606
Gemma3-27B-It (FP16) (hard) 14,841,446 514 22,619 2,716 2,112 966 2,909 1,393 497 1,952 2,688 831 1,120 4,830 605
Llama-3.3-70B (FP8) (easy) 5,523,294 370 13,680 2,560 1,088 598 2,464 605 798 416 1,536 1,024 959 960 672
Llama-3.3-70B (FP8) (medium) 13,435,170 486 21,961 3,326 3,296 597 2,944 1,019 509 1,440 2,368 896 1,023 3,839 704
Llama-3.3-70B (FP8) (hard) 16,966,749 571 21,991 2,878 3,168 376 2,880 1,432 415 1,920 2,624 832 1,087 3,739 640
Phi-4 (FP16) (easy) 6,417,477 391 15,279 2,654 1,371 761 2,688 637 832 448 2,176 992 1,184 800 736
Phi-4 (FP16) (medium) 13,220,775 502 22,693 3,646 3,546 694 2,944 1,083 511 1,472 2,592 1,056 1,246 3,200 703
Phi-4 (FP16) (hard) 15,621,380 597 22,273 2,907 3,296 597 2,944 1,560 416 1,952 1,824 992 1,310 3,904 571
Hunyuan 7B-Instruct (FP16) (easy) 29,344,139 1907 13,729 414 1,000 561 2,187 858 1,465 472 2,496 857 1,608 864 947
Hunyuan 7B-Instruct (FP16) (medium) 54,216,234 2663 19,652 321 3,127 199 2,625 1,303 1,131 1,464 2,588 1,074 1,666 3,251 903
Hunyuan 7B-Instruct (FP16) (hard) 61,320,090 2998 19,147 185 2,908 96 2,657 1,778 778 1,995 1,905 1,103 1,463 3,742 537
Gemma3-12B-It (FP16) (easy) 6,299,479 357 15,121 2,329 1,292 639 2,609 478 768 512 2,656 894 1,152 1,184 608
Gemma3-12B-It (FP16) (medium) 13,264,347 463 21,569 2,593 3,091 863 2,778 892 508 1,600 2,912 540 1,184 3,936 672
Gemma3-12B-It (FP16) (hard) 15,543,276 552 21,629 2,079 2,781 863 2,808 1,369 406 2,112 2,336 414 1,120 4,701 640
granite-4.0-h small (FP16) (easy) 6,522,847 304 15,751 2,229 967 724 2,688 796 701 576 2,463 959 1,024 1,952 672
granite-4.0-h small (FP16) (medium) 12,541,567 377 21,364 2,675 1,758 799 2,912 1,274 536 1,760 2,298 767 1,246 4,669 670
granite-4.0-h small (FP16) (hard) 14,055,779 432 18,541 1,761 1,681 716 2,976 1,783 334 2,301 1,331 736 1,310 3,016 596
Hunyuan 4B-Instruct (FP16) (easy) 28,908,510 1852 14,138 953 370 1,402 2,643 694 1,329 480 2,431 153 1,781 1,183 719
Hunyuan 4B-Instruct (FP16) (medium) 51,396,935 2502 18,281 626 738 1,523 2,764 1,107 913 1,503 2,525 95 1,889 3,995 603
Hunyuan 4B-Instruct (FP16) (hard) 56,421,969 2961 16,623 326 761 592 2,672 1,611 560 2,013 1,629 86 1,764 4,172 437
R1-Distill-Llama-8B (FP16) (easy) 34,274,431 1693 17,190 1,686 1,382 1,006 2,592 952 1,362 574 2,176 873 2,065 1,742 780
R1-Distill-Llama-8B (FP16) (medium) 59,928,466 2223 22,121 1,946 3,078 537 2,743 1,429 773 1,728 1,568 734 2,223 4,672 690
R1-Distill-Llama-8B (FP16) (hard) 59,798,754 2477 20,347 1,642 2,610 321 2,560 1,904 467 2,336 1,107 717 2,040 4,133 510
Qwen3-1.7B (AWQ) (easy) 46,900,595 2516 16,747 2,270 1,205 551 2,553 889 1,095 576 2,303 1,095 1,561 1,917 732
Qwen3-1.7B (AWQ) (medium) 74,092,131 3205 20,765 2,993 1,824 276 2,980 1,392 407 1,728 1,662 1,019 1,687 4,240 557
Qwen3-1.7B (AWQ) (hard) 76,506,761 3621 19,411 1,955 1,734 139 2,841 1,868 261 2,304 1,009 1,006 1,560 4,343 391
ERNIE-4.5-21B-A3B Thinking (AWQ) (easy) 51,617,179 3465 12,160 1,295 881 372 2,161 597 873 572 2,089 380 707 1,567 666
ERNIE-4.5-21B-A3B Thinking (AWQ) (medium) 80,978,436 4119 15,210 1,370 2,779 293 2,052 968 186 1,712 2,089 176 599 2,461 525
ERNIE-4.5-21B-A3B Thinking (AWQ) (hard) 76,649,834 4451 12,946 831 2,567 173 1,944 1,435 170 2,310 1,116 122 584 1,263 431
Phi-4 Mini Reasoning (FP16) (easy) 54,232,057 3078 14,642 2,385 1,096 88 2,524 717 1,275 625 2,472 297 1,143 1,351 669
Phi-4 Mini Reasoning (FP16) (medium) 80,513,695 3669 16,630 3,237 2,083 32 1,778 1,092 786 1,498 1,887 265 1,107 2,417 448
Phi-4 Mini Reasoning (FP16) (hard) 72,414,552 4035 12,947 2,346 1,914 19 1,062 1,564 440 2,015 1,004 286 1,052 965 280
SmolLM3 3B (FP16) (easy) 32,781,845 1633 14,538 2,104 1,069 253 1,272 968 1,282 565 1,907 831 1,662 1,914 711
SmolLM3 3B (FP16) (medium) 55,699,395 2086 18,762 2,540 2,244 196 1,674 1,449 772 1,772 1,284 928 1,787 3,601 515
SmolLM3 3B (FP16) (hard) 53,593,811 2388 16,147 1,916 2,062 90 1,678 1,907 548 2,378 867 893 1,699 1,805 304
Gemma3-4B-It (FP16) (easy) 6,895,460 348 15,059 2,027 877 667 2,639 795 672 608 2,619 447 1,184 2,015 509
Gemma3-4B-It (FP16) (medium) 12,829,122 472 20,453 2,067 1,968 844 2,996 1,294 384 1,824 2,431 287 1,280 4,640 438
Gemma3-4B-It (FP16) (hard) 13,098,828 547 18,632 1,520 1,889 839 2,871 1,803 448 2,432 1,568 319 1,344 3,200 399
granite-3.3 8B Instruct (FP16) (easy) 9,376,264 593 16,089 1,632 1,532 787 2,752 923 671 544 2,368 561 1,663 2,016 640
granite-3.3 8B Instruct (FP16) (medium) 15,560,912 719 20,396 1,501 3,376 767 2,848 1,428 445 1,728 1,632 308 1,662 4,126 575
granite-3.3 8B Instruct (FP16) (hard) 16,371,771 771 18,390 1,240 3,083 517 2,848 1,874 377 2,368 1,054 305 1,342 2,936 446
Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) (easy) 41,113,288 3125 9,604 1,902 199 165 757 954 1,038 542 2,210 155 686 653 343
Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) (medium) 62,633,572 3994 9,798 1,902 755 46 729 1,448 291 1,480 1,483 101 678 669 216
Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) (hard) 64,379,784 4491 8,493 1,088 776 13 596 1,918 144 2,011 765 68 700 267 147
Llama-3.1-Nemotron-Nano-8B (FP16) (easy) 52,725,257 1206 22,010 4,321 582 997 2,558 1,102 1,214 999 3,559 151 3,656 2,259 612
Llama-3.1-Nemotron-Nano-8B (FP16) (medium) 53,888,942 1563 18,322 3,565 2,042 699 2,121 814 448 1,414 1,937 106 2,125 2,802 249
Llama-3.1-Nemotron-Nano-8B (FP16) (hard) 51,786,393 1725 15,900 2,735 1,698 595 2,036 1,189 384 1,889 1,126 93 2,076 1,865 214
AI21 Jamba Reasoning 3B (FP16) (easy) 49,040,340 3090 11,600 1,299 608 451 2,158 811 500 410 1,714 540 1,229 1,509 371
AI21 Jamba Reasoning 3B (FP16) (medium) 76,612,547 3877 17,259 1,700 2,838 517 2,826 1,250 314 1,409 1,465 469 1,251 2,850 370
AI21 Jamba Reasoning 3B (FP16) (hard) 76,016,642 4237 16,735 1,381 2,876 395 2,943 1,754 286 2,035 993 488 1,288 1,943 353
granite-4.0-h tiny (FP16) (easy) 7,369,188 207 13,679 1,601 715 391 2,432 985 480 640 1,949 503 1,408 1,952 623
granite-4.0-h tiny (FP16) (medium) 15,630,823 266 14,540 1,184 1,443 195 2,783 1,492 383 1,568 1,055 309 1,566 2,079 483
granite-4.0-h tiny (FP16) (hard) 19,347,688 298 13,820 810 1,417 63 3,008 2,000 384 2,047 647 309 1,374 1,427 334
granite-4.0-h micro (FP16) (easy) 4,560,449 256 14,168 1,703 1,476 607 2,912 986 639 640 765 672 1,438 1,888 442
granite-4.0-h micro (FP16) (medium) 7,731,848 336 17,259 1,540 3,159 895 2,880 1,431 470 1,824 760 320 1,278 2,304 398
granite-4.0-h micro (FP16) (hard) 8,598,854 383 16,392 1,255 2,802 895 2,816 1,877 374 2,432 746 288 1,022 1,533 352
Llama-3.1-8B (FP16) (easy) 13,071,823 366 13,986 1,365 591 200 2,910 956 857 534 2,622 353 1,113 1,950 535
Llama-3.1-8B (FP16) (medium) 21,715,813 512 15,059 1,326 1,266 156 2,811 1,204 462 1,772 1,977 190 1,195 2,099 601
Llama-3.1-8B (FP16) (hard) 24,165,191 583 13,839 1,059 1,219 85 2,750 1,681 365 2,376 1,178 179 1,105 1,412 430
Llama-3.2-3B (FP16) (easy) 8,164,023 334 12,588 1,161 803 317 2,823 972 373 633 2,432 188 920 1,625 341
Llama-3.2-3B (FP16) (medium) 14,517,297 382 14,312 1,269 1,003 308 2,829 1,391 365 1,784 1,951 187 982 1,892 351
Llama-3.2-3B (FP16) (hard) 16,644,490 429 13,613 1,034 961 273 2,769 1,797 373 2,379 1,116 186 950 1,469 306
Phi-4 Mini Flash Reasoning (FP16) (easy) 31,116,228 1963 5,679 698 45 87 620 832 445 94 1,387 113 1,065 157 136

Overall Totals

Unique Models: 53

Total Tokens (All Models): 7,109,826,885

Total Tests (All Models): 2,646,483
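For downstream analysis it helps to slice these rows programmatically. A minimal sketch, assuming each row keeps the one-line layout above (model name, parenthesized difficulty, then fifteen numbers); parse_row is a hypothetical helper, not part of the ReasonScape tooling:

import re

ROW = re.compile(r"^(?P<model>.+) \((?P<difficulty>easy|medium|hard)\) (?P<numbers>[\d, ]+)$")

def parse_row(line: str) -> dict:
    # Split a Resource Usage row into model name, difficulty, and the
    # fifteen numeric columns described in the legend above.
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised row: {line!r}")
    nums = [int(n.replace(",", "")) for n in m.group("numbers").split()]
    total_tokens, avg_tokens, total_tests, *per_task = nums
    return {
        "model": m.group("model"),
        "difficulty": m.group("difficulty"),
        "total_tokens": total_tokens,
        "avg_tokens_per_completion": avg_tokens,
        "total_tests": total_tests,
        "per_task_tests": per_task,  # 12 counts, Arithmetic through Sort
    }

row = parse_row("GPT-OSS-120B (MX4) (easy) 8,074,603 762 11,451 "
                "1,088 574 831 2,560 543 544 448 1,216 544 1,439 1,056 608")
assert row["model"] == "GPT-OSS-120B (MX4)" and len(row["per_task_tests"]) == 12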

Evaluation Workflows

Rapid Assessment (2-3 hours)

python runner.py --config configs/m12x.yaml --degree 0 --density normal --precision low
  • Quick capability screening across all 12 domains
  • Comprehensive sampling with basic statistical confidence
  • Efficient resource utilization

Standard Evaluation (8-12 hours)

python runner.py --config configs/m12x.yaml --degree 1 --density normal --precision medium
  • Comprehensive assessment with moderate difficulty
  • Balanced parameter space coverage
  • Publication-ready statistical rigor

Research-Grade Analysis (20+ hours)

python runner.py --config configs/m12x.yaml --degree 2 --density normal --precision high
  • Maximum difficulty revealing failure modes
  • Complete parameter space exploration
  • Research-grade confidence intervals and comprehensive cognitive architecture analysis
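The three workflows can also be run as a single sweep. A minimal sketch that shells out to the runner with exactly the flags shown above, pairing each degree with its suggested precision:

import subprocess

# Degrees 0/1/2 paired with precision low/medium/high, as in the
# three workflow commands above.
for degree, precision in [(0, "low"), (1, "medium"), (2, "high")]:
    subprocess.run(
        ["python", "runner.py",
         "--config", "configs/m12x.yaml",
         "--degree", str(degree),
         "--density", "normal",
         "--precision", precision],
        check=True,  # stop the sweep if a run fails
    )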

Citation

When using M12X in research, please cite:

@software{reasonscape_m12x2025,
  title={M12X: Comprehensive 12-Domain Reasoning Evaluation},
  author={Mikhail Ravkine},
  year={2025},
  url={https://github.com/the-crypt-keeper/reasonscape},
  note={Part of ReasonScape evaluation methodology}
}
