M12X: Comprehensive 12-Domain Reasoning Evaluation

The M12X evaluation suite is ReasonScape's most comprehensive assessment of large-language-model reasoning capabilities: 12 cognitive domains with progressive difficulty scaling and flexible resource utilization.

Overview

M12X (Multi-domain 12-task eXtended) is ReasonScape's flagship evaluation methodology, designed to provide thorough assessment across diverse reasoning capabilities while maintaining statistical rigor and computational efficiency.

Key Features

  • 12 Cognitive Domains: Comprehensive coverage of reasoning capabilities
  • Progressive Difficulty: 3-degree scaling from easy to hard
  • Flexible Precision: Independent resource utilization control
  • Statistical Rigor: Confidence intervals, excess accuracy correction
  • Hierarchical Sampling: Perfect subset scaling for efficient evaluation

See Methodology for more details.
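A minimal sketch of the two statistical mechanisms named above, under one common reading: "excess accuracy correction" is taken here to mean rescaling raw accuracy so that chance-level guessing maps to zero, and the confidence interval is a standard Wilson score interval. Both helpers are illustrative assumptions, not the ReasonScape implementation:

import math

def excess_accuracy(observed: float, chance: float) -> float:
    # Map chance-level accuracy to 0.0 and perfect accuracy to 1.0,
    # so a model that merely guesses earns no credit.
    return (observed - chance) / (1.0 - chance)

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score interval for a binomial proportion; better behaved
    # than the normal approximation at small n or extreme p.
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (centre - half, centre + half)

# A 4-way multiple-choice task answered correctly 75% of the time is only
# 66.7% above chance; 150/200 correct gives roughly a (0.686, 0.805) interval.
print(excess_accuracy(0.75, 0.25))
print(wilson_interval(150, 200))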

Cognitive Domains

M12X evaluates twelve distinct reasoning domains, together covering:

  • Mathematical and logical reasoning
  • Complex instruction following
  • Spatial and temporal processing
  • Pattern recognition and prediction
  • Structural parsing and syntax
  • Planning and algorithmic thinking

See Tasks for additional details.
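These domains are exercised through twelve concrete tasks, which appear as the per-task columns of the Resource Usage table below. A minimal reference list, with names taken verbatim from that table:

M12X_TASKS = [
    "Arithmetic", "Boolean", "Brackets", "Cars", "Dates", "Letters",
    "Movies", "Objects", "Sequence", "Shapes", "Shuffle", "Sort",
]
assert len(M12X_TASKS) == 12  # twelve tasks, matching the table columns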

Resource Usage

Each row lists one model at one difficulty. Columns: Model, Total Tokens, Avg Tokens/Completion, Total Tests, then per-task test counts for Arithmetic, Boolean, Brackets, Cars, Dates, Letters, Movies, Objects, Sequence, Shapes, Shuffle, and Sort.
GPT-OSS-120B (MX4) (easy) 8,074,603 762 11,451 1,088 574 831 2,560 543 544 448 1,216 544 1,439 1,056 608
GPT-OSS-120B (MX4) (medium) 20,362,466 1053 19,175 1,792 2,323 1,625 2,912 830 736 1,408 1,632 864 1,597 2,304 1,152
GPT-OSS-120B (MX4) (hard) 28,578,417 1324 21,613 2,015 2,131 1,925 2,816 1,181 864 1,917 2,240 1,024 1,629 2,528 1,343
Qwen3-Next-80B-A3B Instruct (AWQ) (easy) 15,880,500 1232 12,098 1,630 607 916 2,589 511 703 384 1,184 858 1,184 608 924
Qwen3-Next-80B-A3B Instruct (AWQ) (medium) 38,713,942 1612 21,696 3,054 2,846 2,244 2,918 798 893 1,408 1,759 1,076 1,216 2,207 1,277
Qwen3-Next-80B-A3B Instruct (AWQ) (hard) 52,309,582 1937 24,870 3,521 2,736 2,657 2,948 1,275 1,081 1,888 2,451 1,143 1,248 2,586 1,336
Qwen3-32B (AWQ) (easy) 23,563,982 1678 12,707 1,481 652 1,485 2,094 511 957 512 1,536 572 1,312 768 827
Qwen3-32B (AWQ) (medium) 57,049,799 2173 22,494 2,729 2,892 2,262 2,509 828 1,420 1,536 2,335 814 1,344 2,619 1,206
Qwen3-32B (AWQ) (hard) 70,129,352 2544 23,973 2,828 2,677 1,753 2,821 1,242 1,501 2,048 2,608 820 1,344 3,328 1,003
Seed-OSS 36B (AWQ) (easy) 25,350,999 2510 9,150 1,138 653 601 1,523 508 589 416 1,119 233 1,324 598 448
Seed-OSS 36B (AWQ) (medium) 51,840,440 3305 13,774 2,107 2,030 700 1,814 793 463 1,277 1,331 134 1,321 1,084 720
Seed-OSS 36B (AWQ) (hard) 61,035,328 3805 14,017 2,275 1,841 320 1,801 1,235 228 1,721 1,714 113 1,303 710 756
GPT-OSS-20B (MX4) (easy) 16,131,603 1168 13,647 1,256 633 978 2,601 509 1,023 444 1,919 574 1,630 1,152 928
GPT-OSS-20B (MX4) (medium) 38,321,588 1593 21,597 2,303 2,257 1,454 2,757 796 1,436 1,462 2,518 937 1,661 2,586 1,430
GPT-OSS-20B (MX4) (hard) 51,976,561 2069 23,217 2,455 2,089 1,327 2,705 1,147 1,502 1,932 2,597 1,067 1,787 3,360 1,249
Qwen3-14B (AWQ) (easy) 27,150,343 1869 12,993 1,683 981 1,079 2,232 540 860 480 1,696 531 1,216 800 895
Qwen3-14B (AWQ) (medium) 60,443,268 2362 22,226 3,356 2,963 1,488 2,605 827 1,082 1,472 2,589 793 1,248 2,526 1,277
Qwen3-14B (AWQ) (hard) 72,235,743 2758 23,448 3,747 3,002 827 2,902 1,273 1,409 2,016 2,537 855 1,280 2,566 1,034
Ring Flash 2.0 (AWQ) (easy) 36,859,607 3183 9,917 1,212 438 514 2,124 473 748 416 1,275 309 980 766 662
Ring Flash 2.0 (AWQ) (medium) 71,545,143 3836 15,516 1,789 1,845 570 2,279 726 938 1,375 1,542 401 951 2,174 926
Ring Flash 2.0 (AWQ) (hard) 77,024,066 4229 14,924 1,470 1,699 365 2,180 1,105 584 1,854 1,741 394 839 1,929 764
Qwen3-30B-A3B Original (AWQ) (easy) 32,880,717 2283 13,303 1,448 894 1,214 2,416 479 1,216 512 1,600 558 1,216 861 889
Qwen3-30B-A3B Original (AWQ) (medium) 71,858,396 2870 22,115 2,650 3,155 1,805 2,809 766 1,482 1,536 2,358 608 1,311 2,480 1,155
Qwen3-30B-A3B Original (AWQ) (hard) 81,544,242 3251 22,219 2,508 3,013 1,306 2,755 1,148 1,136 2,048 2,615 598 1,375 2,838 879
Phi-4 Reasoning (FP16) (easy) 32,481,996 2057 12,834 1,923 566 1,439 2,094 491 1,468 438 1,299 461 1,097 790 768
Phi-4 Reasoning (FP16) (medium) 60,383,587 2600 19,619 3,505 2,086 1,405 1,887 800 1,492 1,414 1,725 671 1,118 2,321 1,195
Phi-4 Reasoning (FP16) (hard) 70,094,680 3062 19,748 3,580 2,030 549 1,571 1,201 1,354 1,890 2,327 715 966 2,477 1,088
Qwen3-30B-A3B DeepSeek v3.1 Distill (FP8) (easy) 36,762,408 3102 10,274 1,127 551 953 1,806 445 556 480 1,312 561 955 795 733
Qwen3-30B-A3B DeepSeek v3.1 Distill (FP8) (medium) 68,031,485 3803 15,498 2,074 2,468 576 1,657 699 458 1,376 1,457 573 955 1,967 1,238
Qwen3-30B-A3B DeepSeek v3.1 Distill (FP8) (hard) 71,725,328 4208 14,888 2,265 2,178 206 1,330 1,078 202 1,855 1,454 554 859 1,739 1,168
Apriel-1.5-15b-Thinker (FP16) (easy) 25,069,379 2100 9,969 685 233 644 2,160 569 813 444 1,168 533 1,092 817 811
Apriel-1.5-15b-Thinker (FP16) (medium) 50,003,151 2821 13,049 630 1,023 584 2,293 823 720 1,327 1,228 371 1,142 1,934 974
Apriel-1.5-15b-Thinker (FP16) (hard) 55,723,294 3374 11,467 338 992 273 2,177 1,232 359 1,756 970 228 1,140 1,382 620
QwQ 32B (AWQ) (easy) 36,511,724 2886 11,449 1,125 362 881 2,321 477 820 603 1,656 470 1,075 895 764
QwQ 32B (AWQ) (medium) 73,382,414 3672 17,020 1,546 1,390 643 2,128 732 995 1,651 2,346 505 1,072 2,926 1,086
QwQ 32B (AWQ) (hard) 77,741,526 4107 16,034 1,211 1,361 322 2,108 1,047 604 2,224 2,544 654 956 2,157 846
Apriel-Nemotron-1.5-15b-Thinker (FP16) (easy) 25,678,966 1494 14,915 2,410 871 1,980 2,238 542 949 448 1,753 764 1,233 959 768
Apriel-Nemotron-1.5-15b-Thinker (FP16) (medium) 54,062,782 1950 24,046 4,166 2,338 2,578 2,715 829 1,401 1,440 2,571 891 1,295 2,897 925
Apriel-Nemotron-1.5-15b-Thinker (FP16) (hard) 64,123,746 2255 24,958 4,455 2,443 1,926 2,932 1,274 1,475 1,919 2,488 956 1,295 3,082 713
aquif-3.5 8B (FP16) (easy) 24,717,514 1395 15,476 2,569 854 1,544 2,612 575 1,191 512 1,888 670 1,365 896 800
aquif-3.5 8B (FP16) (medium) 50,795,435 1775 25,363 4,480 2,972 2,133 2,912 894 1,480 1,408 2,613 923 1,427 3,226 895
aquif-3.5 8B (FP16) (hard) 59,288,272 2100 25,260 4,553 2,845 1,343 2,980 1,371 1,378 1,920 2,345 957 1,432 3,404 732
Qwen3-Next-80B-A3B Thinking (AWQ) (easy) 31,840,100 3453 7,954 918 318 476 1,604 413 541 480 987 152 949 701 415
Qwen3-Next-80B-A3B Thinking (AWQ) (medium) 59,204,594 4197 11,779 1,360 1,337 485 1,443 699 681 1,466 1,135 69 1,004 1,472 628
Qwen3-Next-80B-A3B Thinking (AWQ) (hard) 66,296,992 4698 11,285 1,099 1,245 274 1,442 1,046 513 1,941 964 48 887 1,139 687
Magistral Small 2509 (FP8) (easy) 29,523,753 1654 15,213 2,943 1,012 1,327 2,337 603 1,303 512 1,280 798 1,405 768 925
Magistral Small 2509 (FP8) (medium) 56,711,835 2066 24,662 4,609 3,185 1,343 2,811 952 1,515 1,504 2,206 1,084 1,437 2,747 1,269
Magistral Small 2509 (FP8) (hard) 63,443,463 2343 25,503 4,305 2,930 771 2,902 1,331 1,498 2,016 2,582 1,087 1,375 3,659 1,047
Llama-Nemotron-Super 49B v1.5 (INT8) (easy) 31,021,098 2606 10,043 1,553 356 639 1,609 511 860 543 1,213 276 1,029 765 689
Llama-Nemotron-Super 49B v1.5 (INT8) (medium) 61,560,487 3205 15,991 2,061 1,435 341 1,937 764 1,329 1,375 1,866 492 1,118 2,172 1,101
Llama-Nemotron-Super 49B v1.5 (INT8) (hard) 71,234,100 3640 16,576 1,546 1,338 122 2,176 1,143 1,452 1,854 2,180 550 1,264 2,026 925
GLM-4.5 Air (AWQ) (easy) 36,747,440 2636 11,721 1,791 699 943 2,110 478 604 448 1,632 685 1,001 730 600
GLM-4.5 Air (AWQ) (medium) 68,265,410 3153 18,440 2,589 2,695 875 2,369 733 382 1,375 1,433 1,046 982 2,920 1,041
GLM-4.5 Air (AWQ) (hard) 78,140,372 3548 19,174 2,188 2,502 477 2,674 1,082 216 1,854 2,013 1,106 941 2,977 1,144
Hunyuan A13B-Instruct (GPTQ) (easy) 24,537,772 1431 16,067 2,190 880 1,547 2,239 571 1,363 512 2,304 827 1,594 1,184 856
Hunyuan A13B-Instruct (GPTQ) (medium) 53,254,183 1884 26,138 4,027 2,918 2,282 2,748 857 1,400 1,632 2,816 1,057 1,528 3,838 1,035
Hunyuan A13B-Instruct (GPTQ) (hard) 62,584,085 2249 24,903 4,040 2,517 1,627 2,895 1,364 871 2,208 2,102 1,016 1,370 4,169 724
Qwen3-8B Original (FP16) (easy) 35,111,857 2590 12,258 1,677 756 474 2,235 603 909 512 1,950 498 1,050 831 763
Qwen3-8B Original (FP16) (medium) 76,998,713 3260 20,779 3,014 2,954 419 2,584 890 1,417 1,568 2,557 573 1,050 2,543 1,210
Qwen3-8B Original (FP16) (hard) 95,954,003 3621 23,081 3,449 3,462 284 2,538 1,366 979 2,432 2,268 540 1,148 3,603 1,012
Qwen3-30B-A3B Instruct-2507 (AWQ) (easy) 20,178,612 1353 12,385 1,593 673 1,091 1,920 605 1,182 544 1,664 729 881 831 672
Qwen3-30B-A3B Instruct-2507 (AWQ) (medium) 40,127,418 1754 19,707 3,050 2,274 762 2,103 987 1,498 1,504 2,399 942 879 2,639 670
Qwen3-30B-A3B Instruct-2507 (AWQ) (hard) 50,186,835 2116 21,435 3,734 2,337 338 2,094 1,462 1,498 2,016 2,841 970 815 2,765 565
Qwen3-4B Thinking-2507 (FP16) (easy) 49,757,627 4384 10,065 1,840 433 427 1,695 500 626 511 1,557 412 828 671 565
Qwen3-4B Thinking-2507 (FP16) (medium) 82,503,525 5073 14,051 2,681 1,827 263 1,567 751 372 1,405 1,656 163 823 1,740 803
Qwen3-4B Thinking-2507 (FP16) (hard) 83,141,988 5415 12,500 2,091 1,514 80 1,430 1,116 174 1,913 1,527 144 719 1,187 605
Qwen3-4B Original (FP16) (easy) 39,463,143 2472 14,124 2,181 1,068 778 2,523 634 1,202 544 1,599 550 1,213 1,010 822
Qwen3-4B Original (FP16) (medium) 78,796,516 3151 22,477 3,555 3,031 686 2,808 947 1,475 1,536 2,411 396 1,458 3,142 1,032
Qwen3-4B Original (FP16) (hard) 89,893,549 3569 22,641 3,324 2,995 396 2,841 1,451 1,117 2,080 2,312 414 1,395 3,532 784
Qwen3-4B Instruct-2507 (FP16) (easy) 25,086,642 1456 15,716 2,797 1,037 853 2,157 633 1,213 512 1,888 895 1,624 1,119 988
Qwen3-4B Instruct-2507 (FP16) (medium) 49,710,158 1897 24,892 4,658 3,248 627 2,380 1,013 1,530 1,503 2,559 1,117 1,682 3,407 1,168
Qwen3-4B Instruct-2507 (FP16) (hard) 58,408,997 2331 25,285 4,592 3,085 197 2,329 1,521 1,353 2,015 2,783 1,149 1,660 3,636 965
Qwen3-30B-A3B Thinking-2507 (AWQ) (easy) 43,260,618 3568 10,331 1,267 436 683 1,508 476 919 415 1,437 719 971 744 756
Qwen3-30B-A3B Thinking-2507 (AWQ) (medium) 74,154,340 4301 14,124 2,048 1,510 184 1,457 729 704 1,370 1,585 390 1,032 2,001 1,114
Qwen3-30B-A3B Thinking-2507 (AWQ) (hard) 76,194,307 4667 13,035 1,887 1,396 61 1,257 1,077 230 1,912 1,297 254 985 1,756 923
Nemotron Nano 9B v2 (FP16) (easy) 26,431,618 1504 15,925 2,858 1,060 1,023 2,413 604 1,244 480 2,015 535 1,472 1,342 879
Nemotron Nano 9B v2 (FP16) (medium) 54,470,408 1978 25,088 4,325 3,089 985 2,846 986 1,491 1,408 2,486 784 1,787 3,967 934
Nemotron Nano 9B v2 (FP16) (hard) 61,480,100 2339 24,116 3,991 2,779 423 3,006 1,462 1,176 1,920 1,930 847 1,723 4,254 605
Hermes-4 14B (FP8) (easy) 31,012,323 1870 13,529 2,525 582 504 1,659 533 1,196 575 1,887 830 1,428 1,036 774
Hermes-4 14B (FP8) (medium) 61,401,273 2304 21,928 4,293 1,985 351 2,005 911 1,437 1,657 2,743 1,037 1,413 3,331 765
Hermes-4 14B (FP8) (hard) 68,721,936 2612 21,213 4,287 1,802 240 1,890 1,384 1,184 2,262 2,456 1,040 1,283 2,879 506
R1-0528-Qwen3-8B (FP16) (easy) 49,778,494 3087 13,892 2,024 642 1,003 2,032 555 942 480 2,164 512 1,686 991 861
R1-0528-Qwen3-8B (FP16) (medium) 83,511,661 3615 19,983 2,790 2,300 630 2,211 929 264 1,504 2,302 731 1,968 3,538 816
R1-0528-Qwen3-8B (FP16) (hard) 83,385,639 3887 18,647 2,065 2,069 201 2,251 1,358 221 2,047 1,429 816 1,758 3,803 629
Hermes-4 70B (AWQ) (easy) 30,019,066 1784 10,623 2,788 544 114 953 526 1,135 471 1,203 367 1,023 1,052 447
Hermes-4 70B (AWQ) (medium) 56,247,853 2273 16,716 3,811 1,775 82 1,174 834 1,165 1,448 1,658 501 998 2,712 558
Hermes-4 70B (AWQ) (hard) 63,248,929 2641 16,660 3,169 1,723 36 1,188 1,171 811 1,949 1,786 624 973 2,640 590
Ring Mini 2.0 (FP16) (easy) 44,576,866 3616 10,421 1,005 645 556 1,761 580 839 544 1,568 309 849 1,245 520
Ring Mini 2.0 (FP16) (medium) 85,774,200 4359 16,565 883 2,076 488 2,369 1,169 195 1,663 1,939 307 866 3,899 711
Ring Mini 2.0 (FP16) (hard) 83,837,586 4669 14,318 469 1,800 201 2,434 1,321 89 2,239 1,119 323 665 3,157 501
aquif-3.5 A4B (FP16) (easy) 38,124,382 3009 10,701 1,168 293 271 1,952 536 1,279 576 1,875 137 1,068 987 559
aquif-3.5 A4B (FP16) (medium) 68,910,321 3783 13,960 1,211 943 103 1,560 852 699 1,597 2,399 59 1,055 2,953 529
aquif-3.5 A4B (FP16) (hard) 68,980,479 4161 11,774 771 909 29 1,113 1,349 231 2,141 1,933 45 1,005 1,846 402
Gemma3-27B-It (FP16) (easy) 5,951,843 355 15,232 2,616 1,150 1,112 2,553 602 704 416 2,176 1,023 1,024 1,280 576
Gemma3-27B-It (FP16) (medium) 12,422,760 447 22,572 3,394 2,311 1,341 2,872 981 508 1,472 2,784 895 1,184 4,224 606
Gemma3-27B-It (FP16) (hard) 14,841,446 514 22,619 2,716 2,112 966 2,909 1,393 497 1,952 2,688 831 1,120 4,830 605
Llama-3.3-70B (FP8) (easy) 5,523,294 370 13,680 2,560 1,088 598 2,464 605 798 416 1,536 1,024 959 960 672
Llama-3.3-70B (FP8) (medium) 13,435,170 486 21,961 3,326 3,296 597 2,944 1,019 509 1,440 2,368 896 1,023 3,839 704
Llama-3.3-70B (FP8) (hard) 16,966,749 571 21,991 2,878 3,168 376 2,880 1,432 415 1,920 2,624 832 1,087 3,739 640
Phi-4 (FP16) (easy) 6,417,477 391 15,279 2,654 1,371 761 2,688 637 832 448 2,176 992 1,184 800 736
Phi-4 (FP16) (medium) 13,220,775 502 22,693 3,646 3,546 694 2,944 1,083 511 1,472 2,592 1,056 1,246 3,200 703
Phi-4 (FP16) (hard) 15,621,380 597 22,273 2,907 3,296 597 2,944 1,560 416 1,952 1,824 992 1,310 3,904 571
Hunyuan 7B-Instruct (FP16) (easy) 29,344,139 1907 13,729 414 1,000 561 2,187 858 1,465 472 2,496 857 1,608 864 947
Hunyuan 7B-Instruct (FP16) (medium) 54,216,234 2663 19,652 321 3,127 199 2,625 1,303 1,131 1,464 2,588 1,074 1,666 3,251 903
Hunyuan 7B-Instruct (FP16) (hard) 61,320,090 2998 19,147 185 2,908 96 2,657 1,778 778 1,995 1,905 1,103 1,463 3,742 537
Gemma3-12B-It (FP16) (easy) 6,299,479 357 15,121 2,329 1,292 639 2,609 478 768 512 2,656 894 1,152 1,184 608
Gemma3-12B-It (FP16) (medium) 13,264,347 463 21,569 2,593 3,091 863 2,778 892 508 1,600 2,912 540 1,184 3,936 672
Gemma3-12B-It (FP16) (hard) 15,543,276 552 21,629 2,079 2,781 863 2,808 1,369 406 2,112 2,336 414 1,120 4,701 640
granite-4.0-h small (FP16) (easy) 6,522,847 304 15,751 2,229 967 724 2,688 796 701 576 2,463 959 1,024 1,952 672
granite-4.0-h small (FP16) (medium) 12,541,567 377 21,364 2,675 1,758 799 2,912 1,274 536 1,760 2,298 767 1,246 4,669 670
granite-4.0-h small (FP16) (hard) 14,055,779 432 18,541 1,761 1,681 716 2,976 1,783 334 2,301 1,331 736 1,310 3,016 596
Hunyuan 4B-Instruct (FP16) (easy) 28,908,510 1852 14,138 953 370 1,402 2,643 694 1,329 480 2,431 153 1,781 1,183 719
Hunyuan 4B-Instruct (FP16) (medium) 51,396,935 2502 18,281 626 738 1,523 2,764 1,107 913 1,503 2,525 95 1,889 3,995 603
Hunyuan 4B-Instruct (FP16) (hard) 56,421,969 2961 16,623 326 761 592 2,672 1,611 560 2,013 1,629 86 1,764 4,172 437
R1-Distill-Llama-8B (FP16) (easy) 34,274,431 1693 17,190 1,686 1,382 1,006 2,592 952 1,362 574 2,176 873 2,065 1,742 780
R1-Distill-Llama-8B (FP16) (medium) 59,928,466 2223 22,121 1,946 3,078 537 2,743 1,429 773 1,728 1,568 734 2,223 4,672 690
R1-Distill-Llama-8B (FP16) (hard) 59,798,754 2477 20,347 1,642 2,610 321 2,560 1,904 467 2,336 1,107 717 2,040 4,133 510
Qwen3-1.7B (AWQ) (easy) 46,900,595 2516 16,747 2,270 1,205 551 2,553 889 1,095 576 2,303 1,095 1,561 1,917 732
Qwen3-1.7B (AWQ) (medium) 74,092,131 3205 20,765 2,993 1,824 276 2,980 1,392 407 1,728 1,662 1,019 1,687 4,240 557
Qwen3-1.7B (AWQ) (hard) 76,506,761 3621 19,411 1,955 1,734 139 2,841 1,868 261 2,304 1,009 1,006 1,560 4,343 391
ERNIE-4.5-21B-A3B Thinking (AWQ) (easy) 51,617,179 3465 12,160 1,295 881 372 2,161 597 873 572 2,089 380 707 1,567 666
ERNIE-4.5-21B-A3B Thinking (AWQ) (medium) 80,978,436 4119 15,210 1,370 2,779 293 2,052 968 186 1,712 2,089 176 599 2,461 525
ERNIE-4.5-21B-A3B Thinking (AWQ) (hard) 76,649,834 4451 12,946 831 2,567 173 1,944 1,435 170 2,310 1,116 122 584 1,263 431
Phi-4 Mini Reasoning (FP16) (easy) 54,232,057 3078 14,642 2,385 1,096 88 2,524 717 1,275 625 2,472 297 1,143 1,351 669
Phi-4 Mini Reasoning (FP16) (medium) 80,513,695 3669 16,630 3,237 2,083 32 1,778 1,092 786 1,498 1,887 265 1,107 2,417 448
Phi-4 Mini Reasoning (FP16) (hard) 72,414,552 4035 12,947 2,346 1,914 19 1,062 1,564 440 2,015 1,004 286 1,052 965 280
SmolLM3 3B (FP16) (easy) 32,781,845 1633 14,538 2,104 1,069 253 1,272 968 1,282 565 1,907 831 1,662 1,914 711
SmolLM3 3B (FP16) (medium) 55,699,395 2086 18,762 2,540 2,244 196 1,674 1,449 772 1,772 1,284 928 1,787 3,601 515
SmolLM3 3B (FP16) (hard) 53,593,811 2388 16,147 1,916 2,062 90 1,678 1,907 548 2,378 867 893 1,699 1,805 304
Gemma3-4B-It (FP16) (easy) 6,895,460 348 15,059 2,027 877 667 2,639 795 672 608 2,619 447 1,184 2,015 509
Gemma3-4B-It (FP16) (medium) 12,829,122 472 20,453 2,067 1,968 844 2,996 1,294 384 1,824 2,431 287 1,280 4,640 438
Gemma3-4B-It (FP16) (hard) 13,098,828 547 18,632 1,520 1,889 839 2,871 1,803 448 2,432 1,568 319 1,344 3,200 399
granite-3.3 8B Instruct (FP16) (easy) 9,376,264 593 16,089 1,632 1,532 787 2,752 923 671 544 2,368 561 1,663 2,016 640
granite-3.3 8B Instruct (FP16) (medium) 15,560,912 719 20,396 1,501 3,376 767 2,848 1,428 445 1,728 1,632 308 1,662 4,126 575
granite-3.3 8B Instruct (FP16) (hard) 16,371,771 771 18,390 1,240 3,083 517 2,848 1,874 377 2,368 1,054 305 1,342 2,936 446
Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) (easy) 41,113,288 3125 9,604 1,902 199 165 757 954 1,038 542 2,210 155 686 653 343
Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) (medium) 62,633,572 3994 9,798 1,902 755 46 729 1,448 291 1,480 1,483 101 678 669 216
Llama-3.1-Nemotron-Nano-4B-v1.1 (FP16) (hard) 64,379,784 4491 8,493 1,088 776 13 596 1,918 144 2,011 765 68 700 267 147
Llama-3.1-Nemotron-Nano-8B (FP16) (easy) 52,725,257 1206 22,010 4,321 582 997 2,558 1,102 1,214 999 3,559 151 3,656 2,259 612
Llama-3.1-Nemotron-Nano-8B (FP16) (medium) 53,888,942 1563 18,322 3,565 2,042 699 2,121 814 448 1,414 1,937 106 2,125 2,802 249
Llama-3.1-Nemotron-Nano-8B (FP16) (hard) 51,786,393 1725 15,900 2,735 1,698 595 2,036 1,189 384 1,889 1,126 93 2,076 1,865 214
AI21 Jamba Reasoning 3B (FP16) (easy) 49,040,340 3090 11,600 1,299 608 451 2,158 811 500 410 1,714 540 1,229 1,509 371
AI21 Jamba Reasoning 3B (FP16) (medium) 76,612,547 3877 17,259 1,700 2,838 517 2,826 1,250 314 1,409 1,465 469 1,251 2,850 370
AI21 Jamba Reasoning 3B (FP16) (hard) 76,016,642 4237 16,735 1,381 2,876 395 2,943 1,754 286 2,035 993 488 1,288 1,943 353
granite-4.0-h tiny (FP16) (easy) 7,369,188 207 13,679 1,601 715 391 2,432 985 480 640 1,949 503 1,408 1,952 623
granite-4.0-h tiny (FP16) (medium) 15,630,823 266 14,540 1,184 1,443 195 2,783 1,492 383 1,568 1,055 309 1,566 2,079 483
granite-4.0-h tiny (FP16) (hard) 19,347,688 298 13,820 810 1,417 63 3,008 2,000 384 2,047 647 309 1,374 1,427 334
granite-4.0-h micro (FP16) (easy) 4,560,449 256 14,168 1,703 1,476 607 2,912 986 639 640 765 672 1,438 1,888 442
granite-4.0-h micro (FP16) (medium) 7,731,848 336 17,259 1,540 3,159 895 2,880 1,431 470 1,824 760 320 1,278 2,304 398
granite-4.0-h micro (FP16) (hard) 8,598,854 383 16,392 1,255 2,802 895 2,816 1,877 374 2,432 746 288 1,022 1,533 352
Llama-3.1-8B (FP16) (easy) 13,071,823 366 13,986 1,365 591 200 2,910 956 857 534 2,622 353 1,113 1,950 535
Llama-3.1-8B (FP16) (medium) 21,715,813 512 15,059 1,326 1,266 156 2,811 1,204 462 1,772 1,977 190 1,195 2,099 601
Llama-3.1-8B (FP16) (hard) 24,165,191 583 13,839 1,059 1,219 85 2,750 1,681 365 2,376 1,178 179 1,105 1,412 430
Llama-3.2-3B (FP16) (easy) 8,164,023 334 12,588 1,161 803 317 2,823 972 373 633 2,432 188 920 1,625 341
Llama-3.2-3B (FP16) (medium) 14,517,297 382 14,312 1,269 1,003 308 2,829 1,391 365 1,784 1,951 187 982 1,892 351
Llama-3.2-3B (FP16) (hard) 16,644,490 429 13,613 1,034 961 273 2,769 1,797 373 2,379 1,116 186 950 1,469 306
Phi-4 Mini Flash Reasoning (FP16) (easy) 31,116,228 1963 5,679 698 45 87 620 832 445 94 1,387 113 1,065 157 136

Overall Totals

Unique Models: 53

Total Tokens (All Models): 7,109,826,885

Total Tests (All Models): 2,646,483
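For downstream analysis it helps to slice these rows programmatically. A minimal sketch, assuming each row keeps the one-line layout above (model name, parenthesized difficulty, then fifteen numbers); parse_row is a hypothetical helper, not part of the ReasonScape tooling:

import re

ROW = re.compile(r"^(?P<model>.+) \((?P<difficulty>easy|medium|hard)\) (?P<numbers>[\d, ]+)$")

def parse_row(line: str) -> dict:
    # Split a Resource Usage row into model name, difficulty, and the
    # fifteen numeric columns described in the legend above.
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised row: {line!r}")
    nums = [int(n.replace(",", "")) for n in m.group("numbers").split()]
    total_tokens, avg_tokens, total_tests, *per_task = nums
    return {
        "model": m.group("model"),
        "difficulty": m.group("difficulty"),
        "total_tokens": total_tokens,
        "avg_tokens_per_completion": avg_tokens,
        "total_tests": total_tests,
        "per_task_tests": per_task,  # 12 counts, Arithmetic through Sort
    }

row = parse_row("GPT-OSS-120B (MX4) (easy) 8,074,603 762 11,451 "
                "1,088 574 831 2,560 543 544 448 1,216 544 1,439 1,056 608")
assert row["model"] == "GPT-OSS-120B (MX4)" and len(row["per_task_tests"]) == 12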

Evaluation Workflows

Rapid Assessment (2-3 hours)

python runner.py --config configs/m12x.yaml --degree 0 --density normal --precision low
  • Quick capability screening across all 12 domains
  • Comprehensive sampling with basic statistical confidence
  • Efficient resource utilization

Standard Evaluation (8-12 hours)

python runner.py --config configs/m12x.yaml --degree 1 --density normal --precision medium
  • Comprehensive assessment with moderate difficulty
  • Balanced parameter space coverage
  • Publication-ready statistical rigor

Research-Grade Analysis (20+ hours)

python runner.py --config configs/m12x.yaml --degree 2 --density normal --precision high
  • Maximum difficulty revealing failure modes
  • Complete parameter space exploration
  • Research-grade confidence intervals and comprehensive cognitive architecture analysis
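The three workflows can also be run as a single sweep. A minimal sketch that shells out to the runner with exactly the flags shown above, pairing each degree with its suggested precision:

import subprocess

# Degrees 0/1/2 paired with precision low/medium/high, as in the
# three workflow commands above.
for degree, precision in [(0, "low"), (1, "medium"), (2, "high")]:
    subprocess.run(
        ["python", "runner.py",
         "--config", "configs/m12x.yaml",
         "--degree", str(degree),
         "--density", "normal",
         "--precision", precision],
        check=True,  # stop the sweep if a run fails
    )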

Citation

When using M12X in research, please cite:

@software{reasonscape_m12x2025,
  title={M12X: Comprehensive 12-Domain Reasoning Evaluation},
  author={Mikhail Ravkine},
  year={2025},
  url={https://github.com/the-crypt-keeper/reasonscape},
  note={Part of ReasonScape evaluation methodology}
}
