Framework 1.1 Trial Results
Date: 2026-01-05
Framework Version: 1.1
Total Trials: 13,092 across 17 models
Executive Summary
This document summarizes the results of LLM trials testing the learnability of Faber, a Latin-inspired programming language designed as an LLM-friendly intermediate representation.
Key Findings:
- Grammar-only context matches or beats prose documentation — Formal EBNF grammar (87%) equals the most verbose prose reference (87%) while using fewer tokens, and outperforms shorter prose contexts (83-85%) and bare examples (59%)
- Coding models match frontier models — qwen3-coder (96%) rivals gpt-4o (98%) at 1/10th the cost
- Prediction tasks measure different skills — Tasks requiring mental transpilation (predict_output) should be excluded from read/write competency metrics
- Small models struggle regardless of context — The smallest model tested, llama-3.2-1b, stays below 20% accuracy across all contexts and metrics
Methodology
Trial Configuration
- Total Trials: 13,092
- Models Tested: 17
- Task Types: 53 unique tasks
- N-shot Variations: 0, 1, 3, 10 examples
- Context Types: 5 (examples-only, grammar-only, minimal, basic, complete)
- Temperature: 0.0 (deterministic)
- Dialect: Latin only
Models Tested
| Category | Models |
|---|---|
| OpenAI | gpt-3.5-turbo, gpt-4o-mini, gpt-4o, gpt-5 |
| Anthropic | claude-3-haiku, claude-3.5-sonnet, claude-4.5-sonnet |
| Meta | llama-3.1-8b, llama-3.1-70b, llama-3.2-1b, llama-3.2-3b |
| Coding-focused | codestral, deepseek-v3.1, mercury-coder, qwen3-coder, qwen2.5-coder-32b |
| Other | mistral-7b |
Context Types
| Context | Description |
|---|---|
| examples-only | No documentation, only n-shot examples |
| grammar-only | Formal EBNF grammar + keyword mappings |
| minimal | Brief keyword vocabulary list |
| basic | Quick reference with types, keywords, syntax rules |
| complete | Full grammar reference with all features |
Task Categories
- translate_ts_to_faber: Write Faber given TypeScript
- translate_faber_to_ts: Write TypeScript given Faber
- complete_code: Fill in missing Faber keyword
- predict_output: Predict runtime output (excluded from primary metrics)
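To make the task types concrete, the sketch below shows one hypothetical example of each, assembled only from the Faber constructs cited elsewhere in this report (scribe, verum, numerus, type-first declarations). It is illustrative; the actual prompts and expected answers in the trial suite may differ.

```typescript
// Hypothetical task examples, for illustration only; not taken from the trial suite.
// Faber fragments use only constructs mentioned in this report.
type TrialTask = {
  kind: "translate_ts_to_faber" | "translate_faber_to_ts" | "complete_code" | "predict_output";
  prompt: string;
  expected: string;
};

const sampleTasks: TrialTask[] = [
  // Write Faber given TypeScript (type-first declaration instead of a type annotation)
  { kind: "translate_ts_to_faber", prompt: "const x: number = 5;", expected: "numerus x = 5;" },
  // Write TypeScript given Faber (scribe(verum) prints true)
  { kind: "translate_faber_to_ts", prompt: "scribe(verum);", expected: "console.log(true);" },
  // Fill in the missing Faber keyword
  { kind: "complete_code", prompt: "______ x = 5;", expected: "numerus" },
  // Predict the runtime output (mental transpilation; excluded from primary metrics)
  { kind: "predict_output", prompt: "scribe(verum);", expected: "true" },
];
```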
Overall Results
All Tasks (Including predict_output)
| Model | Passed | Total | Accuracy |
|---|---|---|---|
| gpt-4o | 761 | 840 | 91% |
| qwen3-coder | 903 | 1008 | 90% |
| gpt-5 | 595 | 672 | 89% |
| gpt-4o-mini | 736 | 840 | 88% |
| claude-3.5-sonnet | 739 | 840 | 88% |
| llama-3.1-70b | 726 | 840 | 86% |
| codestral | 725 | 840 | 86% |
| deepseek-v3.1 | 710 | 840 | 85% |
| gpt-3.5-turbo | 1252 | 1498 | 84% |
| claude-4.5-sonnet | 520 | 672 | 77% |
| mercury-coder | 614 | 840 | 73% |
| llama-3.1-8b | 733 | 1009 | 73% |
| claude-3-haiku | 638 | 925 | 69% |
| llama-3.2-1b | 173 | 1093 | 16% |
Note: mistral-7b (100%, n=18), llama-3.2-3b (40%, n=35), qwen2.5-coder-32b (0%, n=282) excluded due to incomplete runs
Read/Write Tasks Only (Excluding predict_output)
The predict_output tasks test mental transpilation (understanding that scribe(verum) outputs true), which is a different skill from read/write competency. Excluding these:
| Model | Passed | Total | Accuracy |
|---|---|---|---|
| gpt-4o | 559 | 600 | 93% |
| qwen3-coder | 662 | 720 | 92% |
| deepseek-v3.1 | 550 | 600 | 92% |
| claude-3.5-sonnet | 552 | 600 | 92% |
| gpt-5 | 435 | 480 | 91% |
| claude-4.5-sonnet | 435 | 480 | 91% |
| codestral | 537 | 600 | 90% |
| gpt-4o-mini | 536 | 600 | 89% |
| llama-3.1-70b | 528 | 600 | 88% |
| claude-3-haiku | 581 | 661 | 88% |
| llama-3.1-8b | 604 | 721 | 84% |
| gpt-3.5-turbo | 882 | 1091 | 81% |
| mercury-coder | 479 | 600 | 80% |
| llama-3.2-1b | 118 | 787 | 15% |
Observation: Top models cluster tightly at 88-93% when measuring actual code generation ability.
Results by Context Type
Excluding predict_output tasks:
| Context | Passed | Total | Accuracy |
|---|---|---|---|
| grammar-only | 1381 | 1590 | 87% |
| complete | 1572 | 1817 | 87% |
| basic | 1526 | 1801 | 85% |
| minimal | 1846 | 2218 | 83% |
| examples-only | 1162 | 1961 | 59% |
Key Insight: Formal EBNF grammar (grammar-only) matches the most verbose documentation (complete) while using fewer tokens. Examples alone, without any documentation, perform poorly (59%).
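For reference, the grammar-only context pairs formal EBNF production rules with Faber-to-TypeScript keyword mappings. The fragment below is a hypothetical reconstruction of what such a mapping might contain, inferred only from keywords that appear in this report's task names and examples; it is not the actual context file.

```typescript
// Hypothetical keyword-mapping fragment, inferred from task names and examples
// in this report (functio, si, verum, numerus, scribe); the real grammar-only
// context also includes the formal EBNF production rules.
const keywordMap: Record<string, string> = {
  functio: "function",   // function declaration keyword
  si: "if",              // conditional keyword
  verum: "true",         // boolean literal
  numerus: "number",     // numeric type, used type-first (numerus x)
  scribe: "console.log", // output statement
};
```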
Grammar-Only Context Analysis
The grammar-only context provides formal EBNF rules and keyword mappings. This section shows results for this context specifically, excluding predict_output tasks:
| Model | Passed | Total | Accuracy |
|---|---|---|---|
| gpt-4o | 117 | 120 | 98% |
| claude-3.5-sonnet | 117 | 120 | 98% |
| qwen3-coder | 231 | 240 | 96% |
| llama-3.1-70b | 114 | 120 | 95% |
| deepseek-v3.1 | 114 | 120 | 95% |
| gpt-4o-mini | 111 | 120 | 92% |
| codestral | 111 | 120 | 92% |
| claude-3-haiku | 108 | 120 | 90% |
| gpt-3.5-turbo | 133 | 150 | 89% |
| mercury-coder | 105 | 120 | 88% |
| llama-3.1-8b | 103 | 120 | 86% |
| llama-3.2-1b | 17 | 120 | 14% |
Key Findings:
- Frontier models (gpt-4o, claude-3.5-sonnet) achieve 98% accuracy
- Coding-focused model qwen3-coder (96%) nearly matches at ~10x lower cost
- Mid-tier models cluster at 86-92%
- Only llama-3.2-1b (1B params) fails to learn effectively
Error Analysis
Excluding predict_output tasks:
| Error Type | Count | % of Failures |
|---|---|---|
| type_error | 1308 | 69% |
| no_response | 236 | 12% |
| syntax_error | 192 | 10% |
| wrong_output | 152 | 8% |
| runtime_error | 14 | <1% |
Interpretation: Most failures are type errors (likely TypeScript-style syntax such as x: number instead of numerus x). Runtime errors are rare, indicating that models produce structurally valid code.
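As a hedged illustration of this dominant failure mode (not taken from actual trial logs), a typical type_error would look something like the following, assuming the type-first declaration form described above:

```typescript
// Hypothetical type_error failure, for illustration only; not from actual trial logs.
const typicalTypeError = {
  prompt: "Translate to Faber: const x: number = 5;",
  modelOutput: "x: number = 5;", // TypeScript-style annotation carried over
  expected: "numerus x = 5;",    // Faber's type-first declaration
  errorType: "type_error",
};
```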
Task Difficulty Analysis
Hardest Tasks (Excluding predict_output)
| Task | Pass Rate | Notes |
|---|---|---|
| translate_conditional | 17% | Complex control flow |
| translate_function | 17% | Function syntax |
| translate_if_else | 17% | Control flow |
| complete_variable_declaration | 25% | Type-first syntax |
| translate_array_literal | 33% | Collection syntax |
| ts_to_faber_function_string | 54% | Function + string return |
| ts_to_faber_boolean | 56% | Boolean handling |
Easiest Tasks
| Task | Pass Rate | Notes |
|---|---|---|
| faber_to_ts_functio_string | 95% | Reading Faber |
| faber_to_ts_arithmetic | 94% | Reading Faber |
| faber_to_ts_si_true | 93% | Reading Faber |
| faber_to_ts_ex_pro | 93% | Reading Faber |
| faber_to_ts_functio | 92% | Reading Faber |
Key Pattern: Reading Faber (faber_to_ts) is significantly easier than writing Faber (ts_to_faber). Models can interpret Faber keywords but struggle to produce them correctly.
Cost Efficiency (Grammar-Only Context)
| Model | Accuracy | Time | Cost | Cost per Correct |
|---|---|---|---|---|
| qwen3-coder | 95% | 179s | $0.03 | $0.0002 |
| deepseek-v3.1 | 88% | 166s | $0.02 | $0.0001 |
| gpt-4o-mini | 93% | 119s | $0.02 | $0.0001 |
| codestral | 90% | 70s | $0.04 | $0.0003 |
| llama-3.1-70b | 91% | 180s | $0.05 | $0.0003 |
| gpt-3.5-turbo | 92% | 83s | $0.07 | $0.0005 |
| gpt-4o | 95% | 106s | $0.34 | $0.0021 |
| claude-3.5-sonnet | 93% | 289s | $0.49 | $0.0031 |
Best Value: qwen3-coder and deepseek-v3.1 provide 88-95% accuracy at $0.03 or less per 168-task run.
Fastest: codestral (70s) with 90% accuracy.
Highest Accuracy: gpt-4o and qwen3-coder tie at 95%, but qwen3-coder is 11x cheaper.
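The Cost per Correct column appears to follow directly from the other columns: divide the run cost by the number of correct answers (accuracy times the 168 tasks per run). A quick sketch of that arithmetic:

```typescript
// Derives cost per correct answer from run cost, accuracy, and task count.
// 168 is the per-run task count cited above.
function costPerCorrect(runCostUsd: number, accuracy: number, tasks = 168): number {
  return runCostUsd / (accuracy * tasks);
}

console.log(costPerCorrect(0.03, 0.95).toFixed(4)); // qwen3-coder: ~$0.0002 per correct answer
console.log(costPerCorrect(0.34, 0.95).toFixed(4)); // gpt-4o:      ~$0.0021 per correct answer
```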
Conclusions
Primary Findings
Faber is learnable by LLMs: With proper context (grammar-only), 11 of 12 tested models achieve 86%+ accuracy on read/write tasks.
Formal grammar matches or beats prose: EBNF grammar (87%) equals the most verbose documentation (87%) while using fewer tokens and outperforms briefer prose contexts (83-85%). Models trained on code appear to prefer structured specifications.
Reading > Writing: Models achieve 90-95% on faber_to_ts but only 54-65% on ts_to_faber. Generating novel Faber syntax is harder than interpreting it.
Coding models are cost-effective: qwen3-coder (96%) and deepseek-v3.1 (95%) come within a few points of gpt-4o (98%) on grammar-only context at 10-15x lower cost.
predict_output tests different skills: These tasks measure mental transpilation, not read/write competency. Excluding them gives a clearer picture of syntax learning.
Recommendations for Future Trials
- Remove or reclassify predict_output tasks — They test compilation semantics, not syntax competency.
- Focus on ts_to_faber tasks — These are the hardest and most relevant for the "LLM drafts Faber" workflow.
- Use grammar-only context as default — It's compact, effective, and preferred by coding models.
- Add a Faber-English ablation — To test whether Latin keywords specifically help, or just the regular structure.
- Add multi-pass refinement — Test whether self-correction improves accuracy on hard tasks.
Appendix: Trial Run Summary
- Total cost across all trials: ~$15-20
- Total time: ~4 hours wall clock (parallel runs)
- Framework version: 1.1
- Date: 2026-01-05