# Thesis and Trials
Can LLMs learn Faber effectively? The faber-romanus project includes an evaluation harness to test this systematically.
## Hypothesis
Faber's design choices - Latin vocabulary, regular morphology, consistent syntax - should make Faber easier for LLMs to learn from a few examples than languages optimized for human ergonomics.
## Evaluation
The trials test four dimensions (a configuration sketch follows the list):
- Multiple models - From GPT-3.5 to Llama 3.2 1B
- N-shot learning - 0, 1, 3, and 10 example configurations
- Task types - Translation, completion, prediction, explanation
- Context levels - From examples-only to complete documentation
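The cross-product of these dimensions yields one trial per (model, n-shot, task, context) combination. A minimal TypeScript sketch of that matrix; the type and field names are illustrative, not the faber-romanus harness's actual schema:

```typescript
// Illustrative only - not the harness's real configuration types.
type Task = "translation" | "completion" | "prediction" | "explanation";
type ContextLevel =
  | "examples-only"
  | "grammar-only"
  | "minimal"
  | "basic"
  | "complete";

interface TrialConfig {
  model: string;          // e.g. "gpt-4o", "llama-3.2-1b"
  nShot: 0 | 1 | 3 | 10;  // worked examples included in the prompt
  task: Task;
  context: ContextLevel;
}

// A subset of the 15 models; the full list appears in the tables below.
const models = ["gpt-4o", "gpt-3.5-turbo", "llama-3.2-1b"];
const nShots = [0, 1, 3, 10] as const;
const tasks: Task[] = ["translation", "completion", "prediction", "explanation"];
const contexts: ContextLevel[] = [
  "examples-only",
  "grammar-only",
  "minimal",
  "basic",
  "complete",
];

// Every trial is one point in the cross-product of these dimensions.
const trials: TrialConfig[] = models.flatMap((model) =>
  nShots.flatMap((nShot) =>
    tasks.flatMap((task) =>
      contexts.map((context) => ({ model, nShot, task, context })),
    ),
  ),
);
```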
## Trial Results
| Metric | Value |
|---|---|
| Framework version | 1.1 |
| Total evaluations | 13,270 |
| Models tested | 15 |
| Total cost | $12.04 |
| Total tokens | 9.5M in / 563K out |
| Total time | 18,980.8 s (≈5.3 h) |
### Model Comparison: Cost vs Speed vs Accuracy
| Model | Accuracy | Avg Latency | Cost | Tokens (total) |
|---|---|---|---|---|
| gpt-4o | 89% | 829ms | $1.94 | 707K |
| qwen3-coder | 89% | 1.4s | $0.22 | 926K |
| gpt-3.5-turbo | 89% | 521ms | $0.40 | 762K |
| gpt-5 | 89% | 6.7s | $4.37 | 584K |
| gpt-4o-mini | 88% | 869ms | $0.10 | 630K |
| claude-3.5-sonnet | 88% | 1.8s | $2.21 | 667K |
| llama-3.1-70b | 86% | 1.1s | $0.21 | 609K |
| codestral | 86% | 541ms | $0.24 | 737K |
| deepseek-v3.1 | 85% | 2.0s | $0.10 | 617K |
| claude-4.5-sonnet | 77% | 1.5s | $1.74 | 518K |
| mercury-coder | 73% | 589ms | $0.22 | 834K |
| llama-3.1-8b | 73% | 915ms | $0.04 | 717K |
| claude-3-haiku | 70% | 970ms | $0.22 | 769K |
| llama-3.2-1b | 15% | 486ms | $0.03 | 778K |
| qwen2.5-coder-32b | 0% | 7.2s | $0.02 | 253K |
### Three-Level Grading Breakdown
A = typechecks, B = runs without error, C = correct output.
| Model | Tests | A (Typechecks) | B (Runs) | C (Correct) |
|---|---|---|---|---|
| gpt-4o | 952 | 93% | 93% | 88% |
| qwen3-coder | 1068 | 94% | 94% | 89% |
| gpt-3.5-turbo | 1166 | 91% | 91% | 89% |
| gpt-5 | 672 | 93% | 93% | 89% |
| gpt-4o-mini | 895 | 93% | 93% | 88% |
| claude-3.5-sonnet | 840 | 95% | 95% | 88% |
| llama-3.1-70b | 870 | 91% | 91% | 86% |
| codestral | 964 | 93% | 92% | 86% |
| deepseek-v3.1 | 862 | 95% | 94% | 85% |
| claude-4.5-sonnet | 672 | 93% | 93% | 77% |
| mercury-coder | 840 | 76% | 76% | 73% |
| llama-3.1-8b | 1063 | 90% | 90% | 73% |
| claude-3-haiku | 946 | 92% | 92% | 70% |
| llama-3.2-1b | 1138 | 43% | 43% | 15% |
| qwen2.5-coder-32b | 282 | 29% | 29% | 0% |
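A sketch of how the three grades could be computed for a model's generated TypeScript; the callback signatures and the Grade shape are assumptions for illustration, not the harness's actual grading code:

```typescript
// Hypothetical three-level grader. The typecheck/run callbacks are supplied
// by the caller (e.g. a tsc wrapper and a sandboxed node runner).
interface Grade {
  typechecks: boolean; // A
  runs: boolean;       // B
  correct: boolean;    // C
}

type RunResult = { ok: boolean; stdout: string };

function grade(
  source: string,
  expectedOutput: string,
  typecheck: (src: string) => boolean,
  run: (src: string) => RunResult,
): Grade {
  const typechecks = typecheck(source);
  // This sketch only executes programs that typecheck; the harness may differ.
  const result: RunResult = typechecks ? run(source) : { ok: false, stdout: "" };
  return {
    typechecks,                                   // A: compiles cleanly
    runs: result.ok,                              // B: executes without throwing
    correct: result.ok &&
      result.stdout.trim() === expectedOutput.trim(), // C: produces the expected output
  };
}
```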
### By Context Level
How much documentation context helps models learn Faber.
| Context | Tests | Accuracy |
|---|---|---|
| examples-only | 2681 | 61% |
| grammar-only | 2662 | 82% |
| minimal | 2985 | 77% |
| basic | 2466 | 79% |
| complete | 2476 | 79% |
### By N-shot (Learning Curve)
Effect of few-shot examples on accuracy.
| Examples | Tests | Accuracy |
|---|---|---|
| 0-shot | 3343 | 70% |
| 1-shot | 3073 | 71% |
| 3-shot | 3914 | 80% |
| 10-shot | 2940 | 81% |
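Concretely, an n-shot prompt presumably prepends the chosen context documentation and n worked input/output pairs to the held-out question. A rough sketch, with hypothetical formatting that need not match the prompts the harness actually sends:

```typescript
// Illustrative n-shot prompt assembly; the wording and layout are assumptions.
interface Example {
  input: string;  // e.g. a Faber snippet
  output: string; // e.g. its TypeScript translation
}

function buildPrompt(
  contextDoc: string,  // documentation for the chosen context level
  examples: Example[], // 0, 1, 3, or 10 worked examples
  question: string,    // the held-out task instance
): string {
  const shots = examples
    .map((ex) => `Input:\n${ex.input}\nOutput:\n${ex.output}`)
    .join("\n\n");
  return [contextDoc, shots, `Input:\n${question}\nOutput:`]
    .filter((part) => part.length > 0)
    .join("\n\n");
}
```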
### Error Distribution
Where failures occur (among failed trials only).
| Error Type | Count | % of Failures |
|---|---|---|
| type_error | 1360 | 42% |
| wrong_output | 1345 | 42% |
| no_response | 312 | 10% |
| syntax_error | 201 | 6% |
| runtime_error | 14 | <1% |
### By Task
| Task | Tests | Accuracy |
|---|---|---|
| faber_to_ts_functio_string | 305 | 95% |
| faber_to_ts_arithmetic | 303 | 94% |
| faber_to_ts_ex_pro | 304 | 93% |
| faber_to_ts_si_true | 305 | 92% |
| faber_to_ts_functio | 305 | 92% |
| complex_ts_to_faber_factorial | 12 | 92% |
| complex_ts_to_faber_fibonacci | 12 | 92% |
| complex_ts_to_faber_multi_function | 12 | 92% |
| complex_ts_to_faber_ternary_chain | 12 | 92% |
| complex_ts_to_faber_string_ops | 12 | 92% |
| complex_ts_to_faber_early_return | 12 | 92% |
| complex_ts_to_faber_accumulator | 12 | 92% |
| complex_ts_to_faber_prime_check | 12 | 92% |
| faber_to_ts_fixum | 304 | 91% |
| faber_to_ts_string | 305 | 91% |
| faber_to_ts_si_false | 305 | 91% |
| faber_to_ts_varia | 305 | 90% |
| faber_to_ts_dum | 303 | 89% |
| predict_const_value | 303 | 87% |
| faber_to_ts_boolean | 303 | 85% |
| ts_to_faber_const | 333 | 84% |
| complex_ts_to_faber_if_in_loop | 12 | 83% |
| complex_ts_to_faber_typed_params | 12 | 83% |
| complex_ts_to_faber_find_max | 12 | 83% |
| ts_to_faber_string | 332 | 82% |
| ts_to_faber_arithmetic | 331 | 82% |
| complete_const_keyword | 302 | 81% |
| ts_to_faber_let | 331 | 80% |
| complete_return_keyword | 302 | 79% |
| complete_let_keyword | 302 | 79% |
| ts_to_faber_if_false | 332 | 78% |
| complete_function_keyword | 301 | 78% |
| complete_while_keyword | 301 | 78% |
| ts_to_faber_if_true | 332 | 77% |
| predict_simple_output | 303 | 77% |
| complete_print_keyword | 300 | 77% |
| ts_to_faber_while | 331 | 76% |
| predict_function_math | 302 | 76% |
| predict_arithmetic_parens | 302 | 75% |
| predict_loop_sum | 302 | 75% |
| complex_ts_to_faber_fizzbuzz | 12 | 75% |
| complete_else_keyword | 301 | 74% |
| predict_conditional_true | 303 | 73% |
| complete_loop_keyword | 302 | 72% |
| predict_conditional_false | 304 | 71% |
| ts_to_faber_for_of | 332 | 67% |
| predict_arithmetic_precedence | 303 | 67% |
| complex_ts_to_faber_array_type | 12 | 67% |
| ts_to_faber_function | 332 | 65% |
| predict_loop_output | 302 | 65% |
| predict_function_call | 302 | 65% |
| ts_to_faber_boolean | 331 | 59% |
| complex_ts_to_faber_guard_clause | 12 | 58% |
| ts_to_faber_function_string | 332 | 57% |
| complex_ts_to_faber_loop_in_loop | 14 | 43% |
| complex_ts_to_faber_nested_if | 14 | 29% |
| predict_boolean_and | 302 | 16% |
| predict_boolean_or | 301 | 13% |
| complex_ts_to_faber_higher_order | 12 | 0% |
| complex_ts_to_faber_gcd | 12 | 0% |
| complex_ts_to_faber_binary_search | 14 | 0% |
## Methodology
- Temperature: 0.0 (deterministic)
- Seed: 42 (reproducible)
- Dialects: Latin keywords
- Context levels: examples-only, grammar-only, minimal, basic, complete (a sample trial call is sketched below)
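As an illustration of how these settings might be applied per trial, the sketch below calls the OpenAI Node SDK directly; the harness's actual client wrapper, model routing, and prompt handling may differ:

```typescript
// Illustrative single-trial call; not the harness's real client code.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function runTrial(prompt: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    temperature: 0, // deterministic decoding
    seed: 42,       // best-effort reproducibility
  });
  return response.choices[0]?.message?.content ?? "";
}
```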
See faber-romanus for raw data and methodology details.