Pre-alpha: Published for early evaluation only

Thesis and Trials

Can LLMs learn Faber effectively? The faber-romanus project includes an evaluation harness to test this systematically.

Hypothesis

Faber's design choices - Latin vocabulary, regular morphology, consistent syntax - should make it easier for LLMs to learn from few examples compared to languages optimized for human ergonomics.

Evaluation

The trials test:

Trial Results

Metric Value
Framework version 1.1
Total evaluations 13,270
Models tested 15
Total cost $12.04
Total tokens 9.5M in / 563K out
Total time 18980.8s

Model Comparison: Cost vs Speed vs Accuracy

Model Accuracy Avg Latency Cost Tokens
gpt-4o 89% 829ms $1.94 707K
qwen3-coder 89% 1.4s $0.22 926K
gpt-3.5-turbo 89% 521ms $0.40 762K
gpt-5 89% 6.7s $4.37 584K
gpt-4o-mini 88% 869ms $0.10 630K
claude-3.5-sonnet 88% 1.8s $2.21 667K
llama-3.1-70b 86% 1.1s $0.21 609K
codestral 86% 541ms $0.24 737K
deepseek-v3.1 85% 2.0s $0.10 617K
claude-4.5-sonnet 77% 1.5s $1.74 518K
mercury-coder 73% 589ms $0.22 834K
llama-3.1-8b 73% 915ms $0.04 717K
claude-3-haiku 70% 970ms $0.22 769K
llama-3.2-1b 15% 486ms $0.03 778K
qwen2.5-coder-32b 0% 7.2s $0.02 253K

Three-Level Grading Breakdown

A = typechecks, B = runs without error, C = correct output.

Model Tests A (Typechecks) B (Runs) C (Correct)
gpt-4o 952 93% 93% 88%
qwen3-coder 1068 94% 94% 89%
gpt-3.5-turbo 1166 91% 91% 89%
gpt-5 672 93% 93% 89%
gpt-4o-mini 895 93% 93% 88%
claude-3.5-sonnet 840 95% 95% 88%
llama-3.1-70b 870 91% 91% 86%
codestral 964 93% 92% 86%
deepseek-v3.1 862 95% 94% 85%
claude-4.5-sonnet 672 93% 93% 77%
mercury-coder 840 76% 76% 73%
llama-3.1-8b 1063 90% 90% 73%
claude-3-haiku 946 92% 92% 70%
llama-3.2-1b 1138 43% 43% 15%
qwen2.5-coder-32b 282 29% 29% 0%

By Context Level

How much documentation context helps models learn Faber.

Context Tests Accuracy
examples-only 2681 61%
grammar-only 2662 82%
minimal 2985 77%
basic 2466 79%
complete 2476 79%

By N-shot (Learning Curve)

Effect of few-shot examples on accuracy.

Examples Tests Accuracy
0-shot 3343 70%
1-shot 3073 71%
3-shot 3914 80%
10-shot 2940 81%

Error Distribution

Where failures occur (among failed trials only).

Error Type Count % of Failures
type_error 1360 42%
wrong_output 1345 42%
no_response 312 10%
syntax_error 201 6%
runtime_error 14 0%

By Task

Task Tests Accuracy
faber_to_ts_functio_string 305 95%
faber_to_ts_arithmetic 303 94%
faber_to_ts_ex_pro 304 93%
faber_to_ts_si_true 305 92%
faber_to_ts_functio 305 92%
complex_ts_to_faber_factorial 12 92%
complex_ts_to_faber_fibonacci 12 92%
complex_ts_to_faber_multi_function 12 92%
complex_ts_to_faber_ternary_chain 12 92%
complex_ts_to_faber_string_ops 12 92%
complex_ts_to_faber_early_return 12 92%
complex_ts_to_faber_accumulator 12 92%
complex_ts_to_faber_prime_check 12 92%
faber_to_ts_fixum 304 91%
faber_to_ts_string 305 91%
faber_to_ts_si_false 305 91%
faber_to_ts_varia 305 90%
faber_to_ts_dum 303 89%
predict_const_value 303 87%
faber_to_ts_boolean 303 85%
ts_to_faber_const 333 84%
complex_ts_to_faber_if_in_loop 12 83%
complex_ts_to_faber_typed_params 12 83%
complex_ts_to_faber_find_max 12 83%
ts_to_faber_string 332 82%
ts_to_faber_arithmetic 331 82%
complete_const_keyword 302 81%
ts_to_faber_let 331 80%
complete_return_keyword 302 79%
complete_let_keyword 302 79%
ts_to_faber_if_false 332 78%
complete_function_keyword 301 78%
complete_while_keyword 301 78%
ts_to_faber_if_true 332 77%
predict_simple_output 303 77%
complete_print_keyword 300 77%
ts_to_faber_while 331 76%
predict_function_math 302 76%
predict_arithmetic_parens 302 75%
predict_loop_sum 302 75%
complex_ts_to_faber_fizzbuzz 12 75%
complete_else_keyword 301 74%
predict_conditional_true 303 73%
complete_loop_keyword 302 72%
predict_conditional_false 304 71%
ts_to_faber_for_of 332 67%
predict_arithmetic_precedence 303 67%
complex_ts_to_faber_array_type 12 67%
ts_to_faber_function 332 65%
predict_loop_output 302 65%
predict_function_call 302 65%
ts_to_faber_boolean 331 59%
complex_ts_to_faber_guard_clause 12 58%
ts_to_faber_function_string 332 57%
complex_ts_to_faber_loop_in_loop 14 43%
complex_ts_to_faber_nested_if 14 29%
predict_boolean_and 302 16%
predict_boolean_or 301 13%
complex_ts_to_faber_higher_order 12 0%
complex_ts_to_faber_gcd 12 0%
complex_ts_to_faber_binary_search 14 0%

Methodology

See faber-romanus for raw data and methodology details.