
Add spelling benchmark for character-level evaluation#42

Open
zherendong wants to merge 9 commits into main from zheren/spelling-benchmark-2

Conversation


zherendong commented Jan 22, 2026

Summary

Adds a synthetic benchmark to evaluate character-level understanding in language models, specifically designed to test the effectiveness of spelling bee embeddings.

Task Types

| Task | Format Example | Output |
|------|----------------|--------|
| Count | The number of times the letter A occurs in banana is | 3 |
| Index | Q: What is the third letter of the word 'banana'? A: | n |
| Reverse | cat reversed is | tac |
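The three formats above can be sketched as follows. This is an illustrative helper, not the PR's actual code; the function name and signature are assumptions.

```python
# Hypothetical sketch of the three prompt formats (not the PR's actual code).
ORDINALS = ["first", "second", "third", "fourth", "fifth",
            "sixth", "seventh", "eighth", "ninth", "tenth"]

def format_sample(task: str, word: str, letter: str = "", position: int = 0):
    """Return (prompt, target) for one benchmark sample."""
    if task == "count":
        # Aligned with the training format described in the changelog.
        prompt = f"The number of times the letter {letter.upper()} occurs in {word} is"
        target = str(word.lower().count(letter.lower()))
    elif task == "index":
        # Text ordinals (first, second, ...) rather than numeric (1st, 2nd).
        prompt = f"Q: What is the {ORDINALS[position - 1]} letter of the word '{word}'? A:"
        target = word[position - 1]
    elif task == "reverse":
        prompt = f"{word} reversed is"
        target = word[::-1]
    else:
        raise ValueError(f"unknown task: {task}")
    return prompt, target
```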

Results

Evaluated on 816M-parameter models (764M non-embedding):

| Task | Baseline | +Spelling Bee | Delta |
|------|----------|---------------|-------|
| Count | 12.6% | 27.4% | +14.8% |
| Index | 10.3% | 16.2% | +5.9% |
| Reverse | 0.0% | 0.0% | +0.0% |
| Overall | 11.2% | 21.4% | +10.2% |

Key findings:

  • Spelling bee embeddings nearly double overall accuracy on spelling tasks
  • Count task shows largest improvement (+14.8%), likely because counting requires aggregating across all character positions
  • Index task improves +5.9%, demonstrating position-encoded character access
  • Reverse task: 0% for both models (both output the forward spelling instead of the reversed word)

Implementation

Dataset (n=5000)

  • Distribution: 2,450 count / 2,450 index / 100 reverse
  • Word sources: 50/50 from Google 10k common words and 238k English word list
  • Filtering: 4-10 character words, palindromes excluded from reverse task
  • Tokenization metadata: num_tokens, is_single_token for stratified analysis
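The word sourcing and filtering described above can be sketched like this. Function and variable names are illustrative assumptions, not the actual contents of generate_data.py.

```python
# Hypothetical sketch of the dataset filtering rules (names are illustrative).
import random

def filter_words(words):
    """Keep 4-10 character alphabetic words, per the filtering rule above."""
    return [w for w in words if 4 <= len(w) <= 10 and w.isalpha()]

def reverse_candidates(words):
    """The reverse task excludes palindromes (forward == reversed)."""
    return [w for w in filter_words(words) if w != w[::-1]]

def mix_sources(common_words, full_wordlist, n, seed=42):
    """Draw 50/50 from the common-word list and the full English word list."""
    rng = random.Random(seed)
    half = n // 2
    return (rng.sample(filter_words(common_words), half)
            + rng.sample(filter_words(full_wordlist), n - half))
```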

Evaluation

  • Few-shot prompting with 3 examples per task type
  • Integration with lm-evaluation-harness via spelling_bee.yaml
  • Generation-based evaluation with exact match (case-insensitive)
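A minimal sketch of the scoring rule implied above: case-insensitive exact match, with the whitespace filter (mentioned in the changelog) handling leading spaces in model outputs. This is not lm-evaluation-harness's actual metric code.

```python
# Hypothetical sketch of case-insensitive exact match with whitespace stripping
# (not lm-evaluation-harness's actual implementation).
def exact_match(prediction: str, target: str) -> bool:
    """True if the generation matches the target, ignoring case and
    surrounding whitespace (handles leading spaces in model outputs)."""
    return prediction.strip().lower() == target.strip().lower()
```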

Files

| File | Description |
|------|-------------|
| spelling_benchmark/generate_data.py | Dataset generation with task synthesis |
| spelling_benchmark/spelling_bee.yaml | lm-eval harness task configuration |
| spelling_benchmark/spelling_bee.jsonl | Generated benchmark (5000 samples) |
| spelling_benchmark/analyze_results.py | Post-evaluation analysis with stratified metrics |
| spelling_benchmark/README.md | Documentation |
| eval_main.py | Added --max_samples_log parameter for full sample logging |

Usage

```python
from spelling_benchmark.generate_data import generate_dataset

generate_dataset(
    output_path='spelling_benchmark/spelling_bee.jsonl',
    num_samples=5000,
    seed=42,
    reverse_samples=100,
)
```

```shell
python eval_main.py \
  --checkpoint_paths checkpoints/baseline.pt checkpoints/spelling_bee.pt \
  --tasks spelling_bee \
  --max_samples_log 0
```
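The stratified reporting in analyze_results.py (accuracy by task type and token count) might look roughly like this. The record field names ("task", "is_single_token", "correct") are assumptions for illustration.

```python
# Hypothetical sketch of stratified accuracy reporting (field names assumed).
from collections import defaultdict

def stratified_accuracy(samples):
    """Accuracy per (task, single-token?) bucket from per-sample records."""
    buckets = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for s in samples:
        key = (s["task"], s["is_single_token"])
        buckets[key][0] += int(s["correct"])
        buckets[key][1] += 1
    return {k: correct / total for k, (correct, total) in buckets.items()}
```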

Test plan

  • Unit tests pass (pytest spelling_benchmark/spelling_benchmark_test.py)
  • Evaluation runs successfully on baseline and spelling bee models
  • Results are reproducible (exact match on repeated runs with same seed)

Commit messages

- Implement spelling benchmark with 3 task types: count, index, reverse
- Add generate_data.py for synthetic dataset creation
- Add spelling_bee.yaml for lm-eval harness integration
- Add eval_main.py improvements for batch evaluation

Results on 5000 samples (2450 count, 2450 index, 100 reverse):
- Baseline: 11.2%, Spelling Bee: 21.4%, Delta: +10.2%
- Count: +14.8%, Index: +5.9%, Reverse: 0% (both models)

- Add --max_samples_log CLI arg to eval_main.py to control sample truncation
  (default: 100, use 0 for unlimited)
- Add analyze_results.py for stratified accuracy reporting by token count
  and task type (count/index/reverse)
- Update README with analysis workflow documentation

- Add tqdm progress bar to generate_until in lm_eval_wrapper.py
- Add TaskManager with include_path for custom yaml tasks in eval_main.py
- Fix spelling_bee task name to work with include_path

- Add --compare flag to analyze_results.py for comparing two models
- Generate comparison table with baseline, comparison, and delta columns
- Add KEY INSIGHTS section highlighting single-token vs multi-token gains
- Update README with comparison workflow

- Add tqdm progress bar to loglikelihood method
- Track samples processed (updates after each batch)
- Helps estimate eval completion time for tasks like hellaswag, arc

- Use text ordinals (first, second) instead of numeric (1st, 2nd) for index task
- Align count task format with training: "The number of times the letter X occurs in word is "
- Add whitespace filter to handle leading spaces in model outputs
- Disable request caching to avoid stale results
- Regenerate dataset with updated formats

Results: Spelling bee model shows +7pp improvement (8% -> 15%)
- Count task: +10.3% (17.2% -> 27.6%)
- Index task: +10.3% (7.7% -> 17.9%)
@zherendong zherendong requested a review from MarkusRabe January 22, 2026 06:47
