
Add spelling benchmark for character-level evaluation#42

Open
zherendong wants to merge 9 commits into main from zheren/spelling-benchmark-2

Conversation


zherendong commented Jan 22, 2026

Summary

Adds a synthetic benchmark to evaluate character-level understanding in language models, specifically designed to test the effectiveness of spelling bee embeddings.

Task Types

| Task | Format Example | Output |
|------|----------------|--------|
| Count | The number of times the letter A occurs in banana is | 3 |
| Index | Q: What is the third letter of the word 'banana'? A: | n |
| Reverse | cat reversed is | tac |
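The three formats above can be sketched as follows. This is an illustrative helper, not the PR's actual code; the function name and signature are assumptions.

```python
# Hypothetical sketch of the three prompt formats (not the PR's actual code).
ORDINALS = ["first", "second", "third", "fourth", "fifth",
            "sixth", "seventh", "eighth", "ninth", "tenth"]

def format_sample(task: str, word: str, letter: str = "", position: int = 0):
    """Return (prompt, target) for one benchmark sample."""
    if task == "count":
        # Aligned with the training format described in the changelog.
        prompt = f"The number of times the letter {letter.upper()} occurs in {word} is"
        target = str(word.lower().count(letter.lower()))
    elif task == "index":
        # Text ordinals (first, second, ...) rather than numeric (1st, 2nd).
        prompt = f"Q: What is the {ORDINALS[position - 1]} letter of the word '{word}'? A:"
        target = word[position - 1]
    elif task == "reverse":
        prompt = f"{word} reversed is"
        target = word[::-1]
    else:
        raise ValueError(f"unknown task: {task}")
    return prompt, target
```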

Results

Evaluated on 816M-parameter models (764M non-embedding):

| Task | Baseline | +Spelling Bee | Delta |
|------|----------|---------------|-------|
| Count | 12.6% | 27.4% | +14.8% |
| Index | 10.3% | 16.2% | +5.9% |
| Reverse | 0.0% | 0.0% | +0.0% |
| Overall | 11.2% | 21.4% | +10.2% |

Key findings:

  • Spelling bee embeddings nearly double overall accuracy on spelling tasks
  • Count task shows largest improvement (+14.8%), likely because counting requires aggregating across all character positions
  • Index task improves +5.9%, demonstrating position-encoded character access
  • Reverse task: 0% for both models (both output the forward spelling instead of the reversed word)

Implementation

Dataset (n=5000)

  • Distribution: 2,450 count / 2,450 index / 100 reverse
  • Word sources: 50/50 from Google 10k common words and 238k English word list
  • Filtering: 4-10 character words, palindromes excluded from reverse task
  • Tokenization metadata: num_tokens, is_single_token for stratified analysis
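The word sourcing and filtering described above can be sketched like this. Function and variable names are illustrative assumptions, not the actual contents of generate_data.py.

```python
# Hypothetical sketch of the dataset filtering rules (names are illustrative).
import random

def filter_words(words):
    """Keep 4-10 character alphabetic words, per the filtering rule above."""
    return [w for w in words if 4 <= len(w) <= 10 and w.isalpha()]

def reverse_candidates(words):
    """The reverse task excludes palindromes (forward == reversed)."""
    return [w for w in filter_words(words) if w != w[::-1]]

def mix_sources(common_words, full_wordlist, n, seed=42):
    """Draw 50/50 from the common-word list and the full English word list."""
    rng = random.Random(seed)
    half = n // 2
    return (rng.sample(filter_words(common_words), half)
            + rng.sample(filter_words(full_wordlist), n - half))
```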

Evaluation

  • Few-shot prompting with 3 examples per task type
  • Integration with lm-evaluation-harness via spelling_bee.yaml
  • Generation-based evaluation with exact match (case-insensitive)
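A minimal sketch of the scoring rule implied above: case-insensitive exact match, with the whitespace filter (mentioned in the changelog) handling leading spaces in model outputs. This is not lm-evaluation-harness's actual metric code.

```python
# Hypothetical sketch of case-insensitive exact match with whitespace stripping
# (not lm-evaluation-harness's actual implementation).
def exact_match(prediction: str, target: str) -> bool:
    """True if the generation matches the target, ignoring case and
    surrounding whitespace (handles leading spaces in model outputs)."""
    return prediction.strip().lower() == target.strip().lower()
```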

Files

| File | Description |
|------|-------------|
| spelling_benchmark/generate_data.py | Dataset generation with task synthesis |
| spelling_benchmark/spelling_bee.yaml | lm-eval harness task configuration |
| spelling_benchmark/spelling_bee.jsonl | Generated benchmark (5000 samples) |
| spelling_benchmark/analyze_results.py | Post-evaluation analysis with stratified metrics |
| spelling_benchmark/README.md | Documentation |
| eval_main.py | Added --max_samples_log parameter for full sample logging |

Usage

```python
from spelling_benchmark.generate_data import generate_dataset

generate_dataset(
    output_path='spelling_benchmark/spelling_bee.jsonl',
    num_samples=5000,
    seed=42,
    reverse_samples=100,
)
```

```shell
python eval_main.py \
  --checkpoint_paths checkpoints/baseline.pt checkpoints/spelling_bee.pt \
  --tasks spelling_bee \
  --max_samples_log 0
```
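The stratified reporting in analyze_results.py (accuracy by task type and token count) might look roughly like this. The record field names ("task", "is_single_token", "correct") are assumptions for illustration.

```python
# Hypothetical sketch of stratified accuracy reporting (field names assumed).
from collections import defaultdict

def stratified_accuracy(samples):
    """Accuracy per (task, single-token?) bucket from per-sample records."""
    buckets = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for s in samples:
        key = (s["task"], s["is_single_token"])
        buckets[key][0] += int(s["correct"])
        buckets[key][1] += 1
    return {k: correct / total for k, (correct, total) in buckets.items()}
```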

Test plan

  • Unit tests pass (pytest spelling_benchmark/spelling_benchmark_test.py)
  • Evaluation runs successfully on baseline and spelling bee models
  • Results are reproducible (exact match on repeated runs with same seed)

Commit messages

- Implement spelling benchmark with 3 task types: count, index, reverse
- Add generate_data.py for synthetic dataset creation
- Add spelling_bee.yaml for lm-eval harness integration
- Add eval_main.py improvements for batch evaluation

Results on 5000 samples (2450 count, 2450 index, 100 reverse):
- Baseline: 11.2%, Spelling Bee: 21.4%, Delta: +10.2%
- Count: +14.8%, Index: +5.9%, Reverse: 0% (both models)

- Add --max_samples_log CLI arg to eval_main.py to control sample truncation
  (default: 100, use 0 for unlimited)
- Add analyze_results.py for stratified accuracy reporting by token count
  and task type (count/index/reverse)
- Update README with analysis workflow documentation

- Add tqdm progress bar to generate_until in lm_eval_wrapper.py
- Add TaskManager with include_path for custom yaml tasks in eval_main.py
- Fix spelling_bee task name to work with include_path

- Add --compare flag to analyze_results.py for comparing two models
- Generate comparison table with baseline, comparison, and delta columns
- Add KEY INSIGHTS section highlighting single-token vs multi-token gains
- Update README with comparison workflow

- Add tqdm progress bar to loglikelihood method
- Track samples processed (updates after each batch)
- Helps estimate eval completion time for tasks like hellaswag, arc

- Use text ordinals (first, second) instead of numeric (1st, 2nd) for index task
- Align count task format with training: "The number of times the letter X occurs in word is "
- Add whitespace filter to handle leading spaces in model outputs
- Disable request caching to avoid stale results
- Regenerate dataset with updated formats

Results: Spelling bee model shows +7pp improvement (8% -> 15%)
- Count task: +10.3% (17.2% -> 27.6%)
- Index task: +10.3% (7.7% -> 17.9%)
@zherendong zherendong requested a review from MarkusRabe January 22, 2026 06:47
