Add spelling benchmark for character-level evaluation #42
Open
zherendong wants to merge 9 commits into main from
Conversation
- Add --max_samples_log CLI arg to eval_main.py to control sample truncation (default: 100, use 0 for unlimited)
- Add analyze_results.py for stratified accuracy reporting by token count and task type (count/index/reverse)
- Update README with analysis workflow documentation
…elling_benchmark a package
- Add tqdm progress bar to generate_until in lm_eval_wrapper.py
- Add TaskManager with include_path for custom yaml tasks in eval_main.py
- Fix spelling_bee task name to work with include_path
- Add --compare flag to analyze_results.py for comparing two models
- Generate comparison table with baseline, comparison, and delta columns
- Add KEY INSIGHTS section highlighting single-token vs multi-token gains
- Update README with comparison workflow
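The baseline/comparison/delta table described in this commit can be sketched in a few lines. This is a hypothetical illustration, not the PR's actual analyze_results.py; the function name is invented, accuracies are in percent, and deltas are recomputed from the rounded per-task numbers quoted elsewhere in this thread (so they may differ by 0.1pp from the reported figures):

```python
# Illustrative sketch of a comparison table with baseline, comparison,
# and delta columns. Names are hypothetical, not the PR's code.
def comparison_table(baseline, comparison):
    """baseline/comparison map task name -> accuracy in percent."""
    rows = []
    for task in sorted(baseline):
        b, c = baseline[task], comparison[task]
        rows.append((task, b, c, round(c - b, 1)))  # delta in percentage points
    return rows

# Per-task accuracies taken from the results quoted in this PR
table = comparison_table(
    {"count": 17.2, "index": 7.7, "reverse": 0.0},
    {"count": 27.6, "index": 17.9, "reverse": 0.0},
)
```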
- Add tqdm progress bar to loglikelihood method
- Tracks samples processed (updates after each batch)
- Helps estimate eval completion time for tasks like hellaswag, arc
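A minimal sketch of batch-level progress tracking in the style this commit describes. It is illustrative only; `score_batch` and the function shape are assumptions, not the PR's lm_eval_wrapper.py code:

```python
# Sketch: wrap a batched scoring loop in a tqdm bar that advances
# by the number of samples processed after each batch.
from tqdm import tqdm

def loglikelihood(requests, batch_size=8, score_batch=None):
    # score_batch is a stand-in for the model's batch scoring function
    results = []
    with tqdm(total=len(requests), desc="loglikelihood") as pbar:
        for i in range(0, len(requests), batch_size):
            batch = requests[i:i + batch_size]
            results.extend(score_batch(batch))
            pbar.update(len(batch))  # update after each batch, not each sample
    return results
```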
- Use text ordinals (first, second) instead of numeric (1st, 2nd) for index task
- Align count task format with training: "The number of times the letter X occurs in word is "
- Add whitespace filter to handle leading spaces in model outputs
- Disable request caching to avoid stale results
- Regenerate dataset with updated formats

Results: Spelling bee model shows +7pp improvement (8% -> 15%)
- Count task: +10.3% (17.2% -> 27.6%)
- Index task: +10.3% (7.7% -> 17.9%)
- Implement spelling benchmark with 3 task types: count, index, reverse
- Add generate_data.py for synthetic dataset creation
- Add spelling_bee.yaml for lm-eval harness integration
- Add eval_main.py improvements for batch evaluation

Results on 5000 samples (2450 count, 2450 index, 100 reverse):
- Baseline: 11.2%, Spelling Bee: 21.4%, Delta: +10.2%
- Count: +14.8%, Index: +5.9%, Reverse: 0% (both models)
Summary
Adds a synthetic benchmark to evaluate character-level understanding in language models, specifically designed to test the effectiveness of spelling bee embeddings.
Task Types
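The three prompt formats in this section could be produced by helpers along these lines. This is a hypothetical sketch, not the PR's generate_data.py; `ORDINALS` and the function names are invented for illustration:

```python
# Sketch of the three task formats: each helper returns (prompt, answer).
ORDINALS = ["first", "second", "third", "fourth", "fifth"]

def make_count(word, letter):
    # Count task: how many times a letter occurs (case-insensitive)
    prompt = f"The number of times the letter {letter} occurs in {word} is"
    return prompt, str(word.count(letter.lower()))

def make_index(word, position):
    # Index task: text ordinals, position is 0-based
    prompt = f"Q: What is the {ORDINALS[position]} letter of the word '{word}'? A:"
    return prompt, word[position]

def make_reverse(word):
    # Reverse task: spell the word backwards
    return f"{word} reversed is", word[::-1]
```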
- Count: "The number of times the letter A occurs in banana is" → 3
- Index: "Q: What is the third letter of the word 'banana'? A:" → n
- Reverse: "cat reversed is" → tac

Results
Evaluated on 816M-parameter models (764M non-embedding):
Key findings:
Implementation
Dataset (n=5000)
Includes num_tokens, is_single_token fields for stratified analysis

Evaluation
Uses lm-evaluation-harness via spelling_bee.yaml

Files
- spelling_benchmark/generate_data.py
- spelling_benchmark/spelling_bee.yaml
- spelling_benchmark/spelling_bee.jsonl
- spelling_benchmark/analyze_results.py
- spelling_benchmark/README.md
- eval_main.py: --max_samples_log parameter for full sample logging

Usage
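A plausible usage flow, assuming the scripts listed in this PR are invoked directly. Only --max_samples_log and --compare are confirmed by this thread; the other flags, arguments, and result file names are assumptions:

```shell
# Generate the synthetic dataset (output location is an assumption)
python spelling_benchmark/generate_data.py

# Run the benchmark; --max_samples_log 0 logs all samples
python eval_main.py --tasks spelling_bee --max_samples_log 0

# Stratified accuracy report; --compare takes a second model's results
# (file names here are hypothetical)
python spelling_benchmark/analyze_results.py results_baseline.json --compare results_spelling_bee.json
```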
Test plan
- pytest spelling_benchmark/spelling_benchmark_test.py