Conversation
Co-authored-by: Natalia Groza <natalia.groza@intel.com>
Pull request overview
Updates the continuous batching accuracy demo documentation to reflect newer Llama 3.1 model IDs and to standardize the command examples for Linux shell usage.
Changes:
- Switches Meta-Llama model references from Llama 3 to Llama 3.1 in export and `lm-eval` examples.
- Standardizes code fences to `bash` and adds a note that steps were verified on Linux.
```diff
 lm-eval --model local-chat-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --apply_chat_template --limit 100
-local-chat-completions ({'model': 'meta-llama/Meta-Llama-3-8B-Instruct', 'base_url': 'http://localhost:8000/v3/chat/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 100.0, num_fewshot: None, batch_size: 1
+local-chat-completions ({'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'base_url': 'http://localhost:8000/v3/chat/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 100.0, num_fewshot: None, batch_size: 1
```
The example output shows num_concurrent: 10 while the command above sets num_concurrent=1. This is confusing for readers trying to reproduce the run; please make the command and the captured output consistent (either update the command args or regenerate/update the output snippet).
````diff
-```console
-lm-eval --model local-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path results/ --seed 1 --limit 100
+```bash
+lm-eval --model local-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path results/ --seed 1 --limit 100
````
The example output shows num_concurrent: 10 while the command above sets num_concurrent=1. Please align the command and the captured output snippet so the documentation is reproducible.
Suggested change:
```diff
-lm-eval --model local-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path results/ --seed 1 --limit 100
+lm-eval --model local-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=10,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path results/ --seed 1 --limit 100
```
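A common stumbling point in these commands is that `--model_args` is a single comma-separated string of `key=value` pairs, so `num_concurrent` has to be edited inside that string rather than passed as a separate flag. A minimal sketch of how such a string decomposes (`parse_model_args` is a hypothetical helper for illustration, not lm-eval's actual parser):

```python
def parse_model_args(model_args: str) -> dict:
    """Split a comma-separated key=value string into a dict.

    Hypothetical helper for illustration only; lm-eval parses its
    --model_args option internally in a similar key=value fashion.
    """
    return dict(pair.split("=", 1) for pair in model_args.split(","))

args = parse_model_args(
    "model=meta-llama/Meta-Llama-3.1-8B,"
    "base_url=http://localhost:8000/v3/completions,"
    "num_concurrent=10,max_retries=3,tokenized_requests=False"
)
print(args["num_concurrent"])  # -> 10
```

Note that `split("=", 1)` splits only on the first `=`, which keeps values such as the `base_url` URL intact.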
```diff
 The [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework provides a convenient method of evaluating the quality of the model exposed over OpenAI API.
 It reports end to end quality of served model from the client application point of view.

+**Note**: Below steps have been verified on Linux
```
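For readers unfamiliar with what "exposed over OpenAI API" means in practice, here is a minimal sketch of the kind of completion request the harness issues against the served endpoint (the payload shape follows the OpenAI-style completions API; the URL is the one used in the commands above, and `build_completion_request` is an illustrative helper, not part of lm-eval):

```python
import json

# Endpoint used throughout the docs under review.
BASE_URL = "http://localhost:8000/v3/completions"

def build_completion_request(model: str, prompt: str, max_tokens: int = 16) -> str:
    """Build a JSON body for an OpenAI-style /completions call.

    Illustrative only: lm-eval constructs these requests internally;
    this just shows the payload shape a served model must accept.
    """
    return json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})

body = build_completion_request("meta-llama/Meta-Llama-3.1-8B", "The capital of France is")
```

Because the harness acts purely as an HTTP client, the accuracy it reports includes the whole serving stack (tokenization, scheduling, generation), not just the bare model.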
Minor grammar/punctuation: consider adding a period and/or clarifying the scope (e.g., "verified on Linux only") so readers on other OSes know what to expect.
Suggested change:
```diff
-**Note**: Below steps have been verified on Linux
+**Note:** The following steps have been verified on Linux only.
```