66 changes: 10 additions & 56 deletions demos/common/export_models/README.md
@@ -40,62 +40,16 @@ For every use case subcommand there is adjusted list of parameters:
```console
python export_model.py text_generation --help
```
Expected Output:
```console
usage: export_model.py text_generation [-h] [--model_repository_path MODEL_REPOSITORY_PATH] --source_model SOURCE_MODEL [--model_name MODEL_NAME] [--weight-format PRECISION] [--config_file_path CONFIG_FILE_PATH] [--overwrite_models] [--target_device TARGET_DEVICE] [--ov_cache_dir OV_CACHE_DIR]
[--extra_quantization_params EXTRA_QUANTIZATION_PARAMS] [--pipeline_type {LM,LM_CB,VLM,VLM_CB,AUTO}] [--kv_cache_precision {u8}] [--enable_prefix_caching ENABLE_PREFIX_CACHING] [--disable_dynamic_split_fuse] [--max_num_batched_tokens MAX_NUM_BATCHED_TOKENS] [--max_num_seqs MAX_NUM_SEQS]
[--cache_size CACHE_SIZE] [--draft_source_model DRAFT_SOURCE_MODEL] [--draft_model_name DRAFT_MODEL_NAME] [--draft_eagle3_mode] [--max_prompt_len MAX_PROMPT_LEN] [--prompt_lookup_decoding] [--reasoning_parser {qwen3,gptoss}]
[--tool_parser {llama3,phi4,hermes3,mistral,qwen3coder,gptoss,devstral,lfm2}] [--enable_tool_guided_generation]

options:
-h, --help show this help message and exit
--model_repository_path MODEL_REPOSITORY_PATH
Where the model should be exported to
--source_model SOURCE_MODEL
HF model name or path to the local folder with PyTorch or OpenVINO model
--model_name MODEL_NAME
Model name that should be used in the deployment. Equal to source_model if HF model name is used
--weight-format PRECISION
precision of the exported model
--config_file_path CONFIG_FILE_PATH
path to the config file
--overwrite_models Overwrite the model if it already exists in the models repository
--target_device TARGET_DEVICE
CPU, GPU, NPU or HETERO, default is CPU
--ov_cache_dir OV_CACHE_DIR
Folder path for compilation cache to speedup initialization time
--extra_quantization_params EXTRA_QUANTIZATION_PARAMS
Add advanced quantization parameters. Check optimum-intel documentation. Example: "--sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2"
--pipeline_type {LM,LM_CB,VLM,VLM_CB,AUTO}
Type of the pipeline to be used. AUTO is used by default
--kv_cache_precision {u8}
u8 or empty (model default). Reduced kv cache precision to u8 lowers the cache size consumption.
--enable_prefix_caching ENABLE_PREFIX_CACHING
This algorithm is used to cache the prompt tokens. Default is True.
--disable_dynamic_split_fuse
The maximum number of tokens that can be batched together.
--max_num_batched_tokens MAX_NUM_BATCHED_TOKENS
empty or integer. The maximum number of tokens that can be batched together.
--max_num_seqs MAX_NUM_SEQS
256 by default. The maximum number of sequences that can be processed together.
--cache_size CACHE_SIZE
KV cache size in GB. If not set, cache is allocated dynamically.
--draft_source_model DRAFT_SOURCE_MODEL
HF model name or path to the local folder with PyTorch or OpenVINO draft model. Using this option will create configuration for speculative decoding
--draft_model_name DRAFT_MODEL_NAME
Draft model name that should be used in the deployment. Equal to draft_source_model if HF model name is used. Available only in draft_source_model has been specified.
--draft_eagle3_mode Set this flag if you use EAGLE3 draft model for speculative decoding
--max_prompt_len MAX_PROMPT_LEN
Sets NPU specific property for maximum number of tokens in the prompt. Not effective if target device is not NPU
--prompt_lookup_decoding
Set pipeline to use prompt lookup decoding
--reasoning_parser {qwen3,gptoss}
Set the type of the reasoning parser for reasoning content extraction
--tool_parser {llama3,phi4,hermes3,mistral,qwen3coder,gptoss,devstral,lfm2}
Set the type of the tool parser for tool calls extraction
--enable_tool_guided_generation
Enables enforcing tool schema during generation. Requires setting tool_parser
```
> Note: Exporting some models might require a different transformers version than the one specified in requirements.txt. Check [supported models](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/). If a custom transformers version is required, install it afterwards via `pip install transformers==<version>`.
Some of the exceptions include:
- Alibaba-NLP/gte-large-en-v1.5 - `transformers<5.0`
- OpenGVLab/InternVL - `transformers<5.0`
- Qwen3-80B-Next and Qwen3-coder-next - `transformers<5.0`
- gemma4 - `transformers==5.5`
- Qwen3.5 and Qwen3.6 - `transformers==5.2`
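
The exceptions listed above can be kept in a small lookup so that the right `pip install` command is printed before an export. This is only a sketch: the helper name `pip_command_for` and the naive substring matching are assumptions for illustration and are not part of `export_model.py`. The pins are copied from the list above.

```python
from typing import Optional

# Pins copied from the exception list above. The keys are matched as
# case-insensitive substrings, which is an assumption made for this sketch.
TRANSFORMERS_PINS = {
    "Alibaba-NLP/gte-large-en-v1.5": "transformers<5.0",
    "OpenGVLab/InternVL": "transformers<5.0",
    "Qwen3-80B-Next": "transformers<5.0",
    "Qwen3-coder-next": "transformers<5.0",
    "gemma4": "transformers==5.5",
    "Qwen3.5": "transformers==5.2",
    "Qwen3.6": "transformers==5.2",
}


def pip_command_for(source_model: str) -> Optional[str]:
    """Return the pip command for a pinned model, or None if no pin is listed."""
    lowered = source_model.lower()
    for key, spec in TRANSFORMERS_PINS.items():
        if key.lower() in lowered:
            # Quote the spec so that the shell does not treat '<' as a redirect.
            return f'pip install "{spec}"'
    return None


if __name__ == "__main__":
    print(pip_command_for("Alibaba-NLP/gte-large-en-v1.5"))
```

Real Hugging Face repository names may not contain these keys verbatim, so the matching would need adjusting for actual use.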



## Model Export Examples

@@ -152,7 +106,7 @@ python export_model.py embeddings_ov --source_model Qwen/Qwen3-Embedding-0.6B --
#### Embeddings with `sentence_transformers` library
Some embedding models require special handling during export. For example:
```console
python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --extra_quantization_params "--library sentence_transformers" --weight-format fp16 --config_file_path models/config_all.json
python export_model.py embeddings_ov --source_model nomic-ai/nomic-embed-text-v1.5 --extra_quantization_params "--library sentence_transformers" --pooling MEAN --weight-format fp16 --config_file_path models/config_all.json
```
Known models that require it:
- Alibaba-NLP/gte-large-en-v1.5
1 change: 1 addition & 0 deletions demos/common/export_models/requirements.txt
@@ -14,3 +14,4 @@ sentence_transformers==5.3.0
sentencepiece # Required by: transformers`
torchvision
requests
einops
Collaborator

Alibaba model still wasn't exported:
python3 export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --extra_quantization_params "--library sentence_transformers" --weight-format fp16 --config_file_path models/config_all.json

RuntimeError: Couldn't get TorchScript module by tracing.
Exception:
index 2314885530818453536 is out of bounds for dimension 0 with size 16
Please check correctness of provided 'example_input'. Sometimes models can be converted in scripted mode, please try running conversion without 'example_input'.
You can also provide TorchScript module that you obtained yourself, please refer to PyTorch documentation: https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html.
Traceback (most recent call last):
File "/opt/home/k8sworker/ngroza/test/model_server/demos/common/export_models/export_model.py", line 687, in
export_embeddings_model_ov(args['model_repository_path'], args['source_model'], args['model_name'], args['precision'], template_parameters, args['config_file_path'], args['truncate'])
File "/opt/home/k8sworker/ngroza/test/model_server/demos/common/export_models/export_model.py", line 520, in export_embeddings_model_ov
raise ValueError("Failed to export embeddings model", source_model)
ValueError: ('Failed to export embeddings model', 'Alibaba-NLP/gte-large-en-v1.5')

Collaborator Author

That is one of the models that require `transformers<5`.

45 changes: 14 additions & 31 deletions demos/continuous_batching/accuracy/README.md
@@ -10,37 +10,21 @@ It reports end to end quality of served model from the client application point
## Preparing the lm-evaluation-harness framework

Install the framework via pip:
```bash
```text
pip3 install --extra-index-url "https://download.pytorch.org/whl/cpu" lm_eval[api] langdetect immutabledict dotenv openai
```

## Exporting the models
```bash
git clone https://github.com/openvinotoolkit/model_server.git
cd model_server
pip3 install -U -r demos/common/export_models/requirements.txt
mkdir models
python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3.1-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3.1-8B --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
python demos/common/export_models/export_model.py text_generation --source_model OpenGVLab/InternVL2_5-8B --weight-format fp16 --config_file_path models/config.json --model_repository_path models
python demos/common/export_models/export_model.py text_generation --source_model Qwen/Qwen3-8B --model_name openvino-qwen3-8b-int8 --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser hermes3 --overwrite_models
```

## Starting the model server

### With Docker
```bash
docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json
```

### On Baremetal
```bash
ovms --rest_port 8000 --config_path ./models/config.json
```
Examples of LLM and VLM model deployment are documented in other demos, such as:
- [Agentic usage for LLM models](../agentic_ai/README.md)
- [Using VLM models](../vlm/README.md)


## Running the tests for LLM models

```bash
```text
lm-eval --model local-chat-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --apply_chat_template --limit 100

local-chat-completions ({'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'base_url': 'http://localhost:8000/v3/chat/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 100.0, num_fewshot: None, batch_size: 1
@@ -52,7 +36,7 @@ local-chat-completions ({'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'base

When testing a non-chat model with the `completions` endpoint, the command looks like this:

```bash
```text
lm-eval --model local-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path results/ --seed 1 --limit 100

local-completions ({'model': 'meta-llama/Meta-Llama-3.1-8B', 'base_url': 'http://localhost:8000/v3/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 100.0, num_fewshot: None, batch_size: 1
@@ -64,28 +48,27 @@ local-completions ({'model': 'meta-llama/Meta-Llama-3.1-8B', 'base_url': 'http:/

Other examples are below:

```bash
```text
lm-eval --model local-chat-completions --tasks leaderboard_ifeval --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --limit 100 --apply_chat_template
```

```bash
```text
lm-eval --model local-completions --tasks wikitext --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=10,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --limit 100
```
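
The `--model_args` value in the commands above is a comma-separated list of `key=value` pairs. A minimal Python sketch of assembling it, and the full command, is given here. The function names `build_model_args` and `build_lm_eval_command` are hypothetical helpers, not part of lm-eval or the model server; only flags that appear in the commands above are used.

```python
import shlex


def build_model_args(model, base_url, num_concurrent=1, max_retries=3):
    """Join key=value pairs with commas, in the order used in the commands above."""
    pairs = {
        "model": model,
        "base_url": base_url,
        "num_concurrent": num_concurrent,
        "max_retries": max_retries,
        "tokenized_requests": False,
    }
    return ",".join(f"{key}={value}" for key, value in pairs.items())


def build_lm_eval_command(backend, task, model_args, output_path="test/", chat=False):
    """Return the lm-eval invocation as an argument list (hypothetical helper)."""
    command = [
        "lm-eval", "--model", backend, "--tasks", task,
        "--model_args", model_args,
        "--verbosity", "DEBUG", "--log_samples",
        "--output_path", output_path, "--seed", "1", "--limit", "100",
    ]
    if chat:
        # The chat completions examples above also pass this flag.
        command.append("--apply_chat_template")
    return command


if __name__ == "__main__":
    args = build_model_args(
        "meta-llama/Meta-Llama-3.1-8B",
        "http://localhost:8000/v3/completions",
        num_concurrent=10,
    )
    print(shlex.join(build_lm_eval_command("local-completions", "wikitext", args)))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.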

## Running the tests for VLM models

Use the [lmms-eval project](https://github.com/EvolvingLMMs-Lab/lmms-eval) with the `mme` and `mmmu_val` tasks.

```bash
```text
export OPENAI_BASE_URL=http://localhost:8000/v3
export OPENAI_API_KEY="unused"
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
git checkout 88b23e2bfa16a1edbc16e9e238ed82130b3a4f56
pip install -e . --extra-index-url "https://download.pytorch.org/whl/cpu"
python -m lmms_eval \
--model openai_compatible \
--model_args model_version=OpenGVLab/InternVL2_5-8B,max_retries=1 \
--model_args model_version=OpenVINO/InternVL2-8B_int4-ov,max_retries=1 \
--tasks mme,mmmu_val \
--batch_size 1 \
--log_samples \
@@ -121,7 +104,7 @@ pip install -e . --extra-index-url "https://download.pytorch.org/whl/cpu"
The commands below assume the model is deployed with the name `ovms-model`. It must match the name set in `bfcl_eval/constants/model_config.py`.
```text
export OPENAI_BASE_URL=http://localhost:8000/v3
export CHAT_TEMPLATE_KWARGS='{"enable_thinking":false, "reasoning_effort":"low"}'
export CHAT_TEMPLATE_KWARGS='{"enable_thinking":false, "reasoning_effort":"low", "preserve_reasoning":false}'

bfcl generate --model ovms-model --test-category simple_python,multiple --temperature 0.0 --num-threads 100 -o --result-dir model_name_dir
bfcl evaluate --model ovms-model --result-dir model_name_dir
@@ -130,15 +113,15 @@ bfcl evaluate --model ovms-model --result-dir model_name_dir
Alternatively, use the model name `ovms-model-stream` to run the tests with stream requests. The results should be the same.
```text
export OPENAI_BASE_URL=http://localhost:8000/v3
bfcl generate --model ovms-model-stream --test-category simple_python,multiple --temperature 0.0 --num-threads 100 -o --result-dir model_name_dir
bfcl generate --model ovms-model-stream --test-category simple_python,multiple,multi_turn_base --temperature 0.0 --num-threads 10 -o --result-dir model_name_dir
bfcl evaluate --model ovms-model-stream --result-dir model_name_dir
```

**Analyzing results**
The output artifacts will be stored in the `result` and `score` directories. For example:

```text
cat score/openvino-qwen3-8b-int4-FC/BFCL_v3_simple_python_score.json | head -1
cat score/openvino-qwen3-8b-int4-FC/BFCL_v4_simple_python_score.json | head -1
{"accuracy": 0.95, "correct_count": 380, "total_count": 400}
```
Those results can be compared with the reference from the [Berkeley leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html#leaderboard).
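
The summary line shown above can also be read and sanity-checked programmatically. The sketch below assumes that the first line of the score file is a JSON object with the `accuracy`, `correct_count` and `total_count` keys, as in the sample output. The function names are illustrative and not part of BFCL.

```python
import json


def read_summary(path):
    """Parse the first line of a score file (the line printed by `head -1`)."""
    with open(path, encoding="utf-8") as handle:
        return json.loads(handle.readline())


def check_accuracy(summary, tolerance=1e-9):
    """Recompute accuracy from the counts and compare it with the reported value."""
    recomputed = summary["correct_count"] / summary["total_count"]
    return recomputed, abs(recomputed - summary["accuracy"]) < tolerance


if __name__ == "__main__":
    # Values taken from the sample output above.
    sample = {"accuracy": 0.95, "correct_count": 380, "total_count": 400}
    print(check_accuracy(sample))
```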
6 changes: 4 additions & 2 deletions demos/embeddings/README.md
@@ -154,7 +154,8 @@ python export_model.py embeddings_ov --source_model intfloat/multilingual-e5-lar
:::
:::{tab-item} Alibaba-NLP/gte-large-en-v1.5
:sync: gte-large-en-v1.5
```console
```text
pip install "transformers<5" # WA for optimum-intel and model support for new transformers
python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --pooling CLS --extra_quantization_params "--library sentence_transformers"
```
:::
@@ -225,7 +226,8 @@ python export_model.py embeddings_ov --source_model intfloat/multilingual-e5-lar
:::
:::{tab-item} Alibaba-NLP/gte-large-en-v1.5
:sync: gte-large-en-v1.5
```console
```text
pip install "transformers<5" # WA for optimum-intel and model support for new transformers
python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --pooling CLS --weight-format int8 --target_device GPU --extra_quantization_params "--library sentence_transformers" --config_file_path models/config.json --model_repository_path models
```
:::