openvinotoolkit · dtrawins · May 13, 2026 · May 19, 2026 · May 19, 2026 · May 19, 2026
diff --git a/demos/common/export_models/README.md b/demos/common/export_models/README.md
@@ -40,62 +40,16 @@ For every use case subcommand there is adjusted list of parameters:
 ```console
 python export_model.py text_generation --help
 ```
-Expected Output:
-```console
-usage: export_model.py text_generation [-h] [--model_repository_path MODEL_REPOSITORY_PATH] --source_model SOURCE_MODEL [--model_name MODEL_NAME] [--weight-format PRECISION] [--config_file_path CONFIG_FILE_PATH] [--overwrite_models] [--target_device TARGET_DEVICE] [--ov_cache_dir OV_CACHE_DIR]
-                                       [--extra_quantization_params EXTRA_QUANTIZATION_PARAMS] [--pipeline_type {LM,LM_CB,VLM,VLM_CB,AUTO}] [--kv_cache_precision {u8}] [--enable_prefix_caching ENABLE_PREFIX_CACHING] [--disable_dynamic_split_fuse] [--max_num_batched_tokens MAX_NUM_BATCHED_TOKENS] [--max_num_seqs MAX_NUM_SEQS]
-                                       [--cache_size CACHE_SIZE] [--draft_source_model DRAFT_SOURCE_MODEL] [--draft_model_name DRAFT_MODEL_NAME] [--draft_eagle3_mode] [--max_prompt_len MAX_PROMPT_LEN] [--prompt_lookup_decoding] [--reasoning_parser {qwen3,gptoss}]
-                                       [--tool_parser {llama3,phi4,hermes3,mistral,qwen3coder,gptoss,devstral,lfm2}] [--enable_tool_guided_generation]
 
-options:
-  -h, --help            show this help message and exit
-  --model_repository_path MODEL_REPOSITORY_PATH
-                        Where the model should be exported to
-  --source_model SOURCE_MODEL
-                        HF model name or path to the local folder with PyTorch or OpenVINO model
-  --model_name MODEL_NAME
-                        Model name that should be used in the deployment. Equal to source_model if HF model name is used
-  --weight-format PRECISION
-                        precision of the exported model
-  --config_file_path CONFIG_FILE_PATH
-                        path to the config file
-  --overwrite_models    Overwrite the model if it already exists in the models repository
-  --target_device TARGET_DEVICE
-                        CPU, GPU, NPU or HETERO, default is CPU
-  --ov_cache_dir OV_CACHE_DIR
-                        Folder path for compilation cache to speedup initialization time
-  --extra_quantization_params EXTRA_QUANTIZATION_PARAMS
-                        Add advanced quantization parameters. Check optimum-intel documentation. Example: "--sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2"
-  --pipeline_type {LM,LM_CB,VLM,VLM_CB,AUTO}
-                        Type of the pipeline to be used. AUTO is used by default
-  --kv_cache_precision {u8}
-                        u8 or empty (model default). Reduced kv cache precision to u8 lowers the cache size consumption.
-  --enable_prefix_caching ENABLE_PREFIX_CACHING
-                        This algorithm is used to cache the prompt tokens. Default is True.
-  --disable_dynamic_split_fuse
-                        The maximum number of tokens that can be batched together.
-  --max_num_batched_tokens MAX_NUM_BATCHED_TOKENS
-                        empty or integer. The maximum number of tokens that can be batched together.
-  --max_num_seqs MAX_NUM_SEQS
-                        256 by default. The maximum number of sequences that can be processed together.
-  --cache_size CACHE_SIZE
-                        KV cache size in GB. If not set, cache is allocated dynamically.
-  --draft_source_model DRAFT_SOURCE_MODEL
-                        HF model name or path to the local folder with PyTorch or OpenVINO draft model. Using this option will create configuration for speculative decoding
-  --draft_model_name DRAFT_MODEL_NAME
-                        Draft model name that should be used in the deployment. Equal to draft_source_model if HF model name is used. Available only in draft_source_model has been specified.
-  --draft_eagle3_mode   Set this flag if you use EAGLE3 draft model for speculative decoding
-  --max_prompt_len MAX_PROMPT_LEN
-                        Sets NPU specific property for maximum number of tokens in the prompt. Not effective if target device is not NPU
-  --prompt_lookup_decoding
-                        Set pipeline to use prompt lookup decoding
-  --reasoning_parser {qwen3,gptoss}
-                        Set the type of the reasoning parser for reasoning content extraction
-  --tool_parser {llama3,phi4,hermes3,mistral,qwen3coder,gptoss,devstral,lfm2}
-                        Set the type of the tool parser for tool calls extraction
-  --enable_tool_guided_generation
-                        Enables enforcing tool schema during generation. Requires setting tool_parser
-```
+> Note: Exporting some models might require a different transformers version than the one specified in requirements.txt. Check [supported models](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/). If a custom transformers version is required, install it afterwards via `pip install transformers==<version>`.
+Some of the exceptions include:
+- Alibaba-NLP/gte-large-en-v1.5 - `transformers<5.0`
+- OpenGVLab/InternVL - `transformers<5.0`
+- Qwen3-80B-Next and Qwen3-coder-next - `transformers<5.0`
+- gemma4 - `transformers==5.5`
+- Qwen3.5 and Qwen3.6 - `transformers==5.2`
+
+
 
 ## Model Export Examples
 
@@ -152,7 +106,7 @@ python export_model.py embeddings_ov --source_model Qwen/Qwen3-Embedding-0.6B --
 #### Embeddings with `sentence_transformers` library
 Some embedding models require special handling during export. For example:
 ```console
-python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --extra_quantization_params "--library sentence_transformers" --weight-format fp16 --config_file_path models/config_all.json
+python export_model.py embeddings_ov --source_model nomic-ai/nomic-embed-text-v1.5 --extra_quantization_params "--library sentence_transformers" --pooling MEAN --weight-format fp16 --config_file_path models/config_all.json
 ```
 Known models that require it:
 - Alibaba-NLP/gte-large-en-v1.5

diff --git a/demos/common/export_models/requirements.txt b/demos/common/export_models/requirements.txt
@@ -14,3 +14,4 @@ sentence_transformers==5.3.0
 sentencepiece  # Required by: transformers`
 torchvision
 requests
+einops
diff --git a/demos/continuous_batching/accuracy/README.md b/demos/continuous_batching/accuracy/README.md
@@ -10,37 +10,21 @@ It reports end to end quality of served model from the client application point
 ## Preparing the lm-evaluation-harness framework 
 
 Install the framework via pip:
-```bash
+```text
 pip3 install --extra-index-url "https://download.pytorch.org/whl/cpu" lm_eval[api] langdetect immutabledict dotenv openai
 ```
 
-## Exporting the models
-```bash
-git clone https://github.com/openvinotoolkit/model_server.git
-cd model_server
-pip3 install -U -r demos/common/export_models/requirements.txt
-mkdir models 
-python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3.1-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
-python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3.1-8B --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
-python demos/common/export_models/export_model.py text_generation --source_model OpenGVLab/InternVL2_5-8B --weight-format fp16 --config_file_path models/config.json --model_repository_path models
-python demos/common/export_models/export_model.py text_generation --source_model Qwen/Qwen3-8B --model_name openvino-qwen3-8b-int8 --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser hermes3 --overwrite_models
-```
-
 ## Starting the model server
 
-### With Docker
-```bash
-docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json
-```
 
-### On Baremetal
-```bash
-ovms --rest_port 8000 --config_path ./models/config.json
-```
+Examples of LLM and VLM model deployment are documented in other demos, such as:
+- [Agentic usage for LLM models](../agentic_ai/README.md)
+- [Using VLM models](../vlm/README.md)
+
 
 ## Running the tests for LLM models
 
-```bash
+```text
 lm-eval --model local-chat-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG  --log_samples --output_path test/ --seed 1 --apply_chat_template --limit 100
 
 local-chat-completions ({'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'base_url': 'http://localhost:8000/v3/chat/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 100.0, num_fewshot: None, batch_size: 1
@@ -52,7 +36,7 @@ local-chat-completions ({'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'base
 
 While testing the non chat model and `completion` endpoint, the command would look like this:
 
-```bash
+```text
 lm-eval --model local-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG  --log_samples --output_path results/ --seed 1 --limit 100
 
 local-completions ({'model': 'meta-llama/Meta-Llama-3.1-8B', 'base_url': 'http://localhost:8000/v3/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 100.0, num_fewshot: None, batch_size: 1
@@ -64,28 +48,27 @@ local-completions ({'model': 'meta-llama/Meta-Llama-3.1-8B', 'base_url': 'http:/
 
 Other examples are below:
 
-```bash
+```text
 lm-eval --model local-chat-completions --tasks leaderboard_ifeval --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --limit 100 --apply_chat_template  
 ```
 
-```bash
+```text
 lm-eval --model local-completions --tasks wikitext --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=10,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --limit 100
 ```
 
 ## Running the tests for VLM models
 
 Use [lmms-eval project](https://github.com/EvolvingLMMs-Lab/lmms-eval) - mme and mmmu_val tasks. 
 
-```bash
+```text
 export OPENAI_BASE_URL=http://localhost:8000/v3
 export OPENAI_API_KEY="unused"
 git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
 cd lmms-eval
-git checkout 88b23e2bfa16a1edbc16e9e238ed82130b3a4f56
 pip install -e . --extra-index-url "https://download.pytorch.org/whl/cpu"
 python -m lmms_eval \
     --model openai_compatible \
-    --model_args model_version=OpenGVLab/InternVL2_5-8B,max_retries=1 \
+    --model_args model_version=OpenVINO/InternVL2-8B_int4-ov,max_retries=1 \
     --tasks mme,mmmu_val \
     --batch_size 1 \
     --log_samples \
@@ -121,7 +104,7 @@ pip install -e . --extra-index-url "https://download.pytorch.org/whl/cpu"
 The commands below assumes the models is deployed with the name `ovms-model`. It must match the name set in the `bfcl_eval/constants/model_config.py`.
 ```text
 export OPENAI_BASE_URL=http://localhost:8000/v3
-export CHAT_TEMPLATE_KWARGS='{"enable_thinking":false, "reasoning_effort":"low"}'
+export CHAT_TEMPLATE_KWARGS='{"enable_thinking":false, "reasoning_effort":"low", "preserve_reasoning":false}'
 
 bfcl generate --model ovms-model --test-category simple_python,multiple --temperature 0.0 --num-threads 100 -o --result-dir model_name_dir
 bfcl evaluate --model ovms-model --result-dir model_name_dir 
@@ -130,15 +113,15 @@ bfcl evaluate --model ovms-model --result-dir model_name_dir
 Alternatively, use the model name `ovms-model-stream` to run the tests with stream requests. The results should be the same.
 ```text
 export OPENAI_BASE_URL=http://localhost:8000/v3
-bfcl generate --model ovms-model-stream --test-category simple_python,multiple --temperature 0.0 --num-threads 100 -o --result-dir model_name_dir
+bfcl generate --model ovms-model-stream --test-category simple_python,multiple,multi_turn_base --temperature 0.0 --num-threads 10 -o --result-dir model_name_dir
 bfcl evaluate --model ovms-model-stream --result-dir model_name_dir 
 ```
 
 **Analyzing results**
 The output artifacts will be stored in `result` and `scores`. For example:
 
 ```text
-cat score/openvino-qwen3-8b-int4-FC/BFCL_v3_simple_python_score.json | head -1
+cat score/openvino-qwen3-8b-int4-FC/BFCL_v4_simple_python_score.json | head -1
 {"accuracy": 0.95, "correct_count": 380, "total_count": 400}
 ```
 Those results can be compared with the reference from the [berkeley leaderbaord](https://gorilla.cs.berkeley.edu/leaderboard.html#leaderboard).

diff --git a/demos/embeddings/README.md b/demos/embeddings/README.md
@@ -154,7 +154,8 @@ python export_model.py embeddings_ov --source_model intfloat/multilingual-e5-lar
 :::
 :::{tab-item} Alibaba-NLP/gte-large-en-v1.5
 :sync: gte-large-en-v1.5
-```console
+```text
+pip install "transformers<5"  # Workaround until optimum-intel and the model support newer transformers
 python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --pooling CLS --extra_quantization_params "--library sentence_transformers"
 ```
 :::
@@ -225,7 +226,8 @@ python export_model.py embeddings_ov --source_model intfloat/multilingual-e5-lar
 :::
 :::{tab-item} Alibaba-NLP/gte-large-en-v1.5
 :sync: gte-large-en-v1.5
-```console
+```text
+pip install "transformers<5"  # Workaround until optimum-intel and the model support newer transformers
 python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --pooling CLS --weight-format int8 --target_device GPU --extra_quantization_params "--library sentence_transformers" --config_file_path models/config.json --model_repository_path models
 ```
 :::
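
Review note: the transformers version exceptions listed in the `demos/common/export_models/README.md` change are easy to get wrong when switching between models. The sketch below is not part of this change or of `export_model.py`. It is a hypothetical helper that checks an installed version against the two specifier forms used in that note (`<X` and `==X`). The names `TRANSFORMERS_PINS`, `parse_version`, and `satisfies` are assumptions made for illustration, and the comparison is a simplified approximation of pip's matching that looks only at the leading numeric release segment.

```python
import re
from importlib import metadata

# Pins copied from the README note in this change; the keys are the labels
# used there, not necessarily exact Hugging Face model IDs.
TRANSFORMERS_PINS = {
    "Alibaba-NLP/gte-large-en-v1.5": "<5.0",
    "OpenGVLab/InternVL": "<5.0",
    "gemma4": "==5.5",
    "Qwen3.5 and Qwen3.6": "==5.2",
}


def parse_version(text):
    """Return the leading numeric release segment as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def satisfies(installed, spec):
    """Check `installed` against a '<X' or '==X' specifier (simplified)."""
    if spec.startswith("=="):
        wanted = parse_version(spec[2:])
    elif spec.startswith("<"):
        wanted = parse_version(spec[1:])
    else:
        raise ValueError(f"unsupported specifier: {spec!r}")
    have = parse_version(installed)
    # Zero-pad both tuples to the same length so that 5.5 and 5.5.0 compare equal.
    width = max(len(have), len(wanted))
    have += (0,) * (width - len(have))
    wanted += (0,) * (width - len(wanted))
    return have == wanted if spec.startswith("==") else have < wanted


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        for model, spec in TRANSFORMERS_PINS.items():
            status = "ok" if satisfies(installed, spec) else "needs reinstall"
            print(f"{model}: transformers{spec} -> {status} (installed {installed})")
```

Run as a script, it prints one line per listed model, which makes it clear when a `pip install transformers==<version>` step is needed before an export. For anything beyond these two specifier forms, a full specifier parser would be the safer choice.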