NPU-oriented model compression & quantization framework for large language models.
- Post-training quantization: INT8Dynamic, GPTQ, QuIP, SparseGPT
- Model support: Qwen3, OPT
- vLLM-ascend deployment integration
- Performance evaluation with lm-eval and evalscope
```bash
pip install -e .
```

Requires a CANN environment with `ASCEND_HOME_PATH` set.
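If `ASCEND_HOME_PATH` is not already set, it is typically exported by the CANN toolkit's environment script. A minimal sketch, assuming the common default install location (adjust the path to your system):

```bash
# Assumed default CANN install location; adjust to your system
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Fail early if the variable is still missing
echo "${ASCEND_HOME_PATH:?CANN environment not configured}"
```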
```bash
# Use mirror if HuggingFace is inaccessible
export HF_ENDPOINT="https://hf-mirror.com"
```
Run quantization:

```bash
# INT8 dynamic
python tools/run.py -c configs/opt/int8_dynamic/opt_125m-w8a8.yaml
```
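Judging by the deployment command further below, quantized models are written under `outputs/` mirroring the config layout:

```bash
# Expected output location for the INT8 example above
# (inferred from the deployment example; verify against your run's logs)
ls outputs/opt/int8_dynamic/opt_125m-w8a8
```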
```bash
# GPTQ
python tools/run.py -c configs/opt/gptq/opt_125m-w4a16.yaml
```

Deploy the quantized model with vLLM-ascend:

```bash
bash tools/serve/deploy_vllm.sh outputs/opt/int8_dynamic/opt_125m-w8a8 -d 0 -t 1
```
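Once the server is up, you can sanity-check it through vLLM's OpenAI-compatible API. This sketch assumes the deploy script keeps vLLM's default port 8000 and serves the model under its output path; check the server logs for the actual values:

```bash
# Port and served model name are assumptions; see the server logs
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "outputs/opt/int8_dynamic/opt_125m-w8a8",
        "prompt": "Hello, my name is",
        "max_tokens": 32
      }'
```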
LM-Eval Harness (supports 3 backends: vllm, hf, api):

```bash
# vLLM backend (fastest, direct loading - no server needed)
bash tools/eval/run_lmeval.sh outputs/model --backend vllm --tasks wikitext -d 0
# HuggingFace backend
bash tools/eval/run_lmeval.sh outputs/model --backend hf --tasks wikitext -d 0
# API backend (requires running server)
bash tools/serve/deploy_vllm.sh outputs/model -d 0 -t 1
bash tools/eval/run_lmeval.sh outputs/model --backend api --tasks wikitext
```

Stress Test (requires running vLLM server):

```bash
# Step 1: Deploy vLLM server first
bash tools/serve/deploy_vllm.sh outputs/model -d 0 -t 1
# Step 2: Run stress test against running server
bash tools/eval/run_stress_test.sh outputs/model
```
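If step 2 fails to connect, first confirm the server from step 1 is actually accepting requests. A minimal check, again assuming vLLM's default port 8000:

```bash
# Lists the served models if the server is up (port is an assumption)
curl -s http://localhost:8000/v1/models
```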
| Script | Description |
|---|---|
| `tools/serve/deploy_vllm.sh` | Deploy vLLM inference server |
| `tools/eval/run_lmeval.sh` | Run lm-evaluation-harness (backends: vllm, hf, api) |
| `tools/eval/run_stress_test.sh` | Run stress test via API (requires running server) |
Server deployment options:
- `-d, --devices` - Device IDs (e.g., `0,1` or `4,5`)
- `-t, --tp` - Tensor parallel size
- `--gpu-memory` - GPU memory utilization (default: 0.8)
- `--max-model-len` - Maximum model length (default: 4096)
- `-q, --quantization` - Quantization method (auto-detected on NPU)
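These flags can be combined, for example to serve a GPTQ model on two devices with a longer context window. The model path below is an assumption that follows the same `outputs/` layout as the INT8 example:

```bash
# Illustrative values; flags are the documented options above
bash tools/serve/deploy_vllm.sh outputs/opt/gptq/opt_125m-w4a16 \
  -d 4,5 -t 2 --gpu-memory 0.9 --max-model-len 8192
```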
LM-Eval options:
- `--backend` - Backend type: `vllm`, `hf`, or `api` (default: vllm)
- `--tasks` - Comma-separated benchmark tasks (default: wikitext)
- `--limit` - Limit the number of samples per task
- `--log-samples` - Save model outputs for debugging
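These also combine; `wikitext` is the task used throughout this README, and `lambada_openai` is included purely as an example of a standard lm-eval task name:

```bash
# Runs two benchmarks on 100 samples each and saves per-sample outputs
bash tools/eval/run_lmeval.sh outputs/model --backend vllm \
  --tasks wikitext,lambada_openai --limit 100 --log-samples -d 0
```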
Use `--help` to see all options for each script.
Edit config files in `configs/<model>/<algo>/` to customize the model path, quantization parameters, and pipeline tasks:

```yaml
model:
  model_path: your/model/path
pipeline:
  - type: ptq
    algo_name: INT8Dynamic
```

Project structure:

- `src/npuslim/slim_engine.py` - Orchestrator managing resources and the task pipeline
- `src/npuslim/utils/factory.py` - Factory pattern for models, datasets, tasks, and compressors
- `src/npuslim/compressor/quantizer/` - Quantization algorithms (INT8Dynamic, GPTQ, QuIP, SparseGPT)
- `src/npuslim/vllm_plugin/` - vLLM-ascend integration
- `tools/utils/common.sh` - Shared bash utilities (logging, device detection)
License: Apache-2.0