
[OMNIML-2850] [3/n] Adds sparse attention calibration #538

Merged
kaix-nv merged 11 commits into main from kaix/sparse_attention_calibration
Feb 18, 2026

Conversation

@kaix-nv
Contributor

@kaix-nv kaix-nv commented Nov 11, 2025

What does this PR do?

Type of change: New feature

Overview:

  • Adds the sparse attention calibration algorithm
  • Adds chunked prefill to support long context lengths (ctx_len)
  • Separates calibration for the prefill and decode phases

Usage

import modelopt.torch.sparsity.attention_sparsity as mtsa

# Apply sparse attention with calibration
model = mtsa.sparsify(model, config=SKIP_SOFTMAX_CALIB)

# Print summary - now shows actual thresholds
mtsa.print_sparse_attention_summary(model)
# Output:
# Method: flash_skip_softmax, Threshold: Dynamic (λ=437.395926)

Or, via the llm_eval integration (HuggingFace sparse attention example):

python examples/llm_sparsity/attention_sparsity/hf_sa.py \
    --pyt_ckpt_path Qwen/Qwen3-4B \
    --sparse_attn skip_softmax_calib

The Calibration Method

Calibration Algorithm

  • Implements the inverse power model: scale_factor = k / (1 - sparsity)^p
  • Fits the model parameters (k, p) per phase (prefill and decode) using scipy.optimize.curve_fit (see the sketch below)
  • At inference: threshold = k / (1 - target_sparsity)^p / seqlen
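
As a rough illustration of the fit (not the actual calibrate.py implementation; the data points and variable names below are made up for the example):

import numpy as np
from scipy.optimize import curve_fit

def inverse_power(sparsity, k, p):
    # scale_factor = k / (1 - sparsity)^p
    return k / (1.0 - sparsity) ** p

# Illustrative (sparsity, scale_factor) pairs collected during calibration for one phase
sparsities = np.array([0.5, 0.7, 0.8, 0.9, 0.95])
scale_factors = np.array([2400.0, 4570.0, 7610.0, 18200.0, 43600.0])

# Fit k and p for this phase
(k, p), _ = curve_fit(inverse_power, sparsities, scale_factors, p0=(1000.0, 1.0))

# At inference, the threshold is derived from the fitted parameters,
# the requested target sparsity, and the current sequence length
target_sparsity, seqlen = 0.9, 8192
threshold = k / (1.0 - target_sparsity) ** p / seqlen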

Why the Inverse Power Model?

The inverse power model gives a better fit to the relationship between the sparsity ratio and threshold_scale_factor.
(Figure: sparsity_model_analysis)

Runtime Flexibility

  • Target sparsity can be changed at inference time without recalibration
  • Users can adjust module._sparse_method_instance.target_sparse_ratio dynamically (see the sketch below)
  • The threshold automatically adapts to the sequence length
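
A minimal sketch of such an adjustment, assuming the converted attention modules expose the _sparse_method_instance attribute mentioned above (the traversal and the 0.95 value are illustrative):

# Raise the target sparsity on all calibrated attention modules without recalibrating
for name, module in model.named_modules():
    method = getattr(module, "_sparse_method_instance", None)
    if method is not None:
        method.target_sparse_ratio = 0.95  # threshold re-derives from the fitted k, p and seqlen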

Testing

The calibration results for Qwen/Qwen3-30B-A3B-Thinking-2507 are shown below and are mostly consistent with the ground-truth numbers collected from the kernel side.

Prefill Calibration Results:
  Model: scale_factor = k / (1 - sparsity)^p
  Fitted k: 1003.3990
  Fitted p: 1.2589
  R-squared: 0.827549

Scale factors for different target sparsities:
  Target     Scale Factor
  ---------- ---------------
  50%        2401.35
  70%        4568.26
  80%        7610.98
  90%        18214.70
  95%        43591.65
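
These scale factors follow directly from the fitted parameters; a quick sanity check with the printed k and p reproduces the table to within rounding:

k, p = 1003.3990, 1.2589
for s in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(f"{s:.0%}: {k / (1.0 - s) ** p:.2f}")
# 50%: ~2401, 70%: ~4568, 80%: ~7611, 90%: ~18217, 95%: ~43588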

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

@kaix-nv kaix-nv requested review from a team as code owners November 11, 2025 22:38
@kaix-nv kaix-nv requested review from RalphMao and removed request for RalphMao November 11, 2025 22:38
@codecov

codecov bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 68.67196% with 276 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.54%. Comparing base (3801923) to head (179e8dd).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
...arsity/attention_sparsity/calibration/calibrate.py 26.24% 104 Missing ⚠️
...rsity/attention_sparsity/calibration/calibrator.py 44.00% 70 Missing ⚠️
...ty/attention_sparsity/calibration/ruler_dataset.py 84.90% 53 Missing ⚠️
...pt/torch/sparsity/attention_sparsity/conversion.py 59.45% 30 Missing ⚠️
...ch/sparsity/attention_sparsity/sparse_attention.py 63.15% 7 Missing ⚠️
...delopt/torch/sparsity/attention_sparsity/config.py 90.76% 6 Missing ⚠️
...y/attention_sparsity/methods/flash_skip_softmax.py 90.00% 5 Missing ⚠️
...ch/sparsity/attention_sparsity/methods/registry.py 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #538      +/-   ##
==========================================
- Coverage   73.74%   73.54%   -0.21%     
==========================================
  Files         199      205       +6     
  Lines       21183    22000     +817     
==========================================
+ Hits        15621    16179     +558     
- Misses       5562     5821     +259     

☔ View full report in Codecov by Sentry.

@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 8c7ee86 to da6f627 on November 12, 2025 00:17
@kaix-nv kaix-nv changed the title [3/n] Adds sparse attention integration to the llm_eval examples [OMNIML-2850] [3/n] Adds sparse attention integration to the llm_eval examples Nov 12, 2025
@kaix-nv kaix-nv changed the title [OMNIML-2850] [3/n] Adds sparse attention integration to the llm_eval examples [OMNIML-2850][3/n] Adds sparse attention integration to the llm_eval examples Nov 12, 2025
@kaix-nv kaix-nv changed the title [OMNIML-2850][3/n] Adds sparse attention integration to the llm_eval examples [OMNIML-2850] [3/n] Adds sparse attention integration to the llm_eval examples Nov 12, 2025
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch 4 times, most recently from 525a119 to c9d7008 on November 13, 2025 07:40
@kaix-nv kaix-nv changed the title [OMNIML-2850] [3/n] Adds sparse attention integration to the llm_eval examples [OMNIML-2850] [3/n] Adds sparse attention calibration; Adds llm_eval support Nov 14, 2025
@kaix-nv kaix-nv changed the title [OMNIML-2850] [3/n] Adds sparse attention calibration; Adds llm_eval support [OMNIML-2850] [3/n] Adds sparse attention calibration Nov 14, 2025
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch 5 times, most recently from 7727793 to 2864629 on December 1, 2025 11:35
@kaix-nv kaix-nv requested a review from a team as a code owner December 1, 2025 11:35
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 2864629 to ca7e24e on December 1, 2025 15:19
@kaix-nv kaix-nv removed the request for review from kevalmorabia97 December 1, 2025 15:25
@kevalmorabia97
Collaborator

kevalmorabia97 commented Dec 1, 2025

@kaix-nv GitHub is showing 7,000+ lines of code as part of this PR. Is that accurate?
It shouldn’t be that much. Less than half of the code should remain after rebasing on the preceding PR.

@kaix-nv kaix-nv requested a review from jy-yuan December 8, 2025 21:52
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch 4 times, most recently from 3474b6f to 74a29ea on December 13, 2025 21:00
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch 3 times, most recently from 0553ec6 to a5136e8 on January 31, 2026 01:37
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 5cd6149 to 4b7efca on February 10, 2026 00:20
@kaix-nv kaix-nv enabled auto-merge (squash) February 10, 2026 00:21
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 4b7efca to 5b22b85 on February 10, 2026 00:23
Collaborator

How long does running pytest tests/gpu/torch/sparsity take?

Comment on lines +18 to +21
import pytest

pytest.importorskip("transformers")

Collaborator

+1

@kaix-nv
Contributor Author

kaix-nv commented Feb 12, 2026

@kevalmorabia97 All feedback has been addressed. Please take another look. Thanks.

@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 9a7ae2a to 7529d30 on February 12, 2026 22:53
Collaborator

@kevalmorabia97 kevalmorabia97 left a comment

Some minor comments; otherwise LGTM. Thanks for addressing my comments.

Collaborator

Can we merge this file with ruler_utils.py and name it ruler_dataset.py? Both are specific to the RULER dataset only.

Contributor Author

Updated

Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 7529d30 to 2e3059b on February 18, 2026 00:23
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 2e3059b to 179e8dd on February 18, 2026 00:34
Contributor

@Edwardf0t1 Edwardf0t1 left a comment

Add codeowner approval.

@kaix-nv Will the VSA support be your next PR? =)

@kaix-nv kaix-nv merged commit 9e38041 into main Feb 18, 2026
37 checks passed
@kaix-nv kaix-nv deleted the kaix/sparse_attention_calibration branch February 18, 2026 02:18
@kaix-nv
Contributor Author

kaix-nv commented Feb 18, 2026

Add codeowner approval.

@kaix-nv Will the VSA support be your next PR? =)

Yes, the VSA PR will be submitted soon.
