
[OMNIML-2850] [3/n] Adds sparse attention calibration #538

Merged
kaix-nv merged 11 commits into main from kaix/sparse_attention_calibration
Feb 18, 2026

Conversation

@kaix-nv
Contributor

@kaix-nv kaix-nv commented Nov 11, 2025

What does this PR do?

Type of change: New feature

Overview:

  • Adds the sparse attention calibration algorithm
  • Adds chunked prefill to support long context lengths (ctx_len)
  • Separates calibration for the prefill and decode phases

Usage

import modelopt.torch.sparsity.attention_sparsity as mtsa

# Apply sparse attention with calibration
model = mtsa.sparsify(model, config=SKIP_SOFTMAX_CALIB)

# Print summary - now shows actual thresholds
mtsa.print_sparse_attention_summary(model)
# Output:
# Method: flash_skip_softmax, Threshold: Dynamic (λ=437.395926)

Or, via the llm_eval integration (HuggingFace sparse attention example):

python examples/llm_sparsity/attention_sparsity/hf_sa.py \
    --pyt_ckpt_path Qwen/Qwen3-4B \
    --sparse_attn skip_softmax_calib

The Calibration Method

Calibration Algorithm

  • Implements the inverse power model: scale_factor = k / (1 - sparsity)^p
  • Fits the model parameters (k, p) per phase (prefill and decode) using scipy.optimize.curve_fit (see the sketch below)
  • At inference: threshold = k / (1 - target_sparsity)^p / seqlen
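
As a rough illustration of the fit (not the actual calibrate.py implementation; the data points and variable names below are made up for the example):

import numpy as np
from scipy.optimize import curve_fit

def inverse_power(sparsity, k, p):
    # scale_factor = k / (1 - sparsity)^p
    return k / (1.0 - sparsity) ** p

# Illustrative (sparsity, scale_factor) pairs collected during calibration for one phase
sparsities = np.array([0.5, 0.7, 0.8, 0.9, 0.95])
scale_factors = np.array([2400.0, 4570.0, 7610.0, 18200.0, 43600.0])

# Fit k and p for this phase
(k, p), _ = curve_fit(inverse_power, sparsities, scale_factors, p0=(1000.0, 1.0))

# At inference, the threshold is derived from the fitted parameters,
# the requested target sparsity, and the current sequence length
target_sparsity, seqlen = 0.9, 8192
threshold = k / (1.0 - target_sparsity) ** p / seqlen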

Why the Inverse Power Model?

The inverse power model gives a better fit to the relationship between the sparsity ratio and threshold_scale_factor.
(Figure: sparsity_model_analysis)

Runtime Flexibility

  • Target sparsity can be changed at inference time without recalibration
  • Users can adjust module._sparse_method_instance.target_sparse_ratio dynamically (see the sketch below)
  • The threshold automatically adapts to the sequence length
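
A minimal sketch of such an adjustment, assuming the converted attention modules expose the _sparse_method_instance attribute mentioned above (the traversal and the 0.95 value are illustrative):

# Raise the target sparsity on all calibrated attention modules without recalibrating
for name, module in model.named_modules():
    method = getattr(module, "_sparse_method_instance", None)
    if method is not None:
        method.target_sparse_ratio = 0.95  # threshold re-derives from the fitted k, p and seqlen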

Testing

The calibration results for Qwen/Qwen3-30B-A3B-Thinking-2507 are shown below and are mostly consistent with the ground-truth numbers collected from the kernel side.

Prefill Calibration Results:
  Model: scale_factor = k / (1 - sparsity)^p
  Fitted k: 1003.3990
  Fitted p: 1.2589
  R-squared: 0.827549

Scale factors for different target sparsities:
  Target     Scale Factor
  ---------- ---------------
  50%        2401.35
  70%        4568.26
  80%        7610.98
  90%        18214.70
  95%        43591.65
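
These scale factors follow directly from the fitted parameters; a quick sanity check with the printed k and p reproduces the table to within rounding:

k, p = 1003.3990, 1.2589
for s in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(f"{s:.0%}: {k / (1.0 - s) ** p:.2f}")
# 50%: ~2401, 70%: ~4568, 80%: ~7611, 90%: ~18217, 95%: ~43588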

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

@kaix-nv kaix-nv requested review from a team as code owners November 11, 2025 22:38
@kaix-nv kaix-nv requested review from RalphMao and removed request for RalphMao November 11, 2025 22:38
@codecov

codecov bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 68.67196% with 276 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.54%. Comparing base (3801923) to head (179e8dd).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
...arsity/attention_sparsity/calibration/calibrate.py 26.24% 104 Missing ⚠️
...rsity/attention_sparsity/calibration/calibrator.py 44.00% 70 Missing ⚠️
...ty/attention_sparsity/calibration/ruler_dataset.py 84.90% 53 Missing ⚠️
...pt/torch/sparsity/attention_sparsity/conversion.py 59.45% 30 Missing ⚠️
...ch/sparsity/attention_sparsity/sparse_attention.py 63.15% 7 Missing ⚠️
...delopt/torch/sparsity/attention_sparsity/config.py 90.76% 6 Missing ⚠️
...y/attention_sparsity/methods/flash_skip_softmax.py 90.00% 5 Missing ⚠️
...ch/sparsity/attention_sparsity/methods/registry.py 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #538      +/-   ##
==========================================
- Coverage   73.74%   73.54%   -0.21%     
==========================================
  Files         199      205       +6     
  Lines       21183    22000     +817     
==========================================
+ Hits        15621    16179     +558     
- Misses       5562     5821     +259     

☔ View full report in Codecov by Sentry.

@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 8c7ee86 to da6f627 on November 12, 2025 00:17
@kaix-nv kaix-nv changed the title [3/n] Adds sparse attention integration to the llm_eval examples [OMNIML-2850] [3/n] Adds sparse attention integration to the llm_eval examples Nov 12, 2025
@kaix-nv kaix-nv changed the title [OMNIML-2850] [3/n] Adds sparse attention integration to the llm_eval examples [OMNIML-2850][3/n] Adds sparse attention integration to the llm_eval examples Nov 12, 2025
@kaix-nv kaix-nv changed the title [OMNIML-2850][3/n] Adds sparse attention integration to the llm_eval examples [OMNIML-2850] [3/n] Adds sparse attention integration to the llm_eval examples Nov 12, 2025
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch 4 times, most recently from 525a119 to c9d7008 on November 13, 2025 07:40
@kaix-nv kaix-nv changed the title [OMNIML-2850] [3/n] Adds sparse attention integration to the llm_eval examples [OMNIML-2850] [3/n] Adds sparse attention calibration; Adds llm_eval support Nov 14, 2025
@kaix-nv kaix-nv changed the title [OMNIML-2850] [3/n] Adds sparse attention calibration; Adds llm_eval support [OMNIML-2850] [3/n] Adds sparse attention calibration Nov 14, 2025
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch 5 times, most recently from 7727793 to 2864629 on December 1, 2025 11:35
@kaix-nv kaix-nv requested a review from a team as a code owner December 1, 2025 11:35
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 2864629 to ca7e24e on December 1, 2025 15:19
@kaix-nv kaix-nv removed the request for review from kevalmorabia97 December 1, 2025 15:25
@kevalmorabia97
Collaborator

kevalmorabia97 commented Dec 1, 2025

@kaix-nv GitHub is showing 7,000+ lines of code as part of this PR. Is that accurate?
It shouldn’t be that much. Less than half of the code should remain after rebasing on the preceding PR.

@kaix-nv kaix-nv requested a review from jy-yuan December 8, 2025 21:52
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch 4 times, most recently from 3474b6f to 74a29ea on December 13, 2025 21:00
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch 3 times, most recently from 0553ec6 to a5136e8 on January 31, 2026 01:37
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 5cd6149 to 4b7efca on February 10, 2026 00:20
@kaix-nv kaix-nv enabled auto-merge (squash) February 10, 2026 00:21
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 4b7efca to 5b22b85 on February 10, 2026 00:23
Collaborator

How long does running pytest tests/gpu/torch/sparsity take?

Comment on lines +18 to +21
import pytest

pytest.importorskip("transformers")

Collaborator

+1

@kaix-nv
Contributor Author

kaix-nv commented Feb 12, 2026

@kevalmorabia97 All feedback has been addressed. Please take another look. Thanks.

@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 9a7ae2a to 7529d30 on February 12, 2026 22:53
Collaborator

@kevalmorabia97 kevalmorabia97 left a comment

Some minor comments; otherwise LGTM. Thanks for addressing my comments.

Collaborator

Can we merge this file with ruler_utils.py and name it ruler_dataset.py? Both are specific to the RULER dataset only.

Contributor Author

Updated

Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 7529d30 to 2e3059b on February 18, 2026 00:23
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_calibration branch from 2e3059b to 179e8dd on February 18, 2026 00:34
Contributor

@Edwardf0t1 Edwardf0t1 left a comment

Add codeowner approval.

@kaix-nv Will the VSA support be your next PR? =)

@kaix-nv kaix-nv merged commit 9e38041 into main Feb 18, 2026
37 checks passed
@kaix-nv kaix-nv deleted the kaix/sparse_attention_calibration branch February 18, 2026 02:18
@kaix-nv
Contributor Author

kaix-nv commented Feb 18, 2026

Add codeowner approval.

@kaix-nv Will the VSA support be your next PR? =)

Yes, the VSA PR will be submitted soon.
