Fix LLaVA packed sequence cu_seqlens after image token expansion #3427
Closed
aroshanghias-nvd wants to merge 3 commits into NVIDIA:main from
Conversation
When sequence packing is used with LLaVA, PackedSeqParams.cu_seqlens
boundaries arrive in collapsed space (1 token per image), but
_preprocess_data expands each image placeholder into img_seq_len
embedding tokens. The cu_seqlens were never updated to reflect this
expansion, causing attention kernels to see incorrect sub-sequence
boundaries, corrupting attention masking, RoPE positions, and CP/SP
sharding for every packed sub-sequence.
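To make the mismatch concrete, here is a hypothetical packed batch (the numbers are illustrative only, not taken from this PR): two sub-sequences of 8 and 6 collapsed tokens, each containing one image placeholder, with img_seq_len = 576.

```python
# Hypothetical packed batch: 2 sub-sequences, 1 image placeholder each,
# each placeholder expanding into img_seq_len = 576 vision embeddings.
cu_seqlens_collapsed = [0, 8, 14]       # what PackedSeqParams carries today
# After _preprocess_data, every placeholder grows by 576 - 1 = 575 tokens,
# so the boundaries the attention kernel should actually see are:
cu_seqlens_expanded = [0, 583, 1164]    # 8 + 575, 14 + 2 * 575
```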
Add _remap_packed_seq_params_after_image_expansion() to LLaVAModel
which translates all cu_seqlens entries from collapsed to expanded
space using the same cumulative-sum position mapping as
_preprocess_data. Call it in forward() right after _preprocess_data
and before the language model.
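A minimal standalone sketch of that remapping, assuming a single flattened sample, a known image_token_index, and a fixed img_seq_len per image (real inputs can have variable tiles per image, which the later test commit exercises). The PR's actual helper is _remap_packed_seq_params_after_image_expansion(); the name and signature below are illustrative, not the upstream implementation:

```python
import torch

def remap_cu_seqlens_after_image_expansion(
    cu_seqlens: torch.Tensor,      # collapsed-space boundaries, shape [num_seqs + 1]
    input_ids: torch.Tensor,       # collapsed-space token ids, shape [total_tokens]
    image_token_index: int,
    img_seq_len: int,
) -> torch.Tensor:
    """Translate packed-sequence boundaries from collapsed space (1 token per
    image) to expanded space (img_seq_len tokens per image).

    Illustrative sketch only; the upstream helper operates on PackedSeqParams
    and reuses the cumulative-sum position mapping built by _preprocess_data.
    """
    # Extra tokens contributed by each collapsed position after expansion.
    growth = (input_ids == image_token_index).long() * (img_seq_len - 1)
    # Prefix sum: total extra tokens inserted before each collapsed position.
    offset = torch.cat([growth.new_zeros(1), torch.cumsum(growth, dim=0)])
    # Shift every boundary by the expansion that happened before it.
    return cu_seqlens + offset[cu_seqlens.long()]
```

With the numbers from the earlier example (one placeholder per sub-sequence, say at collapsed positions 2 and 10), this maps [0, 8, 14] to [0, 583, 1164].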
Add test_forward_packed_vs_batched parameterized over num_samples={1, 4, 16} that compares batched-unpacked vs packed forward passes
using token_mult_prob_error (same metric as NeMo-RL). Without the
fix: 1.10-1.13. With the fix: 1.000000.
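For context, token_mult_prob_error is NeMo-RL's metric and its exact definition lives there; a plausible stand-in (an assumption, not copied from either repository) is the worst-case multiplicative disagreement between the per-token probabilities of the two passes, which is exactly 1.0 when they match:

```python
import torch

def token_mult_prob_error(logprobs_a: torch.Tensor, logprobs_b: torch.Tensor) -> float:
    """Hypothetical formulation: largest per-token probability ratio between
    two forward passes over the same tokens (1.0 means identical outputs)."""
    return torch.exp((logprobs_a - logprobs_b).abs().max()).item()
```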
Signed-off-by: Ali Roshan Ghias <aliroshanghias@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
added 2 commits on February 13, 2026 17:14
Remove unnecessary intermediate variable _image_token_index and pass the expression directly to _preprocess_data for cleaner code.
Signed-off-by: Ali Roshan Ghias <aliroshanghias@nvidia.com>
…overage
The cleanup commit c836295 removed the _image_token_index local variable but forgot to update its usage in the _remap_packed_seq_params call at line 1002, causing a NameError on any packed-sequence forward pass with images. Restore the local variable so both _preprocess_data and _remap use it. Also fix test_forward's cu_seqlens, which exceeded the collapsed input length (previously unreachable), and enhance test_forward_packed_vs_batched with variable tiles per image ([1,2,1,3,1] cycle) and multi-image samples.
Signed-off-by: Ali Roshan Ghias <aliroshanghias@nvidia.com>
Author
Let's drop this idea for now. It turns out that Megatron-LM handles CP and sequence packing fine, because it uses Energon, and Energon does image token handling at data preparation. So if we update MCore, it will break MLM SFT!
Test results
Test uses a real LLaVAModel (CLIP vision encoder + MLP projection + 3-layer GPT) on GPU with TE attention kernels, with no mocking.
Test plan
- test_forward_packed_vs_batched fails without the fix (token_mult_prob_error >> 1.02)
- test_forward, test_preprocess_data still pass