feat: Baseten contrib third-party dataset support (#851)
michaelfeil wants to merge 4 commits into NVIDIA:main
Conversation
📝 Walkthrough
Adds support for loading unregistered (third-party) datasets by introducing an internal helper to extract texts/messages and extending `get_dataset_samples` to delegate to it.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant get_dataset_samples
    participant SUPPORTED_CONFIG
    participant ThirdPartyHandler as _third_party_get_dataset_samples
    participant DatasetLoader
    participant Tokenizer
    Caller->>get_dataset_samples: get_dataset_samples(name, num_samples, apply_chat_template, tokenizer)
    get_dataset_samples->>SUPPORTED_CONFIG: is dataset registered?
    alt registered
        get_dataset_samples->>DatasetLoader: load via config
        DatasetLoader-->>get_dataset_samples: texts
    else not registered
        get_dataset_samples->>ThirdPartyHandler: delegate (name, num_samples, tokenizer)
        ThirdPartyHandler->>DatasetLoader: load third-party dataset
        DatasetLoader-->>ThirdPartyHandler: raw records
        alt records contain messages
            ThirdPartyHandler->>Tokenizer: require tokenizer with chat template
            Tokenizer-->>ThirdPartyHandler: formatted texts
        else records contain prompts or text column
            ThirdPartyHandler->>ThirdPartyHandler: extract prompt/text column
        else unsupported structure
            ThirdPartyHandler-->>get_dataset_samples: raise error / warn
        end
        ThirdPartyHandler-->>get_dataset_samples: texts
    end
    get_dataset_samples-->>Caller: return texts
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Force-pushed from 58008a4 to 6a0059b.
Actionable comments posted: 4
In `modelopt/torch/utils/dataset_utils.py`:
- Around lines 166-170: Fix the typo in the NotImplementedError message: change "thrid-party" to "third-party" in the f-string (f"Dataset {dataset_name} is not supported...") that references dataset_name and get_supported_datasets(), so the wording reads "third-party" correctly.
- Around lines 121-123: Fix the typo in the warning message: the warn(...) call currently reads "Loading third-party datset {dataset_name} ..."; change it to "Loading third-party dataset {dataset_name} ...", leaving the dataset_name variable and the get_supported_datasets() call unchanged.
- Around lines 152-159: The error messages say "Column {i}" although `i` is the row/sample index. Update both ValueError messages to refer to "Row {i}" or "Sample {i}" (e.g., "Row {i} in dataset {dataset_name} has no messages..." and "Row {i} in dataset {dataset_name} has empty text...") so logs accurately identify the offending sample.
- Around lines 145-164: The code ignores the num_samples parameter and iterates the entire (possibly streaming) dataset. Update the branches that build texts to respect num_samples: in the "messages" branch (the enumerate loop that calls tokenizer.apply_chat_template), stop collecting once i >= num_samples, or limit iteration with itertools.islice; in the "prompt" and "text" branches, replace the full-list comprehensions with a loop that appends each sample["prompt"] or sample["text"] and breaks when the collected count reaches num_samples. Ensure num_samples=None means no limit, and keep the existing empty-messages/empty-text validations.
🧹 Nitpick comments (2)
modelopt/torch/utils/dataset_utils.py (2)
131-141: Redundant `texts = []` initialization.
`texts` is initialized at line 131, then re-initialized at line 141 inside the `messages` branch. The first initialization at line 131 can be removed since each branch handles its own initialization. ♻️ Proposed fix

```diff
- texts = []
  if "messages" in dataset.column_names:
      if tokenizer is None:
          raise ValueError(
```
195-201: Behavior inconsistency between supported and third-party datasets regarding `apply_chat_template`.
For supported datasets, `apply_chat_template` is controlled by the caller. For third-party datasets, chat templates are automatically applied when a `messages` column exists, ignoring the `apply_chat_template` parameter. If this is intentional (since third-party datasets with `messages` columns require chat templates to be useful), consider documenting this behavior in the docstring. Otherwise, consider passing `apply_chat_template` to `_third_party_get_dataset_samples` for consistent behavior.
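The consistent-behavior option could be sketched like this. Everything here is illustrative, not modelopt's actual code: `get_samples` is a hypothetical stand-in, and the no-template fallback (joining message contents) is one possible choice, not something the PR specifies.

```python
from typing import Any


def get_samples(
    dataset: list[dict[str, Any]],
    apply_chat_template: bool,
    tokenizer: Any = None,
) -> list[str]:
    """Illustrative: honor apply_chat_template for third-party datasets too."""
    has_messages = bool(dataset) and "messages" in dataset[0]
    if has_messages and apply_chat_template:
        if tokenizer is None:
            raise ValueError("apply_chat_template=True requires a tokenizer with a chat template")
        return [tokenizer.apply_chat_template(s["messages"], tokenize=False) for s in dataset]
    if has_messages:
        # No template requested: fall back to concatenating message contents
        # (an assumed fallback, chosen here only for illustration).
        return ["\n".join(m["content"] for m in s["messages"]) for s in dataset]
    # Non-chat records: take whichever of prompt/text is present.
    return [s.get("prompt") or s.get("text") for s in dataset]
```

With this shape, the caller's `apply_chat_template` flag decides the behavior in both the supported and third-party paths.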
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
Force-pushed from 6a0059b to a5eb313.
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
@CodeRabbit review again.
✅ Actions performed: Review triggered.
@CodeRabbit full review
✅ Actions performed: Full review triggered.
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #851      +/-   ##
==========================================
- Coverage   73.73%   73.43%    -0.30%
==========================================
  Files         196      205       +9
  Lines       20412    22035    +1623
==========================================
+ Hits        15050    16181    +1131
- Misses       5362     5854     +492
```

☔ View full report in Codecov by Sentry.
michaelfeil left a comment:
I did install the latest commit, but it seems like this is workable! Thanks so much for the refactor.
/ok to test 6ededde
@michaelfeil You need to sign your commits with an SSH key. Please take a look at https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md#%EF%B8%8F-signing-your-work and force-push all commits (or squashed into 1) |
Can we potentially disable this check? (That's what we sometimes do in nvidia-dynamo.) But yes, if needed I can do it. I also hereby allow you to squash them; my commits are just accidentally missing --signoff.
What does this PR do?
Type of change: ?
Overview: ?
Background: the outage on `cnn_dailymail` caused a couple of issues with our platform, with 20+ customers asking about broken support for their own datasets. Going forward, we are looking to get stable support for customer-brought datasets into modelopt, because the existing options do not suffice. A hard validation is not desirable: many Baseten users will see errors like "`abisee/cnn_dailymail` is not in supported datasets" because we use modelopt 0.35.x.
Usage
```shell
llm_ptq --dataset baseten/quant_calibration_dataset_v1
# Add a code snippet demonstrating how to use this
```

Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Bug Fixes / Improvements