feat: Baseten contrib third-party dataset support (#851)
michaelfeil wants to merge 4 commits into NVIDIA:main
Conversation
📝 Walkthrough
Adds support for loading unregistered (third-party) datasets by introducing an internal helper to extract texts/messages and extending `get_dataset_samples` to delegate to it.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant get_dataset_samples
    participant SUPPORTED_CONFIG
    participant ThirdPartyHandler as _third_party_get_dataset_samples
    participant DatasetLoader
    participant Tokenizer
    Caller->>get_dataset_samples: get_dataset_samples(name, num_samples, apply_chat_template, tokenizer)
    get_dataset_samples->>SUPPORTED_CONFIG: is dataset registered?
    alt registered
        get_dataset_samples->>DatasetLoader: load via config
        DatasetLoader-->>get_dataset_samples: texts
    else not registered
        get_dataset_samples->>ThirdPartyHandler: delegate (name, num_samples, tokenizer)
        ThirdPartyHandler->>DatasetLoader: load third-party dataset
        DatasetLoader-->>ThirdPartyHandler: raw records
        alt records contain messages
            ThirdPartyHandler->>Tokenizer: require tokenizer with chat template
            Tokenizer-->>ThirdPartyHandler: formatted texts
        else records contain prompts or text column
            ThirdPartyHandler->>ThirdPartyHandler: extract prompt/text column
        else unsupported structure
            ThirdPartyHandler-->>get_dataset_samples: raise error / warn
        end
        ThirdPartyHandler-->>get_dataset_samples: texts
    end
    get_dataset_samples-->>Caller: return texts
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Force-pushed from 58008a4 to 6a0059b.
Actionable comments posted: 4
In `modelopt/torch/utils/dataset_utils.py`:
- Around lines 166-170: Fix the typo in the NotImplementedError message: change "thrid-party" to "third-party" in the f-string (f"Dataset {dataset_name} is not supported...") that references dataset_name and get_supported_datasets(), so the wording reads "third-party" correctly.
- Around lines 121-123: Fix the typo in the warning message: the warn(...) call currently reads "Loading third-party datset {dataset_name} ..."; change it to "Loading third-party dataset {dataset_name} ...", leaving the dataset_name variable and the get_supported_datasets() call unchanged.
- Around lines 152-159: The error messages say "Column {i}" although `i` is the row/sample index. Update both ValueError messages to refer to "Row {i}" or "Sample {i}" (e.g., "Row {i} in dataset {dataset_name} has no messages..." and "Row {i} in dataset {dataset_name} has empty text...") so logs accurately identify the offending sample.
- Around lines 145-164: The code ignores the num_samples parameter and iterates the entire (possibly streaming) dataset. Update the branches that build texts to respect num_samples: in the "messages" branch (the enumerate loop that calls tokenizer.apply_chat_template), stop collecting once i >= num_samples, or limit iteration with itertools.islice; in the "prompt" and "text" branches, replace the full-list comprehensions with a loop that appends each sample["prompt"] or sample["text"] and breaks when the collected count reaches num_samples. Ensure num_samples=None means no limit, and keep the existing empty-messages/empty-text validations.
🧹 Nitpick comments (2)
modelopt/torch/utils/dataset_utils.py (2)
131-141: Redundant `texts = []` initialization.
`texts` is initialized at line 131, then re-initialized at line 141 inside the `messages` branch. The first initialization at line 131 can be removed since each branch handles its own initialization. ♻️ Proposed fix

```diff
- texts = []
  if "messages" in dataset.column_names:
      if tokenizer is None:
          raise ValueError(
```
195-201: Behavior inconsistency between supported and third-party datasets regarding `apply_chat_template`.
For supported datasets, `apply_chat_template` is controlled by the caller. For third-party datasets, chat templates are automatically applied when a `messages` column exists, ignoring the `apply_chat_template` parameter. If this is intentional (since third-party datasets with `messages` columns require chat templates to be useful), consider documenting this behavior in the docstring. Otherwise, consider passing `apply_chat_template` to `_third_party_get_dataset_samples` for consistent behavior.
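The consistent-behavior option could be sketched like this. Everything here is illustrative, not modelopt's actual code: `get_samples` is a hypothetical stand-in, and the no-template fallback (joining message contents) is one possible choice, not something the PR specifies.

```python
from typing import Any


def get_samples(
    dataset: list[dict[str, Any]],
    apply_chat_template: bool,
    tokenizer: Any = None,
) -> list[str]:
    """Illustrative: honor apply_chat_template for third-party datasets too."""
    has_messages = bool(dataset) and "messages" in dataset[0]
    if has_messages and apply_chat_template:
        if tokenizer is None:
            raise ValueError("apply_chat_template=True requires a tokenizer with a chat template")
        return [tokenizer.apply_chat_template(s["messages"], tokenize=False) for s in dataset]
    if has_messages:
        # No template requested: fall back to concatenating message contents
        # (an assumed fallback, chosen here only for illustration).
        return ["\n".join(m["content"] for m in s["messages"]) for s in dataset]
    # Non-chat records: take whichever of prompt/text is present.
    return [s.get("prompt") or s.get("text") for s in dataset]
```

With this shape, the caller's `apply_chat_template` flag decides the behavior in both the supported and third-party paths.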
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
Force-pushed from 6a0059b to a5eb313.
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
@CodeRabbit review again.
✅ Actions performed: Review triggered.
@CodeRabbit full review
✅ Actions performed: Full review triggered.
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #851      +/-   ##
==========================================
- Coverage   73.73%   73.43%    -0.30%
==========================================
  Files         196      205       +9
  Lines       20412    22035    +1623
==========================================
+ Hits        15050    16181    +1131
- Misses       5362     5854     +492
```

☔ View full report in Codecov by Sentry.
michaelfeil left a comment:
I did install the latest commit, but it seems like this is workable! Thanks so much for the refactor.
/ok to test 6ededde
@michaelfeil You need to sign your commits with an SSH key. Please take a look at https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md#%EF%B8%8F-signing-your-work and force-push all commits (or squashed into 1) |
Can we potentially disable this check? (That's what we sometimes do in nvidia-dynamo.) But yes, if needed I can do it. I also hereby allow you to squash them; my commits are just accidentally missing --signoff.
What does this PR do?
Type of change: ?
Overview: ?
Background: the outage on `cnn_dailymail` caused a couple of issues with our platform, with 20+ customers asking about broken support for their own datasets. Going forward, we are looking to get stable support for customer-brought datasets into modelopt, because the existing options do not suffice. A hard validation is not desirable: many Baseten users will see errors like "`abisee/cnn_dailymail` is not in supported datasets" because we use modelopt 0.35.x.
Usage
```shell
llm_ptq --dataset baseten/quant_calibration_dataset_v1
# Add a code snippet demonstrating how to use this
```

Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Bug Fixes / Improvements