Skip to content

fix: prepare_data crashes with non shuffle split + sample_weight (#887, #1553)#1554

Merged
thinkall merged 1 commit into
microsoft:mainfrom
immu4989:flaml-fix-sample-weight-time-split
May 28, 2026
Merged

fix: prepare_data crashes with non shuffle split + sample_weight (#887, #1553)#1554
thinkall merged 1 commit into
microsoft:mainfrom
immu4989:flaml-fix-sample-weight-time-split

Conversation

@immu4989
Copy link
Copy Markdown
Contributor

Why are these changes needed?

Two related bugs in GenericTask.prepare_data that surface together whenever a user combines a non-shuffle split (split_type="time" or "group") with sample_weight:

  1. [Bug]: holdout + sample_weight + time/group split raises ValueError due to missing elif in prepare_data branching #1553 (ValueError: Found input variables with inconsistent numbers of samples)prepare_data was splitting the data twice. The if split_type == "group": branch at flaml/automl/task/generic_task.py:956 was a standalone if, so after the time/group branch ran (correctly subsetting state.fit_kwargs["sample_weight"]), the chained elif self.is_classification() / elif self.is_regression() branches also fired and tried to split the already-subset weight against the still-full X_train_all. One-character fix: chain the group branch off the preceding time branch with elif.

  2. 'AutoMLState' has no attribute 'sample_weight_all' when split_type='time' #887 (AttributeError: 'AutoMLState' has no attribute 'sample_weight_all')AutoMLState.prepare_sample_train_data reads self.sample_weight_all whenever "sample_weight" is in fit_kwargs, but the attribute was only set inside the SHUFFLE_SPLIT_TYPES branch. For split_type="time" or "group" (or any other non-shuffle path), the attribute was never created, raising the AttributeError during the final retrain step. Fix proposed by @sonichi in 'AutoMLState' has no attribute 'sample_weight_all' when split_type='time' #887 (2023): initialize state.sample_weight_all = sample_weight_full once before the shuffle branching; the SHUFFLE branch still overwrites it after shuffle(...) returns the shuffled weight.

The two bugs surface in the same call (AutoML.fit(... split_type="time", sample_weight=..., retrain_full=True)) — the ValueError fires first in prepare_data, and once that's fixed, the AttributeError becomes reachable during the retrain step. So both fixes share a single regression test.

Related issue number

Closes #887
Closes #1553

Checks

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes two related bugs in GenericTask.prepare_data that crash AutoML.fit when combining a non-shuffle split (time/group) with sample_weight. The first bug was caused by a missing elif, leading to double-splitting of the weights; the second by state.sample_weight_all only being set on the shuffle path, causing an AttributeError during final retraining.

Changes:

  • Chain the group holdout branch as elif of the time branch so the time/group/classification/regression split branches are mutually exclusive.
  • Initialize state.sample_weight_all = sample_weight_full before split branching so non-shuffle paths populate the attribute consumed by AutoMLState.prepare_sample_train_data.
  • Add a regression test exercising split_type="time" + sample_weight + retrain_full=True holdout.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
flaml/automl/task/generic_task.py Initializes sample_weight_all for all split paths and makes the group holdout branch mutually exclusive with the surrounding branches.
test/automl/test_split.py Adds regression test covering issues #887 and #1553 (time split + sample_weight + retrain_full).

Copy link
Copy Markdown
Collaborator

@thinkall thinkall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you very much for the PR, @immu4989 .

@thinkall thinkall merged commit a45f4f4 into microsoft:main May 28, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants