add drop parameter to OneHotEncoder #934
Open
karen-elisha wants to merge 1 commit into
Is your feature request related to a problem?
When `drop_last=True`, `OneHotEncoder` always drops the last category (by insertion order). Users cannot control which category is used as the reference group. In many modeling scenarios (e.g., logistic regression), the choice of the reference category matters.

What does this PR do?
Adds a new `drop` parameter to `OneHotEncoder` that lets users choose which category to drop:

- `drop="last"`: drops the last category alphabetically
- `drop="first"`: drops the first category alphabetically
- `drop="most_frequent"`: drops the most frequent category found during `fit()`

Backward compatibility
- The `drop_last` parameter continues to work as before.
- If both `drop_last` and `drop` are set, a `FutureWarning` is raised and `drop` takes precedence.

Edge case handling
- When `drop="most_frequent"` and multiple categories are tied for the highest frequency, a `UserWarning` is raised and the first category alphabetically among the tied ones is dropped.

Files changed
- `feature_engine/encoding/one_hot.py`: added `drop` parameter, validation, and fit logic
- `tests/test_encoding/test_onehot_encoder.py`: added 7 new tests covering all `drop` options

Tests
All 39 tests pass:
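
Illustration for reviewers: the selection rules described under "What does this PR do?", "Backward compatibility", and "Edge case handling" can be sketched in plain Python, independent of the library. This is not the code in `feature_engine/encoding/one_hot.py`. The helper names `choose_category_to_drop` and `resolve_drop` are hypothetical, categories are assumed to be strings, and "both set" is assumed to mean `drop_last=True` together with a non-`None` `drop`.

```python
import warnings
from collections import Counter


def choose_category_to_drop(values, drop="last"):
    """Return the category that would be left out of the encoding.

    Hypothetical helper for illustration only; assumes string categories.
    """
    counts = Counter(values)
    if not counts:
        raise ValueError("cannot choose a category from an empty column")
    categories = sorted(counts)  # alphabetical order

    if drop == "first":
        return categories[0]
    if drop == "last":
        return categories[-1]
    if drop == "most_frequent":
        top = max(counts.values())
        # Keep alphabetical order among the tied categories.
        tied = [c for c in categories if counts[c] == top]
        if len(tied) > 1:
            warnings.warn(
                f"Categories {tied} are tied for the highest frequency; "
                f"dropping {tied[0]!r}.",
                UserWarning,
            )
        return tied[0]
    raise ValueError(
        f"drop must be 'first', 'last' or 'most_frequent', got {drop!r}"
    )


def resolve_drop(drop_last=False, drop=None):
    """Sketch of the precedence rule: `drop` wins when both are set.

    Assumption: returning None means "fall back to the existing
    drop_last behaviour"; the real defaults may differ.
    """
    if drop_last and drop is not None:
        warnings.warn(
            "Both drop_last and drop were set; drop takes precedence.",
            FutureWarning,
        )
    return drop


colours = ["red", "blue", "red", "green", "blue", "red"]
print(choose_category_to_drop(colours, "first"))          # blue
print(choose_category_to_drop(colours, "last"))           # red
print(choose_category_to_drop(colours, "most_frequent"))  # red
```

In the example column, `"red"` appears three times, so it is the most frequent category; alphabetically, `"blue"` is first and `"red"` is last. With a column such as `["a", "b", "b", "a", "c"]`, `"a"` and `"b"` are tied, so the sketch warns and returns `"a"`.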