Use sklearn's ColumnTransformer for data preprocessing #735
…tion on the categorical data preprocessing pipeline
autosklearn/pipeline/components/data_preprocessing/imputation/numerical_imputation.py
autosklearn/pipeline/components/data_preprocessing/one_hot_encoding/minority_coalescer.py
autosklearn/pipeline/components/data_preprocessing/one_hot_encoding/one_hot_encoding.py
autosklearn/pipeline/components/data_preprocessing/data_preprocessing_categorical.py
autosklearn/pipeline/components/data_preprocessing/data_preprocessing_numerical.py
test/test_pipeline/components/data_preprocessing/test_one_hot_encoding.py
Part 1/2
autosklearn/pipeline/components/data_preprocessing/categorical_encoding/one_hot_encoding.py
test/test_pipeline/components/data_preprocessing/test_category_shift.py
test/test_pipeline/components/data_preprocessing/test_minority_coalescence.py
test/test_pipeline/components/data_preprocessing/test_data_preprocessing_categorical.py
Part 2/2
test/test_pipeline/components/data_preprocessing/test_data_preprocessing_numerical.py
test/test_pipeline/components/data_preprocessing/test_data_preprocessing_categorical.py
test/test_pipeline/components/data_preprocessing/test_data_preprocessing.py
…d CategoricalPreprocessingPipeline
test/test_pipeline/components/data_preprocessing/test_data_preprocessing_categorical.py
test/test_pipeline/components/data_preprocessing/test_data_preprocessing.py
This looks great now! We're almost ready to merge, but there's a unit test failing right now.

Yes, I've seen that, but I don't know why yet. Strangely, whatever modification I make, only this one unit test is affected.

@mfeurer: I found the reason behind the unit test failure. The old version of the test contained:

    ohe = OneHotEncoder(self.categorical)
    X_transformed = ohe.fit_transform(X)
    imp = SimpleImputer(copy=False)
    X_transformed = imp.fit_transform(X_transformed)
    center = not scipy.sparse.isspmatrix(X_transformed)
    standard_scaler = StandardScaler(with_mean=center)
    X_transformed = standard_scaler.fit_transform(X_transformed)
    X_transformed = X_transformed.todense()

Notice the (unusual) rescaling of the already one-hot-encoded data. This is something that the new preprocessing pipeline does not do. In the new version of the test, all the lines above were replaced by just two:

    DPP = DataPreprocessor(categorical_features=self.categorical)
    X_transformed = DPP.fit_transform(X)

If, however, we add the rescaling as in the old version...

    DPP = DataPreprocessor(categorical_features=self.categorical)
    X_transformed = DPP.fit_transform(X)
    X_transformed = StandardScaler().fit_transform(X_transformed)

... then the test passes.
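
To illustrate why the extra rescaling step changes the test's expected values, here is a minimal sketch on toy data (the array `X_cat` is made up for illustration and is not the dataset used in the auto-sklearn test). It only shows that passing one-hot-encoded columns through `StandardScaler` turns the 0/1 indicators into centered, unit-variance values, so a pipeline that skips this step necessarily produces different numbers.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: a single integer-coded categorical column.
X_cat = np.array([[0], [1], [1], [2]])

# One-hot encoding yields 0/1 indicator columns (sparse by default, densified here).
one_hot = OneHotEncoder().fit_transform(X_cat).toarray()

# Rescaling the indicators afterwards, as the old test did, centers each
# column and scales it to unit variance, so the values are no longer 0/1.
rescaled = StandardScaler().fit_transform(one_hot)

print(one_hot)
print(rescaled)
```

Any downstream comparison that was calibrated against the rescaled values will therefore fail when fed the plain indicators.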

* working version of the nested pipeline
* first moves on the direction of a column transformer autosklearn pipeline
* a working pipeline
* working and tested pipeline
* automl in progress
* mod gitignore
* more work on automl
* more work on automl
* more work on automl
* more work on automl
* more work on automl
* automl seems to be working
* Removed some unnecessary testing files
* Added some docstrings
* merged CategoryShift with CategoricalImputation to get a cleaner solution on the categorical data preprocessing pipeline
* doc string corrections
* fixed some unittests
* Unmerged category shift and categorical imputation
* Implemented some of Matthias comments
* corrected some unit tests
* added a CategoryShift implementation
* Added an OHE implementation for sparse datasets. Fixed a couple of OHE tests
* Code for the minority coalescer choice
* OHE now returns only sparse matrices (keeping the original behavior)
* Corrected some OHE unit tests
* Use the new preprocessing pipeline inside the SimpleRegressionPipeline
* fixed some unit tests
* OHE unit test adjustments
* readded dataset.pkl
* makes sure the input of the feature_type_splitter is dense
* Modifications on the FeatureTypeSplitter code due to a sklearn's ColumnTransformer bug (see sklearn issue #15627)
* Added tests for the SparseOneHotEncoder
* added tests for the CategoryShift implementation
* Added tests for the MinorityCoalescer implementation
* Added tests for CategoricalImputation
* small test adjustments
* category_shift.transform(X) now works on a copy of X
* fixed unittest
* metalearning test fixed
* metalearning test fixed
* updated all metalearning configuration.csv tables
* use of more convenient names
* cleaned last dependencies on the old 1HE
* renaming
* small fixes on test_metalearning_features
* removed the utils.datapreprocessing and corrected some unit tests
* PEP8
* OneHotEncoder now uses handle_unknown='ignore'
* PEP8
* PEP8
* added some new unit tests
* PEP8
* added missing __init__ file
* PEP8
* added tests for data_preprocessing_numerical
* added unit tests for data_preprocessing.py
* added unit tests for data_preprocessing
* PEP8
* corrected fit and transform behavior in the MinorityCoalescer implementation
* removed method fit_transformer from NumericalPreprocessingPipeline and CategoricalPreprocessingPipeline
* minor modifications suggested on @mfeurer's PR review
* minor modifications suggested by @mfeurer in his PR review
* small code simplification in DataPreprocessor
* more modifications suggested by @mfeurer in his PR review
* more modifications suggested by @mfeurer in his PR review
* PEP8 fixes
* Improvement on PreprocessingPipelineTest
* PEP8 fixes
* making sure new components return the correct data type
* fix unit test test_pca_95percent

Use sklearn's newly available ColumnTransformer to create parallel data preprocessing pipelines: one for the categorical features and another for the numerical ones.
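
The idea of two parallel preprocessing pipelines can be sketched with plain scikit-learn as follows. This is an illustrative example only, not the auto-sklearn implementation: the toy array, the column indices, the step names, and the choice of mean imputation are all assumptions made for the sketch.

```python
import numpy as np
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: columns 0 and 1 are numerical,
# column 2 is an integer-coded categorical feature.
X = np.array([
    [1.0, 10.0, 0],
    [2.0, np.nan, 1],
    [3.0, 30.0, 2],
    [4.0, 40.0, 1],
])
categorical_columns = [2]
numerical_columns = [0, 1]

# Numerical branch: impute missing values, then rescale.
numerical_pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),
    ("rescaling", StandardScaler()),
])

# Categorical branch: one-hot encode, ignoring categories unseen during fit.
categorical_pipeline = Pipeline([
    ("encoding", OneHotEncoder(handle_unknown="ignore")),
])

# Each branch only sees its own columns; the outputs are concatenated
# in the order the branches are listed (categorical first here).
preprocessor = ColumnTransformer([
    ("categorical", categorical_pipeline, categorical_columns),
    ("numerical", numerical_pipeline, numerical_columns),
])

X_transformed = preprocessor.fit_transform(X)
# ColumnTransformer may return a sparse or dense result depending on density.
if sparse.issparse(X_transformed):
    X_transformed = X_transformed.toarray()

print(X_transformed.shape)
```

Because the rescaling step lives only in the numerical branch, the one-hot indicator columns stay as 0/1 values in the combined output.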