gui-miotto
Contributor

Use sklearn's newly available ColumnTransformer to create parallel data preprocessing pipelines: one for the categorical features and another for the numerical ones.
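As a rough illustration of the idea (a minimal sketch on toy data, not the code in this PR), two sub-pipelines can be routed to different column groups with `ColumnTransformer`. The column indices, step names, and imputation strategies below are assumptions chosen for the example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (assumed): column 0 holds categorical codes, columns 1-2 are numerical.
X = np.array([
    [0.0, 1.0, 10.0],
    [1.0, np.nan, 20.0],
    [2.0, 3.0, 30.0],
    [1.0, 4.0, 40.0],
])
categorical_columns = [0]
numerical_columns = [1, 2]

# Categorical branch: impute, then one-hot encode.
categorical_pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

# Numerical branch: impute, then rescale.
numerical_pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),
    ("rescaling", StandardScaler()),
])

# Each branch only sees its own columns; the outputs are stacked side by side.
preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", categorical_pipeline, categorical_columns),
        ("numerical", numerical_pipeline, numerical_columns),
    ],
    sparse_threshold=0.0,  # always return a dense array in this sketch
)
X_transformed = preprocessor.fit_transform(X)
print(X_transformed.shape)
```

In this sketch the one-hot columns come first in the output and are left as 0/1 indicators, while only the numerical columns are standardized.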

Contributor

@mfeurer mfeurer left a comment

Part 1/2

Contributor

@mfeurer mfeurer left a comment

Part 2/2

@mfeurer
Contributor

mfeurer commented Jan 21, 2020

This looks great now! We're almost ready to merge, but there's a unit test failing right now.

@gui-miotto
Contributor Author

gui-miotto commented Jan 21, 2020

Yes, I've seen that, but I don't know why. It surprises me that, of everything I changed, only this one unit test is affected.
Any idea what may be causing it?

@gui-miotto
Contributor Author

gui-miotto commented Jan 26, 2020

@mfeurer: I found the reason behind the unit test failure.
The old version of the test transformed the input data in the following way:

ohe = OneHotEncoder(self.categorical)
X_transformed = ohe.fit_transform(X)
imp = SimpleImputer(copy=False)
X_transformed = imp.fit_transform(X_transformed)
center = not scipy.sparse.isspmatrix((X_transformed))
standard_scaler = StandardScaler(with_mean=center)
X_transformed = standard_scaler.fit_transform(X_transformed)
X_transformed = X_transformed.todense()

Notice the (unusual) rescaling of the already one-hot-encoded data. This is something that the new preprocessing pipeline does not do. In the new version of the test, all the lines above were replaced by just two:

DPP = DataPreprocessor(categorical_features=self.categorical)
X_transformed = DPP.fit_transform(X)

If, however, we add the rescaling as in the old version...

DPP = DataPreprocessor(categorical_features=self.categorical)
X_transformed = DPP.fit_transform(X)
X_transformed = StandardScaler().fit_transform(X_transformed)

... then the test passes.
Because this extra rescaling doesn't make much sense, I'll commit a fix that alters the assert value.
Let me know if you disagree.
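The effect can be reproduced on toy data. The following is a minimal sketch using plain sklearn transformers, not the auto-sklearn `DataPreprocessor` or the actual test; the input array is an assumption made up for illustration. It shows that standardizing one-hot columns replaces the 0/1 indicators with other values, which is why an assert value computed on rescaled data no longer matches.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy categorical feature (assumed) with three categories.
X_cat = np.array([[0], [1], [1], [2]])

# One-hot encoding yields 0/1 indicator columns.
one_hot = OneHotEncoder().fit_transform(X_cat).toarray()

# Rescaling the indicators (as the old test did) centers and scales each
# column, so the entries are no longer 0 or 1.
rescaled = StandardScaler().fit_transform(one_hot)

print(one_hot)
print(rescaled)
```

Any statistic computed on `rescaled` will therefore differ from the same statistic computed on `one_hot`, even though both encode the same categories.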

@mfeurer mfeurer merged commit 355fbea into automl:development Jan 29, 2020
@gui-miotto gui-miotto deleted the newonehot branch March 16, 2020 09:46
@franchuterivera franchuterivera mentioned this pull request Apr 28, 2020
franchuterivera pushed a commit to franchuterivera/auto-sklearn that referenced this pull request Aug 21, 2020
* working version of the nested pipeline

* first steps toward a column transformer autosklearn pipeline

* a working pipeline

* working and tested pipeline

* automl in progress

* mod gitignore

* more work on automl

* more work on automl

* more work on automl

* more work on automl

* more work on automl

* automl seems to be working

* Removed some unnecessary testing files

* Added some docstrings

* merged CategoryShift with CategoricalImputation to get a cleaner solution on the categorical data preprocessing pipeline

* doc string corrections

* fixed some unittests

* Unmerged category shift and categorical imputation

* Implemented some of Matthias' comments

* corrected some unit tests

* added a CategoryShift implementation

* Added an OHE implementation for sparse datasets. Fixed a couple of OHE tests

* Code for the minority coalescer choice

* OHE now returns only sparse matrices (keeping the original behavior)

* Corrected some OHE unit tests

* Use the new preprocessing pipeline inside the SimpleRegressionPipeline

* fixed some unit tests

* OHE unit test adjustments

* readded dataset.pkl

* makes sure the input of the feature_type_splitter is dense

* Modifications to the FeatureTypeSplitter code due to a bug in sklearn's ColumnTransformer (see sklearn issue #15627)

* Added tests for the SparseOneHotEncoder

* added tests for the CategoryShift implementation

* Added tests for the MinorityCoalescer implementation

* Added tests for CategoricalImputation

* small test adjustments

* category_shift.transform(X) now works on a copy of X

* fixed unittest

* metalearning test fixed

* metalearning test fixed

* updated all metalearning configuration.csv tables

* use of more convenient names

* cleaned last dependencies on the old 1HE

* renaming

* small fixes on test_metalearning_features

* removed the utils.datapreprocessing and corrected some unit tests

* PEP8

* OneHotEncoder now uses handle_unknown='ignore'

* PEP8

* PEP8

* added some new unit tests

* PEP8

* added missing __init__ file

* PEP8

* added tests for data_preprocessing_numerical

* added unit tests for data_preprocessing.py

* added unit tests for data_preprocessing

* PEP8

* corrected fit and transform behavior in the MinorityCoalescer implementation

* removed method fit_transformer from NumericalPreprocessingPipeline and CategoricalPreprocessingPipeline

* minor modifications suggested in @mfeurer's PR review

* minor modifications suggested by @mfeurer in his PR review

* small code simplification in DataPreprocessor

* more modifications suggested by @mfeurer in his PR review

* more modifications suggested by @mfeurer in his PR review

* PEP8 fixes

* Improvement on PreprocessingPipelineTest

* PEP8 fixes

* making sure new components return the correct data type

* fix unit test test_pca_95percent