Use sklearn's ColumnTransformer for data preprocessing #735

gui-miotto · 2019-10-22T18:37:21Z

Use sklearn's newly available ColumnTransformer to create parallel data preprocessing pipelines: one for the categorical features and another for the numerical ones.
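For readers unfamiliar with the idea, the snippet below is a minimal sketch using plain scikit-learn, not the auto-sklearn code from this PR. The toy data, the column indices, and the choice of steps inside each branch are assumptions made for illustration only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (assumed): column 0 is an integer-coded category,
# columns 1 and 2 are numerical and contain missing values.
X = np.array([
    [0.0, 1.0, 10.0],
    [1.0, np.nan, 20.0],
    [2.0, 3.0, 30.0],
    [1.0, 5.0, np.nan],
])
categorical_columns = [0]
numerical_columns = [1, 2]

# Branch for categorical features: impute, then one-hot encode.
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

# Branch for numerical features: impute, then rescale.
numerical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("rescale", StandardScaler()),
])

# Each branch only sees its own columns; the outputs are stacked side by side.
# sparse_threshold=0 forces a dense result so it is easy to inspect.
preprocessor = ColumnTransformer(
    [
        ("categorical", categorical_pipeline, categorical_columns),
        ("numerical", numerical_pipeline, numerical_columns),
    ],
    sparse_threshold=0,
)
X_transformed = preprocessor.fit_transform(X)
print(X_transformed.shape)
```

In this sketch the one-hot columns come first and are left as 0/1 indicators, while only the numerical columns are standardized.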

autosklearn/pipeline/components/data_preprocessing/imputation/numerical_imputation.py

autosklearn/pipeline/components/data_preprocessing/one_hot_encoding/minority_coalescer.py

autosklearn/pipeline/components/data_preprocessing/one_hot_encoding/one_hot_encoding.py

autosklearn/pipeline/regression.py

autosklearn/pipeline/components/data_preprocessing/data_preprocessing_categorical.py

autosklearn/pipeline/components/data_preprocessing/data_preprocessing_numerical.py

test/test_pipeline/components/data_preprocessing/test_one_hot_encoding.py

mfeurer

Part 1/2

autosklearn/pipeline/components/data_preprocessing/categorical_encoding/one_hot_encoding.py

autosklearn/pipeline/implementations/SparseOneHotEncoder.py

test/test_pipeline/implementations/test_SparseOneHotEncoder.py

test/test_pipeline/components/data_preprocessing/test_category_shift.py

test/test_pipeline/implementations/test_CategoryShift.py

test/test_pipeline/components/data_preprocessing/test_minority_coalescence.py

test/test_pipeline/components/data_preprocessing/test_data_preprocessing_categorical.py

mfeurer

Part 2/2

test/test_pipeline/components/data_preprocessing/test_data_preprocessing_numerical.py

test/test_pipeline/components/data_preprocessing/test_data_preprocessing_categorical.py

test/test_pipeline/components/data_preprocessing/test_data_preprocessing.py

test/test_pipeline/test_classification.py

test/test_pipeline/components/data_preprocessing/test_data_preprocessing_categorical.py

test/test_pipeline/components/data_preprocessing/test_data_preprocessing.py

autosklearn/pipeline/implementations/CategoryShift.py

mfeurer · 2020-01-21T09:57:16Z

This looks great now! We're almost ready to merge, but there's a unit test failing right now.

gui-miotto · 2020-01-21T15:46:30Z

Yes, I've seen that, but I don't know why it fails yet. It surprises me that, whatever modification I made, it affects only this one unit test.
Any idea what may be causing it?

gui-miotto · 2020-01-26T17:18:58Z

@mfeurer: I found the reason for the unit test failure.
The old version of the test transformed the input data in the following way:

ohe = OneHotEncoder(self.categorical)
X_transformed = ohe.fit_transform(X)
imp = SimpleImputer(copy=False)
X_transformed = imp.fit_transform(X_transformed)
center = not scipy.sparse.isspmatrix((X_transformed))
standard_scaler = StandardScaler(with_mean=center)
X_transformed = standard_scaler.fit_transform(X_transformed)
X_transformed = X_transformed.todense()

Notice the (unusual) rescaling of the already one-hot-encoded data. This is something that the new preprocessing pipeline does not do. In the new version of the test, all the lines above were replaced by just two:

DPP = DataPreprocessor(categorical_features=self.categorical)
X_transformed = DPP.fit_transform(X)

If, however, we add the rescaling as in the old version...

DPP = DataPreprocessor(categorical_features=self.categorical)
X_transformed = DPP.fit_transform(X)
X_transformed = StandardScaler().fit_transform(X_transformed)

... then the test passes.
Because this extra rescaling doesn't make much sense, I'll commit a fix that alters the assert value.
Let me know if you disagree.
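To illustrate why the expected value in the assert changes, here is a small sketch using plain scikit-learn (the toy column is an assumption, and this is not the test from the PR). It shows that passing one-hot output through a StandardScaler replaces the 0/1 indicators with other values, both in the dense case and in the sparse case where centering is disabled with with_mean=False.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy categorical column (assumed) with three categories.
X_cat = np.array([[0], [1], [1], [2]])

# OneHotEncoder returns a sparse matrix by default.
one_hot_sparse = OneHotEncoder().fit_transform(X_cat)
one_hot = one_hot_sparse.toarray()

# Dense case: centering and scaling turn the indicators into
# zero-mean, unit-variance columns.
rescaled_dense = StandardScaler().fit_transform(one_hot)

# Sparse case: centering is not possible without densifying, so only
# the division by the standard deviation is applied.
rescaled_sparse = StandardScaler(with_mean=False).fit_transform(one_hot_sparse)

print(one_hot[0])
print(rescaled_dense[0])
print(rescaled_sparse.toarray()[0])
```

Any value computed downstream of the preprocessing therefore depends on whether this extra rescaling step is present.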

@mfeurer

* working version of the nested pipeline
* first moves on the direction of a column transformer autosklearn pipeline
* a working pipeline
* working and tested pipeline
* automl in progress
* mod gitignore
* more work on automl
* more work on automl
* more work on automl
* more work on automl
* more work on automl
* automl seems to be working
* Removed some unnecessary testing files
* Added some docstrings
* merged CategoryShift with CategoricalImputation to get a cleaner solution on the categorical data preprocessing pipeline
* doc string corrections
* fixed some unittests
* Unmerged category shift and categorical imputation
* Implemented some of Matthias comments
* corrected some unit tests
* added a CategoryShift implementation
* Added an OHE implementation for sparse datasets. Fixed a couple of OHE tests
* Code for the minority coalescer choice
* OHE now returns only sparse matrices (keeping the original behavior)
* Corrected some OHE unit tests
* Use the new preprocessing pipeline inside the SimpleRegressionPipeline
* fixed some unit tests
* OHE unit test adjustments
* readded dataset.pkl
* makes sure the input of the feature_type_splitter is dense
* Modifications on the FeatureTypeSplitter code due to a sklearn's ColumnTransformer bug (see sklearn issue #15627)
* Added tests for the SparseOneHotEncoder
* added tests for the CategoryShift implementation
* Added tests for the MinorityCoalescer implementation
* Added tests for CategoricalImputation
* small test adjustments
* category_shift.transform(X) now works on a copy of X
* fixed unittest
* metalearning test fixed
* metalearning test fixed
* updated all metalearning configuration.csv tables
* use of more convinient names
* cleaned last dependencies on the old 1HE
* renaming
* small fixes on test_metalearning_features
* removed the utils.datapreprocessing and corrected some unit tests
* PEP8
* OneHotEncoder now uses handle_unknown='ignore'
* PEP8
* PEP8
* added some new unit tests
* PEP8
* added missing __init__ file
* PEP8
* added tests for data_preprocessing_numerical
* added unit tests for data_preprocessing.py
* added unit tests for data_preprocessing
* PEP8
* corrected fit and transform behavior in the MinorityCoalescer implementation
* removed method fit_transformer from NumericalPreprocessingPipeline and CategoricalPreprocessingPipeline
* minor modifications suggested on @mfeurer's PR review
* minor modifications suggested by @mfeurer in his PR review
* small code simplification in DataPreprocessor
* more modifications suggested by @mfeurer in his PR review
* more modifications suggested by @mfeurer in his PR review
* PEP8 fixes
* Improvemnt on PreprocessingPipelineTest
* PEP8 fixes
* making sure new components return the correct data type
* fix unit test test_pca_95percent

gui-miotto added 17 commits September 17, 2019 14:37

working version of the nested pipeline

7201fc0

first moves on the direction of a column transformer autosklearn pipeline

00c740d

a working pipeline

630e0c4

working and tested pipeline

71f1f97

automl in progress

b410eca

mod gitignore

fbff34e

more work on automl

1191bee

more work on automl

2019490

more work on automl

e49b505

more work on automl

572162c

more work on automl

21962a8

automl seems to be working

257351c

Removed some unnecessary testing files

d982034

Added some docstrings

e9bd84d

merged CategoryShift with CategoricalImputation to get a cleaner solution on the categorical data preprocessing pipeline

50ed91e

doc string corrections

2270497

fixed some unittests

12d2349

mfeurer reviewed Oct 31, 2019

gui-miotto added 5 commits November 4, 2019 15:29

Unmerged category shift and categorical imputation

742d980

Implemented some of Matthias comments

ddab5ac

corrected some unit tests

0bf3f48

added a CategoryShift implementation

322a444

Added an OHE implementation for sparse datasets. Fixed a couple of OHE tests

a89564d

gui-miotto commented Nov 6, 2019

test/test_pipeline/components/data_preprocessing/test_one_hot_encoding.py (resolved)

Code for the minority coalescer choice

87c0a5a

gui-miotto mentioned this pull request Nov 8, 2019

Unused hyperparameters remain active when datasets are purely categorical or purely numerical #741

Closed

gui-miotto added 4 commits November 11, 2019 14:30

OHE now returns only sparse matrices (keeping the original behavior)

225b8bd

Corrected some OHE unit tests

d689e33

Use the new preprocessing pipeline inside the SimpleRegressionPipeline

c5cabda

fixed some unit tests

538c8b7

gui-miotto added 3 commits December 10, 2019 18:25

added unit tests for data_preprocessing.py

87b7608

added unit tests for data_preprocessing

0d8ca34

PEP8

66454dc

mfeurer requested changes Jan 3, 2020

mfeurer requested changes Jan 7, 2020

gui-miotto added 6 commits January 10, 2020 17:37

corrected fit and transform behavior in the MinorityCoalescer implementation

de8f056

removed method fit_transformer from NumericalPreprocessingPipeline and CategoricalPreprocessingPipeline

404708b

minor modifications suggested on @mfeurer's PR review

b526749

minor modifications suggested by @mfeurer in his PR review

7847d5c

small code simplification in DataPreprocessor

154596c

more modifications suggested by @mfeurer in his PR review

0e97ca4

mfeurer reviewed Jan 15, 2020

test/test_pipeline/components/data_preprocessing/test_data_preprocessing_categorical.py (outdated, resolved)

mfeurer reviewed Jan 15, 2020

test/test_pipeline/components/data_preprocessing/test_data_preprocessing_categorical.py (outdated, resolved)

mfeurer reviewed Jan 15, 2020

test/test_pipeline/components/data_preprocessing/test_data_preprocessing.py (outdated, resolved)

mfeurer reviewed Jan 15, 2020

autosklearn/pipeline/implementations/CategoryShift.py (outdated, resolved)

gui-miotto added 5 commits January 19, 2020 22:12

more modifications suggested by @mfeurer in his PR review

c61f024

PEP8 fixes

654a8f6

Improvemnt on PreprocessingPipelineTest

05f37f0

PEP8 fixes

561f38d

making sure new components return the correct data type

e764cd9

fix unit test test_pca_95percent

a0f32a7

mfeurer approved these changes Jan 29, 2020

mfeurer merged commit 355fbea into automl:development Jan 29, 2020

mfeurer mentioned this pull request Mar 12, 2020

Use new ColumnSelector instead of custom OneHotEncoder #705

Closed

gui-miotto deleted the newonehot branch March 16, 2020 09:46

franchuterivera mentioned this pull request Apr 28, 2020

Release note 070 #842

Merged
