Merged
70 commits
7201fc0
working version of the nested pipeline
gui-miotto Sep 17, 2019
00c740d
first moves on the direction of a column transformer autosklearn pipe…
gui-miotto Oct 2, 2019
630e0c4
a working pipeline
gui-miotto Oct 9, 2019
71f1f97
working and tested pipeline
gui-miotto Oct 16, 2019
b410eca
automl in progress
gui-miotto Oct 19, 2019
fbff34e
mod gitignore
gui-miotto Oct 19, 2019
1191bee
more work on automl
gui-miotto Oct 19, 2019
2019490
more work on automl
gui-miotto Oct 19, 2019
e49b505
more work on automl
gui-miotto Oct 19, 2019
572162c
more work on automl
gui-miotto Oct 19, 2019
21962a8
more work on automl
gui-miotto Oct 22, 2019
257351c
automl seems to be working
gui-miotto Oct 22, 2019
d982034
Removed some unnecessary testing files
gui-miotto Oct 27, 2019
e9bd84d
Added some docstrings
gui-miotto Oct 27, 2019
50ed91e
merged CategoryShift with CategoricalImputation to get a cleaner solu…
gui-miotto Oct 28, 2019
2270497
doc string corrections
gui-miotto Oct 28, 2019
12d2349
fixed some unittests
gui-miotto Oct 31, 2019
742d980
Unmerged category shift and categorical imputation
gui-miotto Nov 4, 2019
ddab5ac
Implemented some of Matthias comments
gui-miotto Nov 5, 2019
0bf3f48
corrected some unit tests
gui-miotto Nov 5, 2019
322a444
added a CategoryShift implementation
gui-miotto Nov 5, 2019
a89564d
Added an OHE implementation for sparse datasets. Fixed a couple of O…
gui-miotto Nov 6, 2019
87c0a5a
Code for the minority coalescer choice
gui-miotto Nov 6, 2019
225b8bd
OHE now returns only sparse matrices (keeping the original behavior)
gui-miotto Nov 11, 2019
d689e33
Corrected some OHE unit tests
gui-miotto Nov 12, 2019
c5cabda
Use the new preprocessing pipeline inside the SimpleRegressionPipeline
gui-miotto Nov 12, 2019
538c8b7
fixed some unit tests
gui-miotto Nov 12, 2019
8674c2c
OHE unit test adjustments
gui-miotto Nov 12, 2019
f8e4cfc
readded dataset.pkl
gui-miotto Nov 13, 2019
4d04298
makes sure the input of the feature_type_splitter is dense
gui-miotto Nov 14, 2019
add2611
Modifications on the FeatureTypeSplitter code due to a sklearn's Colu…
gui-miotto Nov 17, 2019
b62d996
Added tests for the SparseOneHotEncoder
gui-miotto Nov 18, 2019
888d3eb
added tests for the CategoryShift implementation
gui-miotto Nov 20, 2019
7eff044
Added tests for the MinorityCoalescer implementation
gui-miotto Nov 27, 2019
be69781
Added tests for CategoricalImputation
gui-miotto Nov 27, 2019
f74cc74
small test adjustments
gui-miotto Nov 27, 2019
5ac6eaa
category_shift.transform(X) now works on a copy of X
gui-miotto Nov 28, 2019
0177853
fixed unittest
gui-miotto Nov 28, 2019
2cf668d
metalearning test fixed
gui-miotto Dec 1, 2019
6c8f931
metalearning test fixed
gui-miotto Dec 1, 2019
a2e2e9a
updated all metalearning configuration.csv tables
gui-miotto Dec 1, 2019
e1f3cbb
use of more convenient names
gui-miotto Dec 2, 2019
d28041b
cleaned last dependencies on the old 1HE
gui-miotto Dec 2, 2019
0692e8f
renaming
gui-miotto Dec 4, 2019
e40ca18
small fixes on test_metalearning_features
gui-miotto Dec 4, 2019
2f3a1b8
removed the utils.datapreprocessing and corrected some unit tests
gui-miotto Dec 8, 2019
f50e740
PEP8
gui-miotto Dec 8, 2019
ba7efe5
OneHotEncoder now uses handle_unknown='ignore'
gui-miotto Dec 8, 2019
a997d96
PEP8
gui-miotto Dec 9, 2019
96c61b3
PEP8
gui-miotto Dec 9, 2019
2152afa
added some new unit tests
gui-miotto Dec 9, 2019
124324b
PEP8
gui-miotto Dec 10, 2019
0c4f237
added missing __init__ file
gui-miotto Dec 10, 2019
76f0387
PEP8
gui-miotto Dec 10, 2019
6e2cfdc
added tests for data_preprocessing_numerical
gui-miotto Dec 10, 2019
87b7608
added unit tests for data_preprocessing.py
gui-miotto Dec 10, 2019
0d8ca34
added unit tests for data_preprocessing
gui-miotto Dec 10, 2019
66454dc
PEP8
gui-miotto Dec 10, 2019
de8f056
corrected fit and transform behavior in the MinorityCoalescer impleme…
gui-miotto Jan 10, 2020
404708b
removed method fit_transformer from NumericalPreprocessingPipeline an…
gui-miotto Jan 10, 2020
b526749
minor modifications suggested on @mfeurer's PR review
gui-miotto Jan 10, 2020
7847d5c
minor modifications suggested by @mfeurer in his PR review
gui-miotto Jan 12, 2020
154596c
small code simplification in DataPreprocessor
gui-miotto Jan 12, 2020
0e97ca4
more modifications suggested by @mfeurer in his PR review
gui-miotto Jan 14, 2020
c61f024
more modifications suggested by @mfeurer in his PR review
gui-miotto Jan 19, 2020
654a8f6
PEP8 fixes
gui-miotto Jan 20, 2020
05f37f0
Improvement on PreprocessingPipelineTest
gui-miotto Jan 20, 2020
561f38d
PEP8 fixes
gui-miotto Jan 20, 2020
e764cd9
making sure new components return the correct data type
gui-miotto Jan 20, 2020
a0f32a7
fix unit test test_pca_95percent
gui-miotto Jan 26, 2020
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -53,3 +53,4 @@ number_submission
.pypirc
dmypy.json
*.log
.noseids
2 changes: 1 addition & 1 deletion autosklearn/automl.py
Original file line number Diff line number Diff line change
Expand Up @@ -148,7 +148,7 @@ def fit(
metric: Scorer,
X_test: Optional[np.ndarray] = None,
y_test: Optional[np.ndarray] = None,
feat_type: Optional[List[bool]] = None,
feat_type: Optional[List[str]] = None,
dataset_name: Optional[str] = None,
only_return_configuration_space: Optional[bool] = False,
load_models: bool = True,
Expand Down
9 changes: 4 additions & 5 deletions autosklearn/data/abstract_data_manager.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
# -*- encoding: utf-8 -*-
import abc
import numpy as np
import scipy.sparse

from autosklearn.pipeline.implementations.OneHotEncoder import OneHotEncoder
from autosklearn.pipeline.components.data_preprocessing.data_preprocessing \
import DataPreprocessor
from autosklearn.util import predict_RAM_usage


Expand All @@ -16,9 +16,8 @@ def perform_one_hot_encoding(sparse, categorical, data):

rvals = []
if any(categorical):
encoder = OneHotEncoder(categorical_features=categorical,
dtype=np.float32,
sparse=sparse)
encoder = DataPreprocessor(
categorical_features=categorical, force_sparse_output=sparse)
rvals.append(encoder.fit_transform(data[0]))
for d in data[1:]:
rvals.append(encoder.transform(d))
Expand Down
2 changes: 1 addition & 1 deletion autosklearn/data/xy_data_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,6 @@ def __init__(self, X, y, X_test, y_test, task, feat_type, dataset_name):
if self.feat_type is None:
self.feat_type = ['Numerical'] * X.shape[1]
if X.shape[1] != len(self.feat_type):
raise ValueError('X and feat type must have the same dimensions, '
raise ValueError('X and feat_type must have the same number of columns, '
'but are %d and %d.' %
(X.shape[1], len(self.feat_type)))
2 changes: 1 addition & 1 deletion autosklearn/evaluation/abstract_evaluator.py
Original file line number Diff line number Diff line change
Expand Up @@ -156,7 +156,7 @@ def __init__(self, backend, queue, metric,
raise ValueError(feat)
if np.sum(categorical_mask) > 0:
self._init_params = {
'categorical_encoding:one_hot_encoding:categorical_features':
'data_preprocessing:categorical_features':
categorical_mask
}
else:
Expand Down
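Note on the `feat_type` changes above: the type hint moves from `List[bool]` to `List[str]`, and the evaluator now hands the resulting mask to the pipeline under the key `data_preprocessing:categorical_features`. A minimal sketch of that conversion follows. The helper name `build_categorical_mask`, the case-insensitive matching, and the empty dict used when no column is categorical are assumptions for illustration; the diff only shows the `'Numerical'` default, the `ValueError(feat)` fallback, and the new key name.

```python
from typing import List, Optional


def build_categorical_mask(feat_type: Optional[List[str]],
                           n_columns: int) -> List[bool]:
    """Turn per-column type names into a boolean mask (True = categorical)."""
    if feat_type is None:
        # Same default as XYDataManager in this diff: every column is numerical.
        feat_type = ['Numerical'] * n_columns
    if len(feat_type) != n_columns:
        raise ValueError('X and feat_type must have the same number of columns, '
                         'but are %d and %d.' % (n_columns, len(feat_type)))
    mask = []
    for feat in feat_type:
        kind = feat.lower()  # assumption: matching is case-insensitive
        if kind == 'categorical':
            mask.append(True)
        elif kind == 'numerical':
            mask.append(False)
        else:
            raise ValueError(feat)
    return mask


mask = build_categorical_mask(['Numerical', 'Categorical', 'Numerical'], 3)
# The else branch is not visible in the diff; an empty dict is an assumption.
init_params = ({'data_preprocessing:categorical_features': mask}
               if any(mask) else {})
```

The mask has one entry per column of `X`, which is why the reworded error message talks about the number of columns rather than dimensions.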

Large diffs are not rendered by default.


17 changes: 5 additions & 12 deletions autosklearn/metalearning/metafeatures/metafeatures.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,11 +12,11 @@
import sklearn.model_selection
from sklearn.utils import check_array
from sklearn.multiclass import OneVsRestClassifier

from sklearn.impute import SimpleImputer
from autosklearn.pipeline.implementations.OneHotEncoder import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from autosklearn.pipeline.components.data_preprocessing.data_preprocessing \
import DataPreprocessor
from autosklearn.util.logging_ import get_logger
from .metafeature import MetaFeature, HelperFunction, DatasetMetafeatures, \
MetaFeatureValue
Expand Down Expand Up @@ -947,16 +947,9 @@ def calculate_all_metafeatures(X, y, categorical, dataset_name,
# TODO make sure this is done as efficiently as possible (no copy for
# sparse matrices because of wrong sparse format)
sparse = scipy.sparse.issparse(X)
if any(categorical):
ohe = OneHotEncoder(categorical_features=categorical, sparse=True)
X_transformed = ohe.fit_transform(X)
else:
X_transformed = X
imputer = SimpleImputer(strategy='mean', copy=False)
X_transformed = imputer.fit_transform(X_transformed)
center = not scipy.sparse.isspmatrix(X_transformed)
standard_scaler = StandardScaler(copy=False, with_mean=center)
X_transformed = standard_scaler.fit_transform(X_transformed)
DPP = DataPreprocessor(
categorical_features=categorical, force_sparse_output=True)
X_transformed = DPP.fit_transform(X)
categorical_transformed = [False] * X_transformed.shape[1]

# Densify the transformed matrix
Expand Down
4 changes: 2 additions & 2 deletions autosklearn/pipeline/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -193,15 +193,15 @@ def set_hyperparameters(self, configuration, init_params=None):
else:
sub_init_params_dict = None

if isinstance(node, (AutoSklearnChoice, AutoSklearnComponent)):
if isinstance(node, (AutoSklearnChoice, AutoSklearnComponent, BasePipeline)):
node.set_hyperparameters(configuration=sub_configuration,
init_params=sub_init_params_dict)
else:
raise NotImplementedError('Not supported yet!')

return self

def get_hyperparameter_search_space(self):
def get_hyperparameter_search_space(self, dataset_properties=None):
"""Return the configuration space for the CASH problem.

Returns
Expand Down
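Note on the `set_hyperparameters` hunk above: allowing `BasePipeline` nodes means a pipeline step can itself be a pipeline, so prefixed keys such as `data_preprocessing:categorical_features` have to be peeled one level at a time. The sketch below shows one plausible way a `sub_init_params_dict` could be derived. The helper `sub_params_for` and the nested key `data_preprocessing:numerical:strategy` are hypothetical; the diff does not show how the real dictionary is built.

```python
from typing import Any, Dict, Optional


def sub_params_for(step_name: str,
                   params: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Keep the keys that start with '<step_name>:' and strip that prefix."""
    if params is None:
        return None
    prefix = step_name + ':'
    return {key[len(prefix):]: value
            for key, value in params.items()
            if key.startswith(prefix)}


init_params = {
    'data_preprocessing:categorical_features': [False, True],
    'classifier:class_weight': 'balanced',
    # Hypothetical two-level key, for illustration only.
    'data_preprocessing:numerical:strategy': 'mean',
}

# First level: what the 'data_preprocessing' step would receive.
outer = sub_params_for('data_preprocessing', init_params)
# Second level: what a nested 'numerical' step inside it would receive.
inner = sub_params_for('numerical', outer)
```

Applying the same function at every level keeps the routing rule identical for choices, components, and nested pipelines.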
69 changes: 27 additions & 42 deletions autosklearn/pipeline/classification.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,20 +8,14 @@
from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.forbidden import ForbiddenEqualsClause, ForbiddenAndConjunction

from autosklearn.pipeline.components.data_preprocessing.data_preprocessing \
import DataPreprocessor
from autosklearn.pipeline.components import classification as \
classification_components
from autosklearn.pipeline.components.data_preprocessing import rescaling as \
rescaling_components
from autosklearn.pipeline.components.data_preprocessing.balancing.balancing import \
Balancing
from autosklearn.pipeline.components.data_preprocessing.imputation.imputation \
import Imputation
from autosklearn.pipeline.components.data_preprocessing.one_hot_encoding \
import OHEChoice
from autosklearn.pipeline.components import feature_preprocessing as \
feature_preprocessing_components
from autosklearn.pipeline.components.data_preprocessing.variance_threshold.variance_threshold \
import VarianceThreshold
from autosklearn.pipeline.base import BasePipeline
from autosklearn.pipeline.constants import SPARSE

Expand All @@ -41,7 +35,7 @@ class SimpleClassificationPipeline(ClassifierMixin, BasePipeline):

Parameters
----------
configuration : ConfigSpace.configuration_space.Configuration
config : ConfigSpace.configuration_space.Configuration
The configuration to evaluate.

random_state : int, RandomState instance or None, optional (default=None)
Expand Down Expand Up @@ -91,7 +85,7 @@ def fit_transformer(self, X, y, fit_params=None):
balancing = Balancing(strategy='weighting')
_init_params, _fit_params = balancing.get_weights(
y, self.configuration['classifier:__choice__'],
self.configuration['preprocessor:__choice__'],
self.configuration['feature_preprocessor:__choice__'],
{}, {})
_init_params.update(self._init_params)
self.set_hyperparameters(configuration=self.configuration,
Expand Down Expand Up @@ -181,7 +175,7 @@ def _get_hyperparameter_search_space(self, include=None, exclude=None,
exclude=exclude, include=include, pipeline=self.steps)

classifiers = cs.get_hyperparameter('classifier:__choice__').choices
preprocessors = cs.get_hyperparameter('preprocessor:__choice__').choices
preprocessors = cs.get_hyperparameter('feature_preprocessor:__choice__').choices
available_classifiers = self._final_estimator.get_available_components(
dataset_properties)

Expand All @@ -197,23 +191,21 @@ def _get_hyperparameter_search_space(self, include=None, exclude=None,
if 'densifier' in preprocessors:
while True:
try:
forb_cls = ForbiddenEqualsClause(
cs.get_hyperparameter('classifier:__choice__'), key)
forb_fpp = ForbiddenEqualsClause(cs.get_hyperparameter(
'feature_preprocessor:__choice__'), 'densifier')
cs.add_forbidden_clause(
ForbiddenAndConjunction(
ForbiddenEqualsClause(
cs.get_hyperparameter(
'classifier:__choice__'), key),
ForbiddenEqualsClause(
cs.get_hyperparameter(
'preprocessor:__choice__'), 'densifier')
))
ForbiddenAndConjunction(forb_cls, forb_fpp))
# Success
break
except ValueError:
# Change the default and try again
try:
default = possible_default_classifier.pop()
except IndexError:
raise ValueError("Cannot find a legal default configuration.")
raise ValueError(
"Cannot find a legal default configuration.")
cs.get_hyperparameter(
'classifier:__choice__').default_value = default
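Note on the retry loop above: the hunk relies on a `ValueError` being raised when a forbidden clause would rule out the current default configuration, and it then swaps in another default classifier and tries again. The sketch below reproduces that control flow without ConfigSpace. `TinySpace`, `forbid_with_fallback`, and the classifier names are hypothetical stand-ins, not part of the PR or of ConfigSpace.

```python
from typing import List, Set, Tuple


class TinySpace:
    """Hypothetical stand-in for a configuration space with two choices."""

    def __init__(self, default_classifier: str, default_preprocessor: str) -> None:
        self.default_classifier = default_classifier
        self.default_preprocessor = default_preprocessor
        self.forbidden: Set[Tuple[str, str]] = set()

    def add_forbidden_pair(self, classifier: str, preprocessor: str) -> None:
        # Mirror the behaviour the diff relies on: forbidding the current
        # default configuration is an error.
        default = (self.default_classifier, self.default_preprocessor)
        if (classifier, preprocessor) == default:
            raise ValueError('default configuration would be forbidden')
        self.forbidden.add((classifier, preprocessor))


def forbid_with_fallback(space: TinySpace, classifier: str, preprocessor: str,
                         fallback_defaults: List[str]) -> None:
    """Add a forbidden pair, changing the default classifier if needed."""
    while True:
        try:
            space.add_forbidden_pair(classifier, preprocessor)
            break  # success
        except ValueError:
            # Change the default and try again.
            try:
                space.default_classifier = fallback_defaults.pop()
            except IndexError:
                raise ValueError('Cannot find a legal default configuration.')


space = TinySpace(default_classifier='clf_a', default_preprocessor='densifier')
forbid_with_fallback(space, 'clf_a', 'densifier', ['clf_c', 'clf_b'])
```

In this run the first attempt fails because `('clf_a', 'densifier')` is the default, the default classifier becomes `'clf_b'`, and the second attempt succeeds.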

Expand All @@ -236,7 +228,7 @@ def _get_hyperparameter_search_space(self, include=None, exclude=None,
ForbiddenEqualsClause(cs.get_hyperparameter(
"classifier:__choice__"), c),
ForbiddenEqualsClause(cs.get_hyperparameter(
"preprocessor:__choice__"), f)))
"feature_preprocessor:__choice__"), f)))
break
except KeyError:
break
Expand Down Expand Up @@ -265,7 +257,7 @@ def _get_hyperparameter_search_space(self, include=None, exclude=None,
try:
cs.add_forbidden_clause(ForbiddenAndConjunction(
ForbiddenEqualsClause(cs.get_hyperparameter(
"preprocessor:__choice__"), f),
"feature_preprocessor:__choice__"), f),
ForbiddenEqualsClause(cs.get_hyperparameter(
"classifier:__choice__"), c)))
break
Expand All @@ -290,27 +282,20 @@ def _get_pipeline(self):

default_dataset_properties = {'target_type': 'classification'}

# Add the always active preprocessing components

steps.extend(
[["categorical_encoding", OHEChoice(default_dataset_properties)],
["imputation", Imputation()],
["variance_threshold", VarianceThreshold()],
["rescaling",
rescaling_components.RescalingChoice(default_dataset_properties)],
["balancing", Balancing()]])

# Add the preprocessing component
steps.append(['preprocessor',
feature_preprocessing_components.FeaturePreprocessorChoice(
default_dataset_properties)])

# Add the classification component
steps.append(['classifier',
classification_components.ClassifierChoice(
default_dataset_properties)])
steps.extend([
["data_preprocessing",
DataPreprocessor(dataset_properties=default_dataset_properties)],
["balancing",
Balancing()],
["feature_preprocessor",
feature_preprocessing_components.FeaturePreprocessorChoice(
default_dataset_properties)],
['classifier',
classification_components.ClassifierChoice(
default_dataset_properties)]
])

return steps

def _get_estimator_hyperparameter_name(self):
return "classifier"

Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ def get_weights(self, Y, classifier, preprocessor, init_params, fit_params):
if classifier in clf_:
fit_params['classifier:sample_weight'] = sample_weights
if preprocessor in pre_:
fit_params['preprocessor:sample_weight'] = sample_weights
fit_params['feature_preprocessor:sample_weight'] = sample_weights

# Classifiers which can adjust sample weights themselves via the
# argument `class_weight`
Expand All @@ -66,7 +66,7 @@ def get_weights(self, Y, classifier, preprocessor, init_params, fit_params):
if classifier in clf_:
init_params['classifier:class_weight'] = 'balanced'
if preprocessor in pre_:
init_params['preprocessor:class_weight'] = 'balanced'
init_params['feature_preprocessor:class_weight'] = 'balanced'

clf_ = ['ridge']
if classifier in clf_:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -94,4 +94,4 @@ def set_hyperparameters(self, configuration, init_params=None):
return self

def transform(self, X):
return self.choice.transform(X)
return self.choice.transform(X)
Original file line number Diff line number Diff line change
@@ -1,15 +1,7 @@
import numpy as np

import autosklearn.pipeline.implementations.OneHotEncoder

from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter, \
UniformFloatHyperparameter
from ConfigSpace.conditions import EqualsCondition

from autosklearn.pipeline.components.base import \
AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import *
from autosklearn.pipeline.constants import DENSE, SPARSE, UNSIGNED_DATA, INPUT


class NoEncoding(AutoSklearnPreprocessingAlgorithm):
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
import scipy.sparse

from sklearn.preprocessing import OneHotEncoder as DenseOneHotEncoder

from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.pipeline.implementations.SparseOneHotEncoder import SparseOneHotEncoder
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import DENSE, SPARSE, UNSIGNED_DATA, INPUT


class OneHotEncoder(AutoSklearnPreprocessingAlgorithm):
def __init__(self, random_state=None):
self.random_state = random_state

def fit(self, X, y=None):
if scipy.sparse.issparse(X):
self.preprocessor = SparseOneHotEncoder()
else:
self.preprocessor = DenseOneHotEncoder(
sparse=False, categories='auto', handle_unknown='ignore')
self.preprocessor.fit(X, y)
return self

def transform(self, X):
if self.preprocessor is None:
raise NotImplementedError()
return self.preprocessor.transform(X)

def fit_transform(self, X, y=None):
return self.fit(X, y).transform(X)

@staticmethod
def get_properties(dataset_properties=None):
return {'shortname': '1Hot',
'name': 'One Hot Encoder',
'handles_regression': True,
'handles_classification': True,
'handles_multiclass': True,
'handles_multilabel': True,
# TODO find out if this is right!
'handles_sparse': True,
'handles_dense': True,
'input': (DENSE, SPARSE, UNSIGNED_DATA),
'output': (INPUT,), }

@staticmethod
def get_hyperparameter_search_space(dataset_properties=None):
return ConfigurationSpace()
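Note on the new `OneHotEncoder` component above: the dense branch passes `handle_unknown='ignore'`, which means a category that was not seen during `fit` is encoded as all zeros instead of raising an error. The pure-Python sketch below illustrates that behaviour only. `TinyOneHot` is a hypothetical teaching class, not the scikit-learn encoder or the `SparseOneHotEncoder` from this PR.

```python
from typing import Hashable, List, Sequence


class TinyOneHot:
    """Illustrative one-hot encoder; unknown categories encode as all zeros."""

    def fit(self, rows: Sequence[Sequence[Hashable]]) -> 'TinyOneHot':
        n_columns = len(rows[0])
        # One sorted list of observed categories per input column.
        self.categories_: List[List[Hashable]] = [
            sorted({row[j] for row in rows}) for j in range(n_columns)]
        return self

    def transform(self, rows: Sequence[Sequence[Hashable]]) -> List[List[float]]:
        encoded = []
        for row in rows:
            out: List[float] = []
            for value, categories in zip(row, self.categories_):
                # An unseen value matches no category, so its block stays zero.
                out.extend(1.0 if value == cat else 0.0 for cat in categories)
            encoded.append(out)
        return encoded


train = [[3, 5], [4, 5]]
test = [[3, 5], [9, 5]]  # 9 never appeared in the first column of train

encoder = TinyOneHot().fit(train)
train_encoded = encoder.transform(train)
test_encoded = encoder.transform(test)
```

This matches the calling pattern in `perform_one_hot_encoding`, where the encoder is fitted on the first dataset and only `transform` is applied to the remaining ones, so unseen categories in later datasets are a realistic case.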
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
import autosklearn.pipeline.implementations.CategoryShift

from ConfigSpace.configuration_space import ConfigurationSpace
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import DENSE, SPARSE, UNSIGNED_DATA, INPUT


class CategoryShift(AutoSklearnPreprocessingAlgorithm):
""" Add 3 to every category.
Further down the pipeline, category 2 will be attributed to missing values,
category 1 will be assigned to low-occurrence categories, and category 0
is left unused to provide compatibility with sparse matrices.
"""

def __init__(self, random_state=None):
pass

def fit(self, X, y=None):
self.preprocessor = autosklearn.pipeline.implementations.CategoryShift\
.CategoryShift()
self.preprocessor.fit(X, y)
return self

def transform(self, X):
if self.preprocessor is None:
raise NotImplementedError()
return self.preprocessor.transform(X)

def fit_transform(self, X, y=None):
return self.fit(X, y).transform(X)

@staticmethod
def get_properties(dataset_properties=None):
return {'shortname': 'CategShift',
'name': 'Category Shift',
'handles_missing_values': True,
'handles_nominal_values': True,
'handles_numerical_features': True,
'prefers_data_scaled': False,
'prefers_data_normalized': False,
'handles_regression': True,
'handles_classification': True,
'handles_multiclass': True,
'handles_multilabel': True,
'is_deterministic': True,
# TODO find out if this is right!
'handles_sparse': True,
'handles_dense': True,
'input': (DENSE, SPARSE, UNSIGNED_DATA),
'output': (INPUT,),
'preferred_dtype': None}

@staticmethod
def get_hyperparameter_search_space(dataset_properties=None):
return ConfigurationSpace()
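Note on the `CategoryShift` docstring above: shifting every category by 3 frees the values 0, 1 and 2 for later steps. The sketch below walks through that idea on plain lists. The function names, the use of `None` for missing values, and the imputation step are assumptions for illustration; the real logic lives in `autosklearn.pipeline.implementations.CategoryShift`, which is not shown in this diff.

```python
from typing import List, Optional

Row = List[Optional[int]]

MISSING_CATEGORY = 2  # per the docstring: later attributed to missing values
RARE_CATEGORY = 1     # per the docstring: later assigned to rare categories


def shift_categories(rows: List[Row], offset: int = 3) -> List[Row]:
    """Return a shifted copy; None marks a missing value and is left alone."""
    return [[None if value is None else value + offset for value in row]
            for row in rows]


def impute_missing(rows: List[Row], fill_value: int = MISSING_CATEGORY) -> List[Row]:
    """Replace missing values with the reserved category."""
    return [[fill_value if value is None else value for value in row]
            for row in rows]


X = [[0, 1], [None, 2]]
shifted = shift_categories(X)
imputed = impute_missing(shifted)
```

Because the smallest original category 0 becomes 3, the reserved values never collide with real categories, and 0 stays free to represent the implicit zeros of a sparse matrix. The sketch returns new lists rather than editing `X`, in line with the commit stating that `category_shift.transform(X)` now works on a copy of `X`.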