
Conversation

eddiebergman
Contributor

This PR fixes a broken example of extending auto-sklearn with a NoPreprocessing step for data preprocessing.
It also updates the docs to make the distinction between data preprocessing and feature preprocessing clear.

Closes #1257
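
For context, a minimal sketch of how the two steps are configured independently through the include argument (the feature_preprocessor names listed here are built-in components and purely illustrative; NoPreprocessing must first be registered via add_preprocessor as in the full example further down this thread):

from autosklearn.classification import AutoSklearnClassifier

clf = AutoSklearnClassifier(
    time_left_for_this_task=120,
    include={
        # the data preprocessing step, replaced here by the custom component
        # from this PR's example (registered beforehand with add_preprocessor)
        "data_preprocessor": ["NoPreprocessing"],
        # the separate feature preprocessing step, restricted to two built-ins
        "feature_preprocessor": ["no_preprocessing", "polynomial"],
    },
)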

@codecov

codecov bot commented Oct 16, 2021

Codecov Report

Merging #1269 (eadd632) into development (502c136) will decrease coverage by 0.06%.
The diff coverage is 100.00%.

Impacted file tree graph

@@               Coverage Diff               @@
##           development    #1269      +/-   ##
===============================================
- Coverage        88.09%   88.02%   -0.07%     
===============================================
  Files              140      140              
  Lines            11144    11147       +3     
===============================================
- Hits              9817     9812       -5     
- Misses            1327     1335       +8     
Impacted Files Coverage Δ
...arn/pipeline/components/classification/__init__.py 84.94% <100.00%> (+0.16%) ⬆️
...pipeline/components/data_preprocessing/__init__.py 82.97% <100.00%> (ø)
...eline/components/feature_preprocessing/__init__.py 89.33% <100.00%> (+0.14%) ⬆️
...sklearn/pipeline/components/regression/__init__.py 83.52% <100.00%> (+0.19%) ⬆️
...eline/components/feature_preprocessing/fast_ica.py 91.30% <0.00%> (-6.53%) ⬇️
...mponents/feature_preprocessing/nystroem_sampler.py 85.29% <0.00%> (-5.89%) ⬇️
autosklearn/util/logging_.py 88.96% <0.00%> (-1.38%) ⬇️
...ine/components/classification/gradient_boosting.py 93.04% <0.00%> (-0.87%) ⬇️
...ipeline/components/regression/gradient_boosting.py 93.26% <0.00%> (+1.92%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 502c136...eadd632. Read the comment docs.

@mdbecker
Contributor

I tried the example and it works as is, but if you change n_jobs it fails with Number of crashed target algorithm runs: 5. I assume this has something to do with Dask but don't know enough to be sure. Here is a minimum example to reproduce the failure assuming you've done everything else in the example code:

clf = AutoSklearnClassifier(
    time_left_for_this_task=120,
    include={
        'data_preprocessor': ['NoPreprocessing']
    },
    # The two flags below are provided to speed up calculations
    # Not recommended for a real implementation
    initial_configurations_via_metalearning=0,
    smac_scenario_args={'runcount_limit': 5},
    n_jobs=7
)
clf.fit(X_train, y_train)
print(clf.sprint_statistics())

@eddiebergman
Contributor Author

Hi @mdbecker,

Thanks for reporting this, I'll have a look into it but I'm not sure why that would be.
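
One way to dig further (a sketch, not something run here; the path is hypothetical and the rest of the setup is assumed to be the same as in the snippet above): keep auto-sklearn's temporary output around and inspect the per-run logs, which usually contain the traceback of crashed runs.

clf = AutoSklearnClassifier(
    # ... same arguments as in the snippet above ...
    tmp_folder="/tmp/autosklearn_nopreprocessing_debug",  # hypothetical path
    delete_tmp_folder_after_terminate=False,
)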

def __init__(self, **kwargs):
    """This preprocessor does not change the data"""
    self.preprocessor = None
    # Some internal checks make sure parameters are set
Contributor


What do you mean by some internal checks?

Contributor Author


I'm not entirely sure where it is, but there is a check somewhere in the pipeline that verifies certain attributes are set on the object from the **kwargs, hence the snippet right below this:

for key, val in kwargs.items():
    setattr(self, key, val)
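
A minimal sketch of why mirroring **kwargs as attributes matters (the 'strategy' hyperparameter and the ExamplePreprocessing class are made up for illustration): when a component declares hyperparameters, auto-sklearn instantiates it with the sampled values as keyword arguments, and the pipeline later expects to read those values back as attributes of the same names.

from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm


class ExamplePreprocessing(AutoSklearnPreprocessingAlgorithm):

    def __init__(self, **kwargs):
        # Mirror every keyword argument as an attribute so the pipeline's
        # internal checks can read the configured values back later.
        for key, val in kwargs.items():
            setattr(self, key, val)

    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        cs = ConfigurationSpace()
        # 'strategy' is a hypothetical hyperparameter used only for illustration;
        # its sampled value is what gets passed to __init__ above as a kwarg.
        cs.add_hyperparameter(
            CategoricalHyperparameter("strategy", ["a", "b"], default_value="a")
        )
        return cs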

@eddiebergman
Contributor Author

Hi @mdbecker,

I couldn't reproduce the failure, I've put my entire code snippet below which could hopefully help diagnose any issues.

import autosklearn
from autosklearn.classification import AutoSklearnClassifier
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from sklearn.datasets import load_breast_cancer
import sklearn.metrics
from autosklearn.pipeline.constants import SPARSE, DENSE, UNSIGNED_DATA, INPUT
from sklearn.model_selection import train_test_split
from ConfigSpace.configuration_space import ConfigurationSpace

X, y = load_breast_cancer(return_X_y=True)


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):

    def __init__(self, **kwargs):
        """This preprocessors does not change the data"""
        # Some internal checks makes sure parameters are set
        for key, val in kwargs.items():
            setattr(self, key, val)

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        return X

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "NoPreprocessing",
            "name": "NoPreprocessing",
            "handles_regression": True,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": True,
            "handles_multioutput": True,
            "is_deterministic": True,
            "input": (SPARSE, DENSE, UNSIGNED_DATA),
            "output": (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        return ConfigurationSpace()  # Return an empty configuration space, as there are no hyperparameters


# Add NoPreprocessing component to auto-sklearn.
autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)


if __name__ == "__main__":

    clf = AutoSklearnClassifier(
        time_left_for_this_task=120,
        include={
            'data_preprocessor': ['NoPreprocessing']
        },
        # The two flags below are provided to speed up calculations
        # Not recommended for a real implementation
        initial_configurations_via_metalearning=0,
        smac_scenario_args={'runcount_limit': 5},
        n_jobs=7
    )
    clf.fit(X, y)
    print(clf.sprint_statistics())

Output:

[WARNING] [2021-11-03 11:30:21,982:Client-AutoML(1):0bb3609f-3c91-11ec-a202-ec7949506548] Capping the per_run_time_limit to 59.0 to have time for a least 2 models in each process.
auto-sklearn results:
  Dataset name: 0bb3609f-3c91-11ec-a202-ec7949506548
  Metric: accuracy
  Best validation score: 0.957447
  Number of target algorithm runs: 5
  Number of successful target algorithm runs: 4
  Number of crashed target algorithm runs: 1
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

@eddiebergman eddiebergman merged commit 89d6018 into development Nov 3, 2021
@eddiebergman eddiebergman deleted the update_example_extending_data_preprocessing branch November 3, 2021 13:36
Successfully merging this pull request may close these issues.

Turning off the data preprocessing step causes algorithms to crash