Issue in extending auto-sklearn with new categorical encoders #989

@svsaraf112

I am trying to extend auto-sklearn with new categorical encoders such as the CatBoost encoder and the target encoder, using the scikit-learn-contrib/category_encoders package. As a spot check, I am using only the CatBoost encoder and forbidding the other encoders (OHE) to see whether I get acceptable results.

Here's the script for the CatBoost encoder, which I added under autosklearn->pipeline->data_preprocessing->categorical-encoding:


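from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import DENSE, SPARSE, UNSIGNED_DATA, INPUT
# Import the encoder class itself (aliased so it does not clash with the wrapper
# class below); importing only the module would make `catboost_enc()` fail.
from category_encoders.cat_boost import CatBoostEncoder as catboost_enc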

class CatBoostEncoder(AutoSklearnPreprocessingAlgorithm):
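    def __init__(self, random_state=None):
        self.random_state = random_state
        # Set in fit(); initialised here so transform() can detect an unfitted encoder.
        self.preprocessor = None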

    def fit(self, X, y=None):
        self.preprocessor = catboost_enc()
        self.preprocessor.fit(X, y)
        return self

    def transform(self, X):
        if self.preprocessor is None:
            raise NotImplementedError()
        return self.preprocessor.transform(X)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    @staticmethod
    def get_properties(dataset_properties=None):
        return {'shortname': 'CatBoostEnc',
                'name': 'CatBoost Encoder',
                'handles_regression': True,
                'handles_classification': True,
                'handles_multiclass': True,
                'handles_multilabel': True,
                'handles_multioutput': True,
                # TODO find out if this is right!
                'handles_sparse': True,
                'handles_dense': True,
                'input': (DENSE, SPARSE, UNSIGNED_DATA),
                'output': (INPUT,), }

    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        return ConfigurationSpace()
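
Before looking at the search space, it may help to confirm that a wrapper of this shape honours the fit/transform contract on its own. The following is only a minimal sketch that runs without auto-sklearn or category_encoders: `MeanTargetEncoder` and `EncoderWrapper` are hypothetical stand-ins written for illustration, and the plain mean-target encoding used here is a simplification, not the CatBoost encoding itself.

```python
from collections import defaultdict


class MeanTargetEncoder:
    """Hypothetical stand-in for a target-based encoder (not category_encoders)."""

    def fit(self, X, y):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for value, target in zip(X, y):
            sums[value] += target
            counts[value] += 1
        # Global target mean, used as a fallback for unseen categories.
        self.prior_ = sum(y) / len(y)
        self.means_ = {key: sums[key] / counts[key] for key in sums}
        return self

    def transform(self, X):
        return [self.means_.get(value, self.prior_) for value in X]


class EncoderWrapper:
    """Hypothetical wrapper with the same fit/transform shape as the class above."""

    def __init__(self, random_state=None):
        self.random_state = random_state
        self.preprocessor = None  # assigned in fit()

    def fit(self, X, y=None):
        self.preprocessor = MeanTargetEncoder()
        self.preprocessor.fit(X, y)
        return self

    def transform(self, X):
        if self.preprocessor is None:
            raise NotImplementedError()
        return self.preprocessor.transform(X)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


X = ["a", "b", "a", "c"]
y = [1, 0, 0, 1]
encoded = EncoderWrapper().fit_transform(X, y)
print(encoded)
```

Because `preprocessor` is assigned in `__init__`, calling `transform` before `fit` reaches the explicit `NotImplementedError` branch rather than failing with an `AttributeError`.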

As a spot check, I printed the encoders available at run time:

[Image: Screenshot 2020-11-02 at 2 40 27 PM]

But the output pipeline looks something like this:

[Image: Screenshot 2020-11-02 at 2 43 42 PM]

Is there something I am missing here? Is there any documentation I can refer to for adding new categorical encoders? I would appreciate any help.

Thanks
