Merged
Changes from 13 commits
Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_binary.classification_dense/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_binary.classification_sparse/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_multiclass.classification_dense/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_multiclass.classification_sparse/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/mean_squared_error_regression_dense/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/r2_regression_dense/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/r2_regression_sparse/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/recall_binary.classification_dense/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/recall_binary.classification_sparse/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/roc_auc_binary.classification_dense/configurations.csv
100755 → 100644

@@ -17,13 +17,13 @@
class BagOfWordEncoder(AutoSklearnPreprocessingAlgorithm):
def __init__(
self,
ngram_range: int = 1,
ngram_upper_bound: int = 1,
min_df_choice: str = "min_df_absolute",
min_df_absolute: int = 0,
min_df_relative: float = 0.01,
random_state: Optional[Union[int, np.random.RandomState]] = None,
) -> None:
self.ngram_range = ngram_range
self.ngram_upper_bound = ngram_upper_bound
self.random_state = random_state
self.min_df_choice = min_df_choice
self.min_df_absolute = min_df_absolute
@@ -46,13 +46,13 @@ def fit(
if self.min_df_choice == "min_df_absolute":
self.preprocessor = CountVectorizer(
min_df=self.min_df_absolute,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

elif self.min_df_choice == "min_df_relative":
self.preprocessor = CountVectorizer(
min_df=self.min_df_relative,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

else:
@@ -98,8 +98,8 @@ def get_hyperparameter_search_space(
dataset_properties: Optional[DATASET_PROPERTIES_TYPE] = None,
) -> ConfigurationSpace:
cs = ConfigurationSpace()
hp_ngram_range = CSH.UniformIntegerHyperparameter(
name="ngram_range", lower=1, upper=3, default_value=1
hp_ngram_upper_bound = CSH.UniformIntegerHyperparameter(
name="ngram_upper_bound", lower=1, upper=3, default_value=1
)
hp_min_df_choice_bow = CSH.CategoricalHyperparameter(
"min_df_choice", choices=["min_df_absolute", "min_df_relative"]
@@ -112,7 +112,7 @@ def get_hyperparameter_search_space(
)
cs.add_hyperparameters(
[
hp_ngram_range,
hp_ngram_upper_bound,
hp_min_df_choice_bow,
hp_min_df_absolute_bow,
hp_min_df_relative_bow,
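The rename in the hunks above makes explicit that the integer hyperparameter is only the upper bound of the n-gram range the encoder passes to scikit-learn as `ngram_range=(1, ngram_upper_bound)`. A minimal sketch of what that upper bound changes (example documents invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]

# upper bound 1: unigrams only
v1 = CountVectorizer(ngram_range=(1, 1)).fit(docs)
# upper bound 2: unigrams and bigrams, mirroring ngram_range=(1, ngram_upper_bound)
v2 = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(v1.vocabulary_))  # ['cat', 'mat', 'on', 'sat', 'the']
print(sorted(v2.vocabulary_))  # adds bigrams such as 'the cat', 'cat sat'
```

The lower bound is always 1 in this encoder, which is why a single integer hyperparameter suffices.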
@@ -16,13 +16,13 @@
class BagOfWordEncoder(AutoSklearnPreprocessingAlgorithm):
def __init__(
self,
ngram_range: int = 1,
ngram_upper_bound: int = 1,
min_df_choice: str = "min_df_absolute",
min_df_absolute: int = 0,
min_df_relative: float = 0.01,
random_state: Optional[Union[int, np.random.RandomState]] = None,
) -> None:
self.ngram_range = ngram_range
self.ngram_upper_bound = ngram_upper_bound
self.random_state = random_state
self.min_df_choice = min_df_choice
self.min_df_absolute = min_df_absolute
@@ -40,7 +40,8 @@ def fit(

for feature in X.columns:
vectorizer = CountVectorizer(
min_df=self.min_df_absolute, ngram_range=(1, self.ngram_range)
min_df=self.min_df_absolute,
ngram_range=(1, self.ngram_upper_bound),
).fit(X[feature])
self.preprocessor[feature] = vectorizer

@@ -50,7 +51,8 @@ def fit(

for feature in X.columns:
vectorizer = CountVectorizer(
min_df=self.min_df_relative, ngram_range=(1, self.ngram_range)
min_df=self.min_df_relative,
ngram_range=(1, self.ngram_upper_bound),
).fit(X[feature])
self.preprocessor[feature] = vectorizer
else:
@@ -102,8 +104,8 @@ def get_hyperparameter_search_space(
dataset_properties: Optional[DATASET_PROPERTIES_TYPE] = None,
) -> ConfigurationSpace:
cs = ConfigurationSpace()
hp_ngram_range = CSH.UniformIntegerHyperparameter(
name="ngram_range", lower=1, upper=3, default_value=1
hp_ngram_upper_bound = CSH.UniformIntegerHyperparameter(
name="ngram_upper_bound", lower=1, upper=3, default_value=1
)
hp_min_df_choice_bow = CSH.CategoricalHyperparameter(
"min_df_choice", choices=["min_df_absolute", "min_df_relative"]
Expand All @@ -116,7 +118,7 @@ def get_hyperparameter_search_space(
)
cs.add_hyperparameters(
[
hp_ngram_range,
hp_ngram_upper_bound,
hp_min_df_choice_bow,
hp_min_df_absolute_bow,
hp_min_df_relative_bow,
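This second variant fits one CountVectorizer per dataframe column and keeps the fitted vectorizers in a dict keyed by column name. A small sketch of that per-column pattern (column names and documents are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

X = pd.DataFrame({
    "title": pd.Series(["cheap flights", "flight deals today"], dtype="string"),
    "body": pd.Series(["book a cheap flight now", "daily deals on flights"], dtype="string"),
})

preprocessor = {}
for feature in X.columns:
    # one vectorizer per text column, as in the per-column fit loop above
    preprocessor[feature] = CountVectorizer(
        min_df=1, ngram_range=(1, 2)
    ).fit(X[feature])

print({name: len(vec.vocabulary_) for name, vec in preprocessor.items()})
```

Each column gets its own vocabulary, so transform-time output is the per-column count matrices rather than one shared bag of words.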
@@ -17,14 +17,14 @@
class TfidfEncoder(AutoSklearnPreprocessingAlgorithm):
def __init__(
self,
ngram_range: int = 1,
ngram_upper_bound: int = 1,
use_idf: bool = True,
min_df_choice: str = "min_df_absolute",
min_df_absolute: int = 0,
min_df_relative: float = 0.01,
random_state: Optional[Union[int, np.random.RandomState]] = None,
) -> None:
self.ngram_range = ngram_range
self.ngram_upper_bound = ngram_upper_bound
self.random_state = random_state
self.use_idf = use_idf
self.min_df_choice = min_df_choice
@@ -50,14 +50,14 @@ def fit(
self.preprocessor = TfidfVectorizer(
min_df=self.min_df_absolute,
use_idf=self.use_idf,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

elif self.min_df_choice == "min_df_relative":
self.preprocessor = TfidfVectorizer(
min_df=self.min_df_relative,
use_idf=self.use_idf,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

else:
@@ -103,8 +103,8 @@ def get_hyperparameter_search_space(
dataset_properties: Optional[DATASET_PROPERTIES_TYPE] = None,
) -> ConfigurationSpace:
cs = ConfigurationSpace()
hp_ngram_range = CSH.UniformIntegerHyperparameter(
name="ngram_range", lower=1, upper=3, default_value=1
hp_ngram_upper_bound = CSH.UniformIntegerHyperparameter(
name="ngram_upper_bound", lower=1, upper=3, default_value=1
)
hp_use_idf = CSH.CategoricalHyperparameter("use_idf", choices=[False, True])
hp_min_df_choice = CSH.CategoricalHyperparameter(
@@ -118,7 +118,7 @@ def get_hyperparameter_search_space(
)
cs.add_hyperparameters(
[
hp_ngram_range,
hp_ngram_upper_bound,
hp_use_idf,
hp_min_df_choice,
hp_min_df_absolute,
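The Tfidf encoder carries the same two `min_df` branches as the bag-of-words encoders: `min_df_absolute` is a document count, `min_df_relative` a document fraction, matching scikit-learn's int-vs-float interpretation of `min_df`. A sketch of how the two behave (toy corpus invented here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "good acting"]

# min_df as an absolute document count (the "min_df_absolute" branch):
# keep terms appearing in at least 2 documents
tfidf_abs = TfidfVectorizer(min_df=2, use_idf=True, ngram_range=(1, 1)).fit(docs)

# min_df as a document fraction (the "min_df_relative" branch):
# keep terms appearing in at least 50% of documents
tfidf_rel = TfidfVectorizer(min_df=0.5, use_idf=True, ngram_range=(1, 1)).fit(docs)

print(sorted(tfidf_abs.vocabulary_))  # ['good', 'movie']
```

On this corpus both cutoffs drop the singleton terms `bad` and `acting`; `use_idf` only changes the weighting of the surviving terms, not the vocabulary.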
120 changes: 67 additions & 53 deletions examples/40_advanced/example_text_preprocessing.py
@@ -1,79 +1,93 @@
# -*- encoding: utf-8 -*-
"""
==================
Text Preprocessing
Text preprocessing
==================
This example shows, how to use text features in *auto-sklearn*. *auto-sklearn* can automatically
encode text features if they are provided as string type in a pandas dataframe.

For processing text features you need a pandas dataframe and set the desired
text columns to string and the categorical columns to category.
The following example shows how to fit a simple NLP problem with
*auto-sklearn*.

*auto-sklearn* text embedding creates a bag of words count.
For an introduction to text preprocessing you can follow these links:
1. https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
2. https://machinelearningmastery.com/clean-text-machine-learning-python/
"""
from pprint import pprint

import pandas as pd
import sklearn.metrics
import sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

import autosklearn.classification

############################################################################
# Data Loading
# ============
cats = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"]
X_train, y_train = fetch_20newsgroups(
subset="train", # select train set
shuffle=True, # shuffle the data set for unbiased validation results
random_state=42, # set a random seed for reproducibility
categories=cats, # select only 2 out of 20 labels
return_X_y=True, # 20NG dataset consists of 2 columns X: the text data, y: the label
) # load these two columns separately as numpy arrays

X_test, y_test = fetch_20newsgroups(
subset="test", # select test set for unbiased evaluation
categories=cats, # select only 2 out of 20 labels
return_X_y=True, # 20NG dataset consists of 2 columns X: the text data, y: the label
) # load these two columns separately as numpy arrays

X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True)
############################################################################
# Creating a pandas dataframe
# ===========================
# Both categorical and text features are often strings. Python Pandas stores Python strings
# in the generic `object` type. Please ensure that the correct
# `dtype <https://pandas.pydata.org/docs/user_guide/basics.html#dtypes>` is applied to the correct
# column.

# by default, the columns which should be strings are not formatted as such
print(f"{X.info()}\n")
# create a pandas dataframe for training, labeling the "Text" column as string
X_train = pd.DataFrame({"Text": pd.Series(X_train, dtype="string")})

# manually convert these to string columns
X = X.astype(
{
"name": "string",
"ticket": "string",
"cabin": "string",
"boat": "string",
"home.dest": "string",
}
)
# create a pandas dataframe for testing, labeling the "Text" column as string
X_test = pd.DataFrame({"Text": pd.Series(X_test, dtype="string")})

# now *auto-sklearn* handles the string columns with its text feature preprocessing pipeline

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
X, y, random_state=1
############################################################################
# Build and fit a classifier
# ==========================

# create an autosklearn Classifier or Regressor depending on your task at hand.
automl = autosklearn.classification.AutoSklearnClassifier(
time_left_for_this_task=60, # absolute time limit for fitting the ensemble
Suggested change
time_left_for_this_task=60, # absolute time limit for fitting the ensemble
time_left_for_this_task=60,

per_run_time_limit=30, # time limit for single models (ensures seeing a variety of models)
tmp_folder="/tmp/autosklearn_text_example_tmp",
)

cls = autosklearn.classification.AutoSklearnClassifier(
time_left_for_this_task=30,
# Bellow two flags are provided to speed up calculations
# Not recommended for a real implementation
initial_configurations_via_metalearning=0,
smac_scenario_args={"runcount_limit": 1},
automl.fit( # fit our model to the training data
X=X_train, # passing training data (encoded as pandas dataframe)
y=y_train, # passing training labels
# ('array like' object: pandas Series, numpy array, python list etc.)
# mapping from X --> y is given by the index; ensure that the index of X and y
# match each other
dataset_name="20_Newsgroups",
)

cls.fit(X_train, y_train, X_test, y_test)

predictions = cls.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

############################################################################
# View the models found by auto-sklearn
# =====================================

X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True, as_frame=True)
X = X.select_dtypes(exclude=["object"])
print(automl.leaderboard())

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
X, y, random_state=1
)
############################################################################
# Print the final ensemble constructed by auto-sklearn
# ====================================================

cls = autosklearn.classification.AutoSklearnClassifier(
time_left_for_this_task=30,
# Bellow two flags are provided to speed up calculations
# Not recommended for a real implementation
initial_configurations_via_metalearning=0,
smac_scenario_args={"runcount_limit": 1},
)
pprint(automl.show_models(), indent=4)

cls.fit(X_train, y_train, X_test, y_test)
###########################################################################
# Get the Score of the final ensemble
# ===================================

predictions = cls.predict(X_test)
print(
"Accuracy score without text preprocessing",
sklearn.metrics.accuracy_score(y_test, predictions),
)
# get predictions for formerly unseen data. Ensure that the data has the same format as the training
# data (this also applies to the column names of the pandas dataframe).
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))
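The example hinges on the text columns carrying the pandas `string` dtype rather than the generic `object` dtype, since that is how the string columns are flagged for *auto-sklearn*'s text preprocessing pipeline. A minimal sketch of the difference (sample texts invented):

```python
import pandas as pd

raw = ["From: fan@example.org  Great game last night", "IDE controller question"]

# default: python strings land in the generic `object` dtype
df_obj = pd.DataFrame({"Text": raw})

# explicit string dtype, as the example builds its train/test frames
df_str = pd.DataFrame({"Text": pd.Series(raw, dtype="string")})

print(df_obj["Text"].dtype, df_str["Text"].dtype)  # object string
```

An equivalent conversion on an existing frame is `df = df.astype({"Text": "string"})`.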