Merged
2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_binary.classification_dense/configurations.csv
100755 → 100644
2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_binary.classification_sparse/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_multiclass.classification_dense/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_multiclass.classification_sparse/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/mean_squared_error_regression_dense/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/r2_regression_dense/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/r2_regression_sparse/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/recall_binary.classification_dense/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/recall_binary.classification_sparse/configurations.csv
100755 → 100644

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/roc_auc_binary.classification_dense/configurations.csv
100755 → 100644

@@ -17,13 +17,13 @@
class BagOfWordEncoder(AutoSklearnPreprocessingAlgorithm):
def __init__(
self,
ngram_range: int = 1,
ngram_upper_bound: int = 1,
min_df_choice: str = "min_df_absolute",
min_df_absolute: int = 0,
min_df_relative: float = 0.01,
random_state: Optional[Union[int, np.random.RandomState]] = None,
) -> None:
self.ngram_range = ngram_range
self.ngram_upper_bound = ngram_upper_bound
self.random_state = random_state
self.min_df_choice = min_df_choice
self.min_df_absolute = min_df_absolute
@@ -46,13 +46,13 @@ def fit(
if self.min_df_choice == "min_df_absolute":
self.preprocessor = CountVectorizer(
min_df=self.min_df_absolute,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

elif self.min_df_choice == "min_df_relative":
self.preprocessor = CountVectorizer(
min_df=self.min_df_relative,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

else:
@@ -98,8 +98,8 @@ def get_hyperparameter_search_space(
dataset_properties: Optional[DATASET_PROPERTIES_TYPE] = None,
) -> ConfigurationSpace:
cs = ConfigurationSpace()
hp_ngram_range = CSH.UniformIntegerHyperparameter(
name="ngram_range", lower=1, upper=3, default_value=1
hp_ngram_upper_bound = CSH.UniformIntegerHyperparameter(
name="ngram_upper_bound", lower=1, upper=3, default_value=1
)
hp_min_df_choice_bow = CSH.CategoricalHyperparameter(
"min_df_choice", choices=["min_df_absolute", "min_df_relative"]
@@ -112,7 +112,7 @@
)
cs.add_hyperparameters(
[
hp_ngram_range,
hp_ngram_upper_bound,
hp_min_df_choice_bow,
hp_min_df_absolute_bow,
hp_min_df_relative_bow,
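The rename above changes only the parameter name; the behavior — passing an upper n-gram bound to scikit-learn's `CountVectorizer` as `ngram_range=(1, n)` — is unchanged. A minimal standalone sketch of what that upper bound controls (the documents here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran"]

# upper bound 1: unigrams only
uni = CountVectorizer(ngram_range=(1, 1)).fit(docs)
# upper bound 2: unigrams and bigrams
bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(uni.vocabulary_))  # ['cat', 'ran', 'sat', 'the']
print(sorted(bi.vocabulary_))   # adds 'cat ran', 'cat sat', 'the cat'
```

This is why `ngram_upper_bound` is the clearer name: the lower bound is always fixed at 1.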
@@ -16,13 +16,13 @@
class BagOfWordEncoder(AutoSklearnPreprocessingAlgorithm):
def __init__(
self,
ngram_range: int = 1,
ngram_upper_bound: int = 1,
min_df_choice: str = "min_df_absolute",
min_df_absolute: int = 0,
min_df_relative: float = 0.01,
random_state: Optional[Union[int, np.random.RandomState]] = None,
) -> None:
self.ngram_range = ngram_range
self.ngram_upper_bound = ngram_upper_bound
self.random_state = random_state
self.min_df_choice = min_df_choice
self.min_df_absolute = min_df_absolute
@@ -40,7 +40,8 @@ def fit(

for feature in X.columns:
vectorizer = CountVectorizer(
min_df=self.min_df_absolute, ngram_range=(1, self.ngram_range)
min_df=self.min_df_absolute,
ngram_range=(1, self.ngram_upper_bound),
).fit(X[feature])
self.preprocessor[feature] = vectorizer

@@ -50,7 +51,8 @@ def fit(

for feature in X.columns:
vectorizer = CountVectorizer(
min_df=self.min_df_relative, ngram_range=(1, self.ngram_range)
min_df=self.min_df_relative,
ngram_range=(1, self.ngram_upper_bound),
).fit(X[feature])
self.preprocessor[feature] = vectorizer
else:
@@ -102,8 +104,8 @@ def get_hyperparameter_search_space(
dataset_properties: Optional[DATASET_PROPERTIES_TYPE] = None,
) -> ConfigurationSpace:
cs = ConfigurationSpace()
hp_ngram_range = CSH.UniformIntegerHyperparameter(
name="ngram_range", lower=1, upper=3, default_value=1
hp_ngram_upper_bound = CSH.UniformIntegerHyperparameter(
name="ngram_upper_bound", lower=1, upper=3, default_value=1
)
hp_min_df_choice_bow = CSH.CategoricalHyperparameter(
"min_df_choice", choices=["min_df_absolute", "min_df_relative"]
@@ -116,7 +118,7 @@
)
cs.add_hyperparameters(
[
hp_ngram_range,
hp_ngram_upper_bound,
hp_min_df_choice_bow,
hp_min_df_absolute_bow,
hp_min_df_relative_bow,
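This second variant of `BagOfWordEncoder` keeps one fitted `CountVectorizer` per dataframe column instead of a single shared one. A hedged sketch of that per-column pattern (the column names and data are illustrative, not from the PR):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

X = pd.DataFrame(
    {
        "title": ["red shoes", "blue shoes"],
        "body": ["great fit", "too small"],
    }
)

# one vectorizer per text column, mirroring the fit() loop in the diff
preprocessor = {}
for feature in X.columns:
    preprocessor[feature] = CountVectorizer(
        min_df=1, ngram_range=(1, 1)
    ).fit(X[feature])

# each column gets its own vocabulary
print({name: len(vec.vocabulary_) for name, vec in preprocessor.items()})
```

Keeping a dict of vectorizers lets each text column build its own vocabulary, at the cost of one fit per column.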
@@ -17,14 +17,14 @@
class TfidfEncoder(AutoSklearnPreprocessingAlgorithm):
def __init__(
self,
ngram_range: int = 1,
ngram_upper_bound: int = 1,
use_idf: bool = True,
min_df_choice: str = "min_df_absolute",
min_df_absolute: int = 0,
min_df_relative: float = 0.01,
random_state: Optional[Union[int, np.random.RandomState]] = None,
) -> None:
self.ngram_range = ngram_range
self.ngram_upper_bound = ngram_upper_bound
self.random_state = random_state
self.use_idf = use_idf
self.min_df_choice = min_df_choice
@@ -50,14 +50,14 @@ def fit(
self.preprocessor = TfidfVectorizer(
min_df=self.min_df_absolute,
use_idf=self.use_idf,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

elif self.min_df_choice == "min_df_relative":
self.preprocessor = TfidfVectorizer(
min_df=self.min_df_relative,
use_idf=self.use_idf,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

else:
@@ -103,8 +103,8 @@ def get_hyperparameter_search_space(
dataset_properties: Optional[DATASET_PROPERTIES_TYPE] = None,
) -> ConfigurationSpace:
cs = ConfigurationSpace()
hp_ngram_range = CSH.UniformIntegerHyperparameter(
name="ngram_range", lower=1, upper=3, default_value=1
hp_ngram_upper_bound = CSH.UniformIntegerHyperparameter(
name="ngram_upper_bound", lower=1, upper=3, default_value=1
)
hp_use_idf = CSH.CategoricalHyperparameter("use_idf", choices=[False, True])
hp_min_df_choice = CSH.CategoricalHyperparameter(
@@ -118,7 +118,7 @@
)
cs.add_hyperparameters(
[
hp_ngram_range,
hp_ngram_upper_bound,
hp_use_idf,
hp_min_df_choice,
hp_min_df_absolute,
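`TfidfEncoder` wraps scikit-learn's `TfidfVectorizer`, where `use_idf` toggles inverse-document-frequency weighting on top of the same n-gram bound. A minimal sketch of the effect, separate from the auto-sklearn wrapper (documents here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cat cat dog", "dog bird", "bird bird bird"]

# with IDF weighting, terms appearing in many documents are down-weighted
with_idf = TfidfVectorizer(use_idf=True, ngram_range=(1, 1)).fit(docs)
# without it, scores reduce to (normalized) term frequencies
no_idf = TfidfVectorizer(use_idf=False, ngram_range=(1, 1)).fit(docs)

print(sorted(with_idf.vocabulary_))  # ['bird', 'cat', 'dog']
print(with_idf.idf_)  # one IDF weight per vocabulary term
```

The hyperparameter search space above simply exposes `use_idf` as a categorical choice and `ngram_upper_bound` as an integer in [1, 3].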
113 changes: 59 additions & 54 deletions examples/40_advanced/example_text_preprocessing.py
@@ -1,79 +1,84 @@
# -*- encoding: utf-8 -*-
"""
==================
Text Preprocessing
Text preprocessing
==================
This example shows how to use text features in *auto-sklearn*. *auto-sklearn* can automatically
encode text features if they are provided as string type in a pandas dataframe.

To process text features, you need a pandas dataframe with the desired
text columns set to string and the categorical columns set to category.
The following example shows how to fit a simple NLP problem with
*auto-sklearn*.

*auto-sklearn*'s text embedding creates a bag-of-words count.
For an introduction to text preprocessing you can follow these links:
1. https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
2. https://machinelearningmastery.com/clean-text-machine-learning-python/
"""
from pprint import pprint

import pandas as pd
import sklearn.metrics
import sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

import autosklearn.classification

############################################################################
# Data Loading
# ============
cats = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"]
X_train, y_train = fetch_20newsgroups(
subset="train", # select train set
shuffle=True, # shuffle the data set for unbiased validation results
random_state=42, # set a random seed for reproducibility
categories=cats, # select only 2 out of 20 labels
return_X_y=True, # 20NG dataset consists of 2 columns X: the text data, y: the label
) # load these two columns separately as numpy arrays

X_test, y_test = fetch_20newsgroups(
subset="test", # select test set for unbiased evaluation
categories=cats, # select only 2 out of 20 labels
return_X_y=True, # 20NG dataset consists of 2 columns X: the text data, y: the label
) # load these two columns separately as numpy arrays

X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True)

# by default, the columns which should be strings are not formatted as such
print(f"{X.info()}\n")

# manually convert these to string columns
X = X.astype(
{
"name": "string",
"ticket": "string",
"cabin": "string",
"boat": "string",
"home.dest": "string",
}
)
############################################################################
# Creating a pandas dataframe
# ===========================
# Both categorical and text features are often strings. pandas stores Python strings
# in the generic `object` type. Please ensure that the correct
# `dtype <https://pandas.pydata.org/docs/user_guide/basics.html#dtypes>`_ is applied to the correct
# column.

# now *auto-sklearn* handles the string columns with its text feature preprocessing pipeline
# create a pandas dataframe for training, labeling the "Text" column as string
X_train = pd.DataFrame({"Text": pd.Series(X_train, dtype="string")})

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
X, y, random_state=1
)
# create a pandas dataframe for testing, labeling the "Text" column as string
X_test = pd.DataFrame({"Text": pd.Series(X_test, dtype="string")})

cls = autosklearn.classification.AutoSklearnClassifier(
time_left_for_this_task=30,
# The two flags below are provided to speed up calculations;
# not recommended for a real implementation
initial_configurations_via_metalearning=0,
smac_scenario_args={"runcount_limit": 1},
############################################################################
# Build and fit a classifier
# ==========================

# create an autosklearn Classifier or Regressor depending on your task at hand.
automl = autosklearn.classification.AutoSklearnClassifier(
time_left_for_this_task=60,
per_run_time_limit=30,
tmp_folder="/tmp/autosklearn_text_example_tmp",
)

cls.fit(X_train, y_train, X_test, y_test)

predictions = cls.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))
automl.fit(X_train, y_train, dataset_name="20_Newsgroups") # fit the automl model

############################################################################
# View the models found by auto-sklearn
# =====================================

X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True, as_frame=True)
X = X.select_dtypes(exclude=["object"])
print(automl.leaderboard())

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
X, y, random_state=1
)
############################################################################
# Print the final ensemble constructed by auto-sklearn
# ====================================================

cls = autosklearn.classification.AutoSklearnClassifier(
time_left_for_this_task=30,
# The two flags below are provided to speed up calculations;
# not recommended for a real implementation
initial_configurations_via_metalearning=0,
smac_scenario_args={"runcount_limit": 1},
)
pprint(automl.show_models(), indent=4)

cls.fit(X_train, y_train, X_test, y_test)
###########################################################################
# Get the Score of the final ensemble
# ===================================

predictions = cls.predict(X_test)
print(
"Accuracy score without text preprocessing",
sklearn.metrics.accuracy_score(y_test, predictions),
)
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))
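As the rewritten example stresses, only columns carrying the pandas `string` dtype are routed through the text-preprocessing pipeline; plain Python strings default to the generic `object` dtype. A minimal sketch of the conversion, independent of the 20 Newsgroups data above:

```python
import pandas as pd

X = pd.DataFrame({"Text": ["first document", "second document"]})
print(X.dtypes["Text"])  # object — pandas' default for Python strings

# mark the column as text so it is treated as a string feature
X = X.astype({"Text": "string"})
print(X.dtypes["Text"])  # string
```

The same conversion works column by column via `pd.Series(..., dtype="string")`, as the example does when building `X_train` and `X_test`.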