Change HP Name & Include Text example #1410
Changes from all commits
45b3b7e
e3ef23f
666086d
db62290
cdaeed5
83bfb75
efadf85
e7d5db4
328dcae
7b1112f
f96b758
38e5e2f
93d2164
bac27b9
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -1,79 +1,84 @@ | ||||||
# -*- encoding: utf-8 -*- | ||||||
""" | ||||||
================== | ||||||
Text Preprocessing | ||||||
Text preprocessing | ||||||
================== | ||||||
This example shows how to use text features in *auto-sklearn*. *auto-sklearn* can automatically | ||||||
encode text features if they are provided as string type in a pandas dataframe. | ||||||
|
||||||
To process text features, you need a pandas dataframe with the desired | ||||||
text columns set to string and the categorical columns set to category. | ||||||
The following example shows how to fit a simple NLP problem with | ||||||
*auto-sklearn*. | ||||||
|
||||||
*auto-sklearn* text embedding creates a bag of words count. | ||||||
For an introduction to text preprocessing you can follow these links: | ||||||
1. https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html | ||||||
2. https://machinelearningmastery.com/clean-text-machine-learning-python/ | ||||||
""" | ||||||
from pprint import pprint | ||||||
|
||||||
import pandas as pd | ||||||
import sklearn.metrics | ||||||
import sklearn.datasets | ||||||
from sklearn.datasets import fetch_20newsgroups | ||||||
|
||||||
import autosklearn.classification | ||||||
|
||||||
############################################################################ | ||||||
# Data Loading | ||||||
# ============ | ||||||
cats = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"] | ||||||
X_train, y_train = fetch_20newsgroups( | ||||||
    subset="train", # select train set | ||||||
    shuffle=True, # shuffle the data set for unbiased validation results | ||||||
    random_state=42, # set a random seed for reproducibility | ||||||
    categories=cats, # select only 2 out of 20 labels | ||||||
    return_X_y=True, # 20NG dataset consists of 2 columns X: the text data, y: the label | ||||||
) # load these two columns separately as numpy array | ||||||
|
||||||
X_test, y_test = fetch_20newsgroups( | ||||||
    subset="test", # select test set for unbiased evaluation | ||||||
    categories=cats, # select only 2 out of 20 labels | ||||||
    return_X_y=True, # 20NG dataset consists of 2 columns X: the text data, y: the label | ||||||
) # load these two columns separately as numpy array | ||||||
|
||||||
X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True) | ||||||
|
||||||
# by default, the columns which should be strings are not formatted as such | ||||||
print(f"{X.info()}\n") | ||||||
|
||||||
# manually convert these to string columns | ||||||
X = X.astype( | ||||||
{ | ||||||
"name": "string", | ||||||
"ticket": "string", | ||||||
"cabin": "string", | ||||||
"boat": "string", | ||||||
"home.dest": "string", | ||||||
} | ||||||
) | ||||||
############################################################################ | ||||||
# Creating a pandas dataframe | ||||||
# =========================== | ||||||
# Both categorical and text features are often strings. pandas stores Python strings | ||||||
# in the generic `object` type. Please ensure that the correct | ||||||
# `dtype <https://pandas.pydata.org/docs/user_guide/basics.html#dtypes>`_ is applied to the correct | ||||||
# column. | ||||||
|
||||||
# now *auto-sklearn* handles the string columns with its text feature preprocessing pipeline | ||||||
# create a pandas dataframe for training, labeling the "Text" column as string | ||||||
X_train = pd.DataFrame({"Text": pd.Series(X_train, dtype="string")}) | ||||||
|
||||||
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split( | ||||||
X, y, random_state=1 | ||||||
) | ||||||
# create a pandas dataframe for testing, labeling the "Text" column as string | ||||||
X_test = pd.DataFrame({"Text": pd.Series(X_test, dtype="string")}) | ||||||
|
||||||
cls = autosklearn.classification.AutoSklearnClassifier( | ||||||
time_left_for_this_task=30, | ||||||
# The two flags below are provided to speed up calculations | ||||||
# Not recommended for a real implementation | ||||||
initial_configurations_via_metalearning=0, | ||||||
smac_scenario_args={"runcount_limit": 1}, | ||||||
############################################################################ | ||||||
# Build and fit a classifier | ||||||
# ========================== | ||||||
|
||||||
# create an autosklearn Classifier or Regressor depending on your task at hand. | ||||||
automl = autosklearn.classification.AutoSklearnClassifier( | ||||||
    time_left_for_this_task=60, | ||||||
    per_run_time_limit=30, | ||||||
    tmp_folder="/tmp/autosklearn_text_example_tmp", | ||||||
) | ||||||
|
||||||
cls.fit(X_train, y_train, X_test, y_test) | ||||||
|
||||||
predictions = cls.predict(X_test) | ||||||
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions)) | ||||||
automl.fit(X_train, y_train, dataset_name="20_Newsgroups") # fit the automl model | ||||||
|
||||||
############################################################################ | ||||||
# View the models found by auto-sklearn | ||||||
# ===================================== | ||||||
|
||||||
X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True, as_frame=True) | ||||||
X = X.select_dtypes(exclude=["object"]) | ||||||
print(automl.leaderboard()) | ||||||
|
||||||
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split( | ||||||
X, y, random_state=1 | ||||||
) | ||||||
############################################################################ | ||||||
# Print the final ensemble constructed by auto-sklearn | ||||||
# ==================================================== | ||||||
|
||||||
cls = autosklearn.classification.AutoSklearnClassifier( | ||||||
time_left_for_this_task=30, | ||||||
# The two flags below are provided to speed up calculations | ||||||
# Not recommended for a real implementation | ||||||
initial_configurations_via_metalearning=0, | ||||||
smac_scenario_args={"runcount_limit": 1}, | ||||||
) | ||||||
pprint(automl.show_models(), indent=4) | ||||||
|
||||||
cls.fit(X_train, y_train, X_test, y_test) | ||||||
########################################################################### | ||||||
# Get the Score of the final ensemble | ||||||
# =================================== | ||||||
|
||||||
predictions = cls.predict(X_test) | ||||||
print( | ||||||
"Accuracy score without text preprocessing", | ||||||
sklearn.metrics.accuracy_score(y_test, predictions), | ||||||
) | ||||||
predictions = automl.predict(X_test) | ||||||
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions)) |
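The docstring in this diff states that the *auto-sklearn* text embedding creates a bag of words count. As a rough illustration of what such a count looks like, here is a minimal plain-Python sketch. The `bag_of_words` helper and the two sample sentences are hypothetical, the tokenization is a simple lowercase whitespace split chosen for brevity, and this is not auto-sklearn's actual implementation, which the diff does not show.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of strings.

    Hypothetical helper for illustration only: it lowercases each
    document, splits on whitespace, and counts how often every
    vocabulary word occurs in every document.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # One row per document, one column per vocabulary word.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Made-up sentences loosely echoing the two newsgroup topics in the example.
docs = ["the pitcher threw the ball", "the disk controller failed"]
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

Each document becomes a fixed-length numeric vector, which is the kind of input a downstream classifier can consume. Word order is discarded, which is the usual trade-off of a bag-of-words representation.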
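The example asks for text columns to be typed as string and categorical columns as category. A minimal sketch of that dtype marking with pandas follows, assuming a small made-up dataframe. The `Source` column and all values are illustrative and do not appear in the PR; only the `Text` column name is taken from the example.

```python
import pandas as pd

# Made-up data: "Text" holds free text, "Source" holds a small set of labels.
raw = pd.DataFrame(
    {
        "Text": ["the pitcher threw the ball", "the disk controller failed"],
        "Source": ["mail", "forum"],  # hypothetical categorical column
    }
)

# Mark each column explicitly so that text and categorical features
# can be told apart by dtype rather than both looking like plain strings.
typed = raw.astype({"Text": "string", "Source": "category"})

print(typed.dtypes)
```

After the conversion, `typed["Text"]` carries the pandas string dtype and `typed["Source"]` carries the category dtype, which is the distinction the example's docstring asks for.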