
Commit 3b4b40e

author
Github Actions
committed
Lukas Strack: Change HP Name & Include Text example (#1410)
1 parent 9c94847 commit 3b4b40e

70 files changed

+2221
-1824
lines changed

development/_downloads/4f9b78e1d6464520c85232e30bf19d2b/example_text_preprocessing.ipynb

Lines changed: 93 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@
1515
"cell_type": "markdown",
1616
"metadata": {},
1717
"source": [
18-
"\n# Text Preprocessing\nThis example shows, how to use text features in *auto-sklearn*. *auto-sklearn* can automatically\nencode text features if they are provided as string type in a pandas dataframe.\n\nFor processing text features you need a pandas dataframe and set the desired\ntext columns to string and the categorical columns to category.\n\n*auto-sklearn* text embedding creates a bag of words count.\n"
18+
"\n# Text preprocessing\n\nThe following example shows how to fit a simple NLP problem with\n*auto-sklearn*.\n\nFor an introduction to text preprocessing you can follow these links:\n 1. https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html\n 2. https://machinelearningmastery.com/clean-text-machine-learning-python/\n"
1919
]
2020
},
2121
{
@@ -26,7 +26,7 @@
2626
},
2727
"outputs": [],
2828
"source": [
29-
"import sklearn.metrics\nimport sklearn.datasets\nimport autosklearn.classification"
29+
"from pprint import pprint\n\nimport pandas as pd\nimport sklearn.metrics\nfrom sklearn.datasets import fetch_20newsgroups\n\nimport autosklearn.classification"
3030
]
3131
},
3232
{
@@ -44,7 +44,97 @@
4444
},
4545
"outputs": [],
4646
"source": [
47-
"X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True)\n\n# by default, the columns which should be strings are not formatted as such\nprint(f\"{X.info()}\\n\")\n\n# manually convert these to string columns\nX = X.astype(\n {\n \"name\": \"string\",\n \"ticket\": \"string\",\n \"cabin\": \"string\",\n \"boat\": \"string\",\n \"home.dest\": \"string\",\n }\n)\n\n# now *auto-sklearn* handles the string columns with its text feature preprocessing pipeline\n\nX_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(\n X, y, random_state=1\n)\n\ncls = autosklearn.classification.AutoSklearnClassifier(\n time_left_for_this_task=30,\n # Bellow two flags are provided to speed up calculations\n # Not recommended for a real implementation\n initial_configurations_via_metalearning=0,\n smac_scenario_args={\"runcount_limit\": 1},\n)\n\ncls.fit(X_train, y_train, X_test, y_test)\n\npredictions = cls.predict(X_test)\nprint(\"Accuracy score\", sklearn.metrics.accuracy_score(y_test, predictions))\n\n\nX, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True, as_frame=True)\nX = X.select_dtypes(exclude=[\"object\"])\n\nX_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(\n X, y, random_state=1\n)\n\ncls = autosklearn.classification.AutoSklearnClassifier(\n time_left_for_this_task=30,\n # Bellow two flags are provided to speed up calculations\n # Not recommended for a real implementation\n initial_configurations_via_metalearning=0,\n smac_scenario_args={\"runcount_limit\": 1},\n)\n\ncls.fit(X_train, y_train, X_test, y_test)\n\npredictions = cls.predict(X_test)\nprint(\n \"Accuracy score without text preprocessing\",\n sklearn.metrics.accuracy_score(y_test, predictions),\n)"
47+
"cats = [\"comp.sys.ibm.pc.hardware\", \"rec.sport.baseball\"]\nX_train, y_train = fetch_20newsgroups(\n subset=\"train\", # select train set\n shuffle=True, # shuffle the data set for unbiased validation results\n random_state=42, # set a random seed for reproducibility\n categories=cats, # select only 2 out of 20 labels\n return_X_y=True, # 20NG dataset consists of 2 columns X: the text data, y: the label\n) # load these two columns separately as numpy arrays\n\nX_test, y_test = fetch_20newsgroups(\n subset=\"test\", # select test set for unbiased evaluation\n categories=cats, # select only 2 out of 20 labels\n return_X_y=True, # 20NG dataset consists of 2 columns X: the text data, y: the label\n) # load these two columns separately as numpy arrays"
48+
]
49+
},
50+
{
51+
"cell_type": "markdown",
52+
"metadata": {},
53+
"source": [
54+
"## Creating a pandas dataframe\nBoth categorical and text features are often strings. pandas stores Python strings\nin the generic `object` type. Please ensure that the correct\n`dtype <https://pandas.pydata.org/docs/user_guide/basics.html#dtypes>`_ is applied to the correct\ncolumn.\n\n"
55+
]
56+
},
57+
{
58+
"cell_type": "code",
59+
"execution_count": null,
60+
"metadata": {
61+
"collapsed": false
62+
},
63+
"outputs": [],
64+
"source": [
65+
"# create a pandas dataframe for training, labeling the \"Text\" column as string\nX_train = pd.DataFrame({\"Text\": pd.Series(X_train, dtype=\"string\")})\n\n# create a pandas dataframe for testing, labeling the \"Text\" column as string\nX_test = pd.DataFrame({\"Text\": pd.Series(X_test, dtype=\"string\")})"
66+
]
67+
},
68+
{
69+
"cell_type": "markdown",
70+
"metadata": {},
71+
"source": [
72+
"## Build and fit a classifier\n\n"
73+
]
74+
},
75+
{
76+
"cell_type": "code",
77+
"execution_count": null,
78+
"metadata": {
79+
"collapsed": false
80+
},
81+
"outputs": [],
82+
"source": [
83+
"# create an autosklearn Classifier or Regressor depending on your task at hand.\nautoml = autosklearn.classification.AutoSklearnClassifier(\n time_left_for_this_task=60,\n per_run_time_limit=30,\n tmp_folder=\"/tmp/autosklearn_text_example_tmp\",\n)\n\nautoml.fit(X_train, y_train, dataset_name=\"20_Newsgroups\") # fit the automl model"
84+
]
85+
},
86+
{
87+
"cell_type": "markdown",
88+
"metadata": {},
89+
"source": [
90+
"## View the models found by auto-sklearn\n\n"
91+
]
92+
},
93+
{
94+
"cell_type": "code",
95+
"execution_count": null,
96+
"metadata": {
97+
"collapsed": false
98+
},
99+
"outputs": [],
100+
"source": [
101+
"print(automl.leaderboard())"
102+
]
103+
},
104+
{
105+
"cell_type": "markdown",
106+
"metadata": {},
107+
"source": [
108+
"## Print the final ensemble constructed by auto-sklearn\n\n"
109+
]
110+
},
111+
{
112+
"cell_type": "code",
113+
"execution_count": null,
114+
"metadata": {
115+
"collapsed": false
116+
},
117+
"outputs": [],
118+
"source": [
119+
"pprint(automl.show_models(), indent=4)"
120+
]
121+
},
122+
{
123+
"cell_type": "markdown",
124+
"metadata": {},
125+
"source": [
126+
"## Get the Score of the final ensemble\n\n"
127+
]
128+
},
129+
{
130+
"cell_type": "code",
131+
"execution_count": null,
132+
"metadata": {
133+
"collapsed": false
134+
},
135+
"outputs": [],
136+
"source": [
137+
"predictions = automl.predict(X_test)\nprint(\"Accuracy score:\", sklearn.metrics.accuracy_score(y_test, predictions))"
48138
]
49139
}
50140
],
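Note on the "Creating a pandas dataframe" cell added above: the dtype distinction it describes can be sketched on toy data. The column names and sentences below are hypothetical and are not part of this commit; the example only shows how `object` columns can be given explicit `string` and `category` dtypes with `DataFrame.astype`, under the assumption that text columns should be `string` and categorical columns `category`, as the removed description stated.

```python
import pandas as pd

# Hypothetical toy frame (not from 20 Newsgroups). Both columns are created
# with the generic object dtype to mimic untyped string data.
raw = pd.DataFrame(
    {
        "Text": pd.Series(
            ["the pitcher threw a fastball", "my IDE drive is not detected"],
            dtype=object,
        ),
        "Source": pd.Series(["forum", "mail"], dtype=object),
    }
)

# Mark the free-text column as "string" and the categorical one as "category".
typed = raw.astype({"Text": "string", "Source": "category"})

print(raw.dtypes)
print(typed.dtypes)
```

Printing both sets of dtypes before fitting is a cheap way to check that each column carries the dtype you intended.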
Lines changed: 59 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -1,79 +1,84 @@
11
# -*- encoding: utf-8 -*-
22
"""
33
==================
4-
Text Preprocessing
4+
Text preprocessing
55
==================
6-
This example shows, how to use text features in *auto-sklearn*. *auto-sklearn* can automatically
7-
encode text features if they are provided as string type in a pandas dataframe.
86
9-
For processing text features you need a pandas dataframe and set the desired
10-
text columns to string and the categorical columns to category.
7+
The following example shows how to fit a simple NLP problem with
8+
*auto-sklearn*.
119
12-
*auto-sklearn* text embedding creates a bag of words count.
10+
For an introduction to text preprocessing you can follow these links:
11+
1. https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
12+
2. https://machinelearningmastery.com/clean-text-machine-learning-python/
1313
"""
14+
from pprint import pprint
15+
16+
import pandas as pd
1417
import sklearn.metrics
15-
import sklearn.datasets
18+
from sklearn.datasets import fetch_20newsgroups
19+
1620
import autosklearn.classification
1721

1822
############################################################################
1923
# Data Loading
2024
# ============
25+
cats = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"]
26+
X_train, y_train = fetch_20newsgroups(
27+
subset="train", # select train set
28+
shuffle=True, # shuffle the data set for unbiased validation results
29+
random_state=42, # set a random seed for reproducibility
30+
categories=cats, # select only 2 out of 20 labels
31+
return_X_y=True, # 20NG dataset consists of 2 columns X: the text data, y: the label
32+
) # load these two columns separately as numpy arrays
33+
34+
X_test, y_test = fetch_20newsgroups(
35+
subset="test", # select test set for unbiased evaluation
36+
categories=cats, # select only 2 out of 20 labels
37+
return_X_y=True, # 20NG dataset consists of 2 columns X: the text data, y: the label
38+
) # load these two columns separately as numpy arrays
2139

22-
X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True)
23-
24-
# by default, the columns which should be strings are not formatted as such
25-
print(f"{X.info()}\n")
26-
27-
# manually convert these to string columns
28-
X = X.astype(
29-
{
30-
"name": "string",
31-
"ticket": "string",
32-
"cabin": "string",
33-
"boat": "string",
34-
"home.dest": "string",
35-
}
36-
)
40+
############################################################################
41+
# Creating a pandas dataframe
42+
# ===========================
43+
# Both categorical and text features are often strings. pandas stores Python strings
44+
# in the generic `object` type. Please ensure that the correct
45+
# `dtype <https://pandas.pydata.org/docs/user_guide/basics.html#dtypes>`_ is applied to the correct
46+
# column.
3747

38-
# now *auto-sklearn* handles the string columns with its text feature preprocessing pipeline
48+
# create a pandas dataframe for training, labeling the "Text" column as string
49+
X_train = pd.DataFrame({"Text": pd.Series(X_train, dtype="string")})
3950

40-
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
41-
X, y, random_state=1
42-
)
51+
# create a pandas dataframe for testing, labeling the "Text" column as string
52+
X_test = pd.DataFrame({"Text": pd.Series(X_test, dtype="string")})
4353

44-
cls = autosklearn.classification.AutoSklearnClassifier(
45-
time_left_for_this_task=30,
46-
# Bellow two flags are provided to speed up calculations
47-
# Not recommended for a real implementation
48-
initial_configurations_via_metalearning=0,
49-
smac_scenario_args={"runcount_limit": 1},
54+
############################################################################
55+
# Build and fit a classifier
56+
# ==========================
57+
58+
# create an autosklearn Classifier or Regressor depending on your task at hand.
59+
automl = autosklearn.classification.AutoSklearnClassifier(
60+
time_left_for_this_task=60,
61+
per_run_time_limit=30,
62+
tmp_folder="/tmp/autosklearn_text_example_tmp",
5063
)
5164

52-
cls.fit(X_train, y_train, X_test, y_test)
53-
54-
predictions = cls.predict(X_test)
55-
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))
65+
automl.fit(X_train, y_train, dataset_name="20_Newsgroups") # fit the automl model
5666

67+
############################################################################
68+
# View the models found by auto-sklearn
69+
# =====================================
5770

58-
X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True, as_frame=True)
59-
X = X.select_dtypes(exclude=["object"])
71+
print(automl.leaderboard())
6072

61-
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
62-
X, y, random_state=1
63-
)
73+
############################################################################
74+
# Print the final ensemble constructed by auto-sklearn
75+
# ====================================================
6476

65-
cls = autosklearn.classification.AutoSklearnClassifier(
66-
time_left_for_this_task=30,
67-
# Bellow two flags are provided to speed up calculations
68-
# Not recommended for a real implementation
69-
initial_configurations_via_metalearning=0,
70-
smac_scenario_args={"runcount_limit": 1},
71-
)
77+
pprint(automl.show_models(), indent=4)
7278

73-
cls.fit(X_train, y_train, X_test, y_test)
79+
###########################################################################
80+
# Get the Score of the final ensemble
81+
# ===================================
7482

75-
predictions = cls.predict(X_test)
76-
print(
77-
"Accuracy score without text preprocessing",
78-
sklearn.metrics.accuracy_score(y_test, predictions),
79-
)
83+
predictions = automl.predict(X_test)
84+
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))
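Note on the final step of the script: with its default arguments, `sklearn.metrics.accuracy_score` reports the fraction of test samples whose predicted label equals the true label. The sketch below reimplements that fraction in plain Python for illustration only. The helper `accuracy` and the toy label lists are hypothetical and are not part of this commit or of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where prediction and label agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of empty inputs")
    # Count matching positions and divide by the number of samples.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 0 = comp.sys.ibm.pc.hardware, 1 = rec.sport.baseball
y_test_demo = [0, 1, 1, 0, 1]
predictions_demo = [0, 1, 0, 0, 1]

print("Accuracy score:", accuracy(y_test_demo, predictions_demo))
```

Four of the five toy predictions match, so the helper returns 0.8.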
889 Bytes
Binary file not shown.
1.68 KB
Binary file not shown.
596 Bytes
-1.1 KB
-1.13 KB
