Closed
Description
Describe the bug
When the dataset contains very rare classes and is large enough to be subsampled, predict_proba can return a probability matrix with fewer columns than there are classes in the training data.
To Reproduce
import numpy as np
from autosklearn.experimental.askl2 import AutoSklearn2Classifier
x = np.random.random(size=(60_000_017, 10))
y = np.asarray([1]*30_000_000 + [2]*30_000_000 + list(range(3,20)))
aml = AutoSklearn2Classifier(time_left_for_this_task=60, memory_limit=10_000)
aml.fit(x, y)
predictions = aml.predict(x)
probabilities = aml.predict_proba(x)
print(probabilities.shape)
(60000017, 5)
Alternatively, though much more slowly, the issue can be reproduced with the automl benchmark on KDDCup:
python runbenchmark.py autosklearn2:latest openml/t/360112 1h8c -f 5 -m docker -s force
Expected behavior
The number of columns in the probability matrix should match the number of classes in the training data:
(60000017, 19)
Alternatively, there should be a way to tell which column belongs to which class, and for which classes no predictions are made.
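To illustrate the second option, here is a minimal sketch of how a reduced probability matrix could be expanded to one column per training class. It assumes the labels the fitted model actually saw are available in column order (for example through a scikit-learn style `classes_` attribute; whether AutoSklearn2Classifier exposes that here is an assumption). `expand_proba` is a hypothetical helper, not part of auto-sklearn.

```python
import numpy as np


def expand_proba(proba, seen_classes, all_classes):
    """Return a matrix with one column per label in `all_classes`.

    Hypothetical helper: columns of `proba` are assumed to follow the
    order of `seen_classes`; classes the model never saw get probability 0.
    """
    proba = np.asarray(proba)
    # Map every training label to its column in the full matrix.
    column_of = {label: i for i, label in enumerate(np.asarray(all_classes).tolist())}
    target = [column_of[label] for label in np.asarray(seen_classes).tolist()]
    full = np.zeros((proba.shape[0], len(column_of)), dtype=proba.dtype)
    full[:, target] = proba
    return full


# Toy example: the model only saw classes 1, 2 and 7 out of 1..19.
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
full = expand_proba(proba, seen_classes=[1, 2, 7], all_classes=list(range(1, 20)))
print(full.shape)
```

The all-zero columns then make it explicit which classes the model cannot predict.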
Actual behavior, stacktrace or logfile
(venv) root@486c0ae472af:/bench# python mwe.py
[WARNING] [2021-07-27 16:19:41,000:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Dataset too large for memory limit 10000MB, reducing the precision from float64 to <class 'numpy.float32'>
[WARNING] [2021-07-27 16:19:42,210:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Dataset too large for memory limit 10000MB, reducing number of samples from 60000017 to 13107200.
[WARNING] [2021-07-27 16:19:45,795:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Could not sample dataset in stratified manner, resorting to random sampling
Traceback (most recent call last):
  File "/bench/frameworks/autosklearn/lib/auto-sklearn/autosklearn/automl.py", line 940, in subsample_if_too_large
    stratify=y,
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 2197, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1387, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1715, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/smac/intensification/parallel_scheduling.py:152: UserWarning: SuccessiveHalving is intended to be used with more than 1 worker but num_workers=1
  num_workers
(60000017, 5)
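The 5 columns are in line with what the random-sampling fallback in the log would be expected to produce. As a rough back-of-the-envelope check, assuming the fallback draws rows uniformly without replacement (an assumption about the implementation, not something the log confirms), each single-member class survives with probability equal to the kept fraction:

```python
# Sample counts taken from the warning in the log above.
n_total = 60_000_017
n_kept = 13_107_200

n_singleton_classes = 17  # classes 3..19, one row each
n_large_classes = 2       # classes 1 and 2, 30 million rows each

# Under uniform sampling without replacement, any given row is kept
# with probability n_kept / n_total.
keep_fraction = n_kept / n_total

# Expected number of single-member classes that remain in the subsample.
expected_singletons = n_singleton_classes * keep_fraction

# The two large classes are effectively certain to survive.
expected_columns = n_large_classes + expected_singletons

print(f"kept fraction: {keep_fraction:.4f}")
print(f"expected number of classes after subsampling: {expected_columns:.2f}")
```

Under these assumptions, a little under six classes are expected to remain, so a run in which only three of the rare classes survive and `predict_proba` returns 5 columns is unsurprising.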
Environment and installation:
Please give details about your installation:
- OS: Debian 10 in docker hosted by Windows 10
- virtual environment
- Python version: 3.7.11
- Auto-sklearn version: development (11afae22b8c9a6309d2b6fcf7cfb9a947711cd1e)