auto-sklearn produces probability matrix inconsistent with training input #1190

@PGijsbers

Describe the bug

When the dataset has outliers (here, classes with only a single sample each) and is large enough to be subsampled, `predict_proba` can return a probability matrix with fewer columns than there are classes in the training data.

To Reproduce

import numpy as np
from autosklearn.experimental.askl2 import AutoSklearn2Classifier

x = np.random.random(size=(60_000_017, 10))
y = np.asarray([1]*30_000_000 + [2]*30_000_000 + list(range(3,20)))

aml = AutoSklearn2Classifier(time_left_for_this_task=60, memory_limit=10_000)
aml.fit(x, y)
predictions = aml.predict(x)
probabilities = aml.predict_proba(x)

print(probabilities.shape)

(60000017, 5)

Alternatively, though much slower, it can be reproduced with the automl benchmark on KDDCup:

python runbenchmark.py autosklearn2:latest openml/t/360112 1h8c -f 5 -m docker -s force

Expected behavior

The number of columns in the probability matrix should match the number of classes in the training data:

(60000017, 19)

Alternatively, there should be a way to tell which column belongs to which class, and for which classes no predictions have been made.
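
To illustrate the second option, here is a minimal sketch that pads a narrower probability matrix out to the full label set. It assumes the labels of the returned columns are known, for example through a scikit-learn-style `classes_` attribute; whether auto-sklearn exposes that for the fitted ensemble is an assumption here. `expand_proba` is a hypothetical helper, not part of auto-sklearn, and the example labels are made up.

```python
import numpy as np


def expand_proba(proba, seen_classes, all_classes):
    """Pad `proba` with zero columns for classes the model never saw.

    Hypothetical helper: `seen_classes` labels the columns of `proba`,
    `all_classes` is the full label set of the training data.
    """
    proba = np.asarray(proba, dtype=float)
    seen_classes = np.asarray(seen_classes)
    all_classes = np.unique(all_classes)  # sorted and de-duplicated

    if proba.shape[1] != len(seen_classes):
        raise ValueError("one label is required per probability column")
    if not np.isin(seen_classes, all_classes).all():
        raise ValueError("seen_classes must be a subset of all_classes")

    full = np.zeros((proba.shape[0], len(all_classes)), dtype=proba.dtype)
    # Position of each seen class within the sorted full label set.
    cols = np.searchsorted(all_classes, seen_classes)
    full[:, cols] = proba
    return full


# Small stand-in for the report: 19 training classes, 5 columns returned.
all_classes = np.arange(1, 20)
seen_classes = np.array([1, 2, 7, 11, 19])  # made-up example labels
proba = np.full((4, 5), 0.2)
print(expand_proba(proba, seen_classes, all_classes).shape)  # (4, 19)
```

Columns for unseen classes are filled with zeros, so each row still sums to 1 and the column order follows the sorted training labels.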

Actual behavior, stacktrace or logfile

(venv) root@486c0ae472af:/bench# python mwe.py
[WARNING] [2021-07-27 16:19:41,000:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Dataset too large for memory limit 10000MB, reducing the precision from float64 to <class 'numpy.float32'>
[WARNING] [2021-07-27 16:19:42,210:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Dataset too large for memory limit 10000MB, reducing number of samples from 60000017 to 13107200.
[WARNING] [2021-07-27 16:19:45,795:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Could not sample dataset in stratified manner, resorting to random sampling
Traceback (most recent call last):
  File "/bench/frameworks/autosklearn/lib/auto-sklearn/autosklearn/automl.py", line 940, in subsample_if_too_large
    stratify=y,
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 2197, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1387, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1715, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/smac/intensification/parallel_scheduling.py:152: UserWarning: SuccessiveHalving is intended to be used with more than 1 worker but num_workers=1
  num_workers
(60000017, 5)
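
The log suggests a likely cause: stratified subsampling fails because some classes have only one member, and the random-sampling fallback can then presumably drop those classes altogether. The sketch below shows one possible alternative, not auto-sklearn's actual behavior: sample at random but always keep at least one row of every class. `subsample_keep_all_classes` is a hypothetical name, and the data is a scaled-down version of the reproduction above.

```python
import numpy as np


def subsample_keep_all_classes(y, n_samples, seed=0):
    """Return sorted row indices of a random subsample that covers every class.

    Hypothetical helper for illustration; not auto-sklearn's implementation.
    """
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, first_idx = np.unique(y, return_index=True)
    if not len(classes) <= n_samples <= len(y):
        raise ValueError("n_samples must lie between n_classes and len(y)")

    # Reserve one row per class (its first occurrence) ...
    keep = first_idx
    # ... and fill the remainder uniformly at random from the other rows.
    rest = np.setdiff1d(np.arange(len(y)), keep)
    extra = rng.choice(rest, size=n_samples - len(keep), replace=False)
    return np.sort(np.concatenate([keep, extra]))


# Scaled-down version of the report: two large classes plus 17 singletons.
y = np.asarray([1] * 3_000 + [2] * 3_000 + list(range(3, 20)))
idx = subsample_keep_all_classes(y, n_samples=1_000)
print(len(idx), len(np.unique(y[idx])))  # 1000 19
```

With a subsample like this, every training class is still present after the size reduction, so a downstream model would at least see all 19 labels.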

Environment and installation:

Please give details about your installation:

  • OS: Debian 10 in docker hosted by Windows 10
  • virtual environment
  • Python version: 3.7.11
  • Auto-sklearn version: development (11afae22b8c9a6309d2b6fcf7cfb9a947711cd1e)
