Conversation

@MarcBresson

Description

Add a feature to encode class labels automatically when they are not already in the format XGBoost expects (consecutive integers starting at 0).

Current behaviour

from sklearn.datasets import make_classification
import numpy as np
from xgboost import XGBClassifier

X, y = make_classification(n_samples=100, n_features=20, n_informative=10, n_redundant=10, n_classes=3, random_state=42)

labels = np.array(["class 0", "class 1", "class 2"])
y_named = labels[y]
model = XGBClassifier()
model.fit(X, y_named)

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 10
      8 y_named = labels[y]
      9 model = XGBClassifier()
---> 10 model.fit(X, y_named)

File ~/Documents/xgboost/.venv/lib/python3.13/site-packages/xgboost/core.py:729, in require_keyword_args.<locals>.throw_if.<locals>.inner_f(*args, **kwargs)
    727 for k, arg in zip(sig.parameters, args):
    728     kwargs[k] = arg
--> 729 return func(**kwargs)

File ~/Documents/xgboost/.venv/lib/python3.13/site-packages/xgboost/sklearn.py:1641, in XGBClassifier.fit(self, X, y, sample_weight, base_margin, eval_set, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights)
   1636     expected_classes = self.classes_
   1637 if (
   1638     classes.shape != expected_classes.shape
   1639     or not (classes == expected_classes).all()
   1640 ):
-> 1641     raise ValueError(
   1642         f"Invalid classes inferred from unique values of `y`.  "
   1643         f"Expected: {expected_classes}, got {classes}"
   1644     )
   1646 params = self.get_xgb_params()
   1648 if callable(self.objective):

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2], got ['class 0' 'class 1' 'class 2']

New behaviour

from sklearn.datasets import make_classification
import numpy as np
from xgboost import XGBClassifier

X, y = make_classification(n_samples=100, n_features=20, n_informative=10, n_redundant=10, n_classes=3, random_state=42)

labels = np.array(["class 0", "class 1", "class 2"])
y_named = labels[y]
model = XGBClassifier()
model.fit(X, y_named)

No error is raised; the fit completes successfully.
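Predictions would then come back as the original string labels (as demonstrated in the serialization demo further down). A quick check on the fitted model above, as a sketch:

predictions = model.predict(X)
print(predictions[:5])                # entries come back as the string labels, e.g. 'class 2'
print(model.predict_proba(X).shape)   # (100, 3); soft predictions are expected to be unchanged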

@trivialfis trivialfis self-requested a review October 6, 2025 15:45
@trivialfis
Member

Thank you for the feature addition. For the sklearn interface, we need to consider some other things for consistency:

  • Model serialization. Is the model still valid if it's saved and loaded?
  • Custom objective/metrics.

Related: #11256

@MarcBresson
Author

Hello

Model serialization. Is the model still valid if it's saved and loaded?

In my testing, I had no issue saving and loading the model.

Custom objective/metrics

Indeed, this can be a blocker. If users need to rely on the encoded values, they can still do their own class encoding before passing the labels to XGBoost (see the sketch below). Most of the time, though, it is easy to make a custom objective or metric compatible with class labels, or to work with soft predictions instead.
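The manual encoding mentioned above looks roughly like this (a sketch reusing X and y_named from the example above; LabelEncoder is one possible choice, not part of this PR):

from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Encode the string labels to integers 0..n_classes-1 so that a custom
# objective/metric sees the encoded values it expects.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_named)

model = XGBClassifier()
model.fit(X, y_encoded)

# Map integer predictions back to the original string labels.
predictions = encoder.inverse_transform(model.predict(X))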

@trivialfis
Member

According to my testing, I had no issue saving and loading up the model

It's about the encoder still being valid after the model is loaded: the output of the prediction function needs to be the original labels.

@MarcBresson
Author

Here is a demo:

from sklearn.datasets import make_classification
import numpy as np
from xgboost import XGBClassifier

X, y = make_classification(n_samples=100, n_features=20, n_informative=10, n_redundant=10, n_classes=3, random_state=42)

labels = np.array(["class 0", "class 1", "class 2"])
y_named = labels[y]
model = XGBClassifier()
model.fit(X, y_named)

import joblib

# Pickle the fitted estimator, reload it, and check that predictions
# still come back as the original string labels.
joblib.dump(model, "xgb_multiclass_model.joblib")

loaded_model = joblib.load("xgb_multiclass_model.joblib")
predictions = loaded_model.predict(X)
print(predictions)

Output:

['class 2' 'class 1' 'class 2' 'class 2' 'class 2' 'class 1' 'class 2'
 'class 2' 'class 0' 'class 0' 'class 2' 'class 2' 'class 2' 'class 1'
 'class 0' 'class 1' 'class 1' 'class 0' 'class 2' 'class 0' 'class 1'
 'class 0' 'class 2' 'class 0' 'class 1' 'class 2' 'class 0' 'class 2'
 'class 1' 'class 2' 'class 0' 'class 1' 'class 0' 'class 0' 'class 2'
 'class 1' 'class 0' 'class 0' 'class 1' 'class 0' 'class 1' 'class 2'
 'class 0' 'class 2' 'class 2' 'class 1' 'class 2' 'class 0' 'class 1'
 'class 2' 'class 1' 'class 2' 'class 1' 'class 2' 'class 1' 'class 1'
 'class 1' 'class 1' 'class 2' 'class 2' 'class 0' 'class 0' 'class 2'
 'class 0' 'class 2' 'class 2' 'class 0' 'class 0' 'class 2' 'class 2'
 'class 0' 'class 0' 'class 1' 'class 2' 'class 1' 'class 1' 'class 0'
 'class 0' 'class 1' 'class 1' 'class 1' 'class 0' 'class 1' 'class 2'
 'class 1' 'class 0' 'class 1' 'class 1' 'class 0' 'class 0' 'class 0'
 'class 2' 'class 0' 'class 0' 'class 0' 'class 1' 'class 1' 'class 1'
 'class 2' 'class 0']

@trivialfis
Member

Thank you for sharing. I meant the save_model method. The pickled estimator is not stable across XGBoost, sklearn, and Python versions.
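For comparison, the native-format round trip in question would look like this (a sketch reusing model and X from the demo above; the file name is illustrative):

# Native (stable) serialization instead of pickling.
model.save_model("xgb_multiclass_model.json")

loaded_model = XGBClassifier()
loaded_model.load_model("xgb_multiclass_model.json")

# The open question: does this still return the original string labels,
# or does it fall back to the encoded integers 0, 1, 2?
print(loaded_model.predict(X))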

@MarcBresson
Author

Indeed, there is nothing related to class names inside the saved JSON file.

What is your policy for adding new attributes to that JSON file?
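For reference, the saved file can be inspected directly to see which attributes it currently carries (a sketch, assuming the model was saved to xgb_multiclass_model.json as in the earlier sketch):

import json

with open("xgb_multiclass_model.json") as fh:
    model_json = json.load(fh)

# Top-level attributes currently stored in the file; class-name information
# would have to be added somewhere in here to survive a save/load round trip.
print(list(model_json.keys()))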


@trivialfis
Member

It's more of a case-by-case issue. But in general, this feature needs a lot more consideration than simply adding an encoder.
