.. _slep_005:

=====================
Outlier rejection API
=====================

:Author: Oliver Rausch ([email protected]), Guillaume Lemaitre ([email protected])
:Status: Draft
:Type: Standards Track
:Created: 2019-03-01
:Resolution: <url>

Abstract
--------

We propose a new mixin ``OutlierRejectionMixin`` implementing a
``fit_resample(X, y)`` method. This method removes samples from
``X`` and ``y`` to produce an outlier-free dataset. The method is also
handled by ``Pipeline``.

Detailed description
--------------------

Fitting a machine learning model on an outlier-free dataset can be
beneficial. Currently, the family of outlier detection algorithms
allows detecting outliers using ``estimator.fit_predict(X, y)``. However,
there is no mechanism to remove the detected outliers without a manual
step, and it is impossible altogether when a ``Pipeline`` is used.

We propose the following changes:

* implement an ``OutlierRejectionMixin``;
* this mixin adds a method ``fit_resample(X, y)`` removing outliers
  from ``X`` and ``y``;
* ``fit_resample`` should be handled by ``Pipeline``.

Implementation
--------------

API changes are implemented in
https://github.com/scikit-learn/scikit-learn/pull/13269

Estimator implementation
........................

The new mixin is implemented as::

    from sklearn.utils import safe_mask

    class OutlierRejectionMixin:
        _estimator_type = "outlier_rejector"

        def fit_resample(self, X, y):
            # fit_predict returns +1 for inliers and -1 for outliers;
            # safe_mask makes the boolean mask safe for sparse inputs.
            inliers = self.fit_predict(X) == 1
            return X[safe_mask(X, inliers)], y[safe_mask(y, inliers)]

This will be used as follows for the outlier detection algorithms::

    class IsolationForest(BaseBagging, OutlierMixin, OutlierRejectionMixin):
        ...

One can use the new algorithm with::

    from sklearn.ensemble import IsolationForest

    estimator = IsolationForest()
    X_free, y_free = estimator.fit_resample(X, y)
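
Since the PR is not yet merged, the behaviour can be previewed today by mixing
the proposed mixin into an existing detector. ``OutlierRejectingIsolationForest``
below is a hypothetical name used only for this sketch; the PR would instead add
the mixin to ``IsolationForest`` itself::

    import numpy as np

    from sklearn.ensemble import IsolationForest
    from sklearn.utils import safe_mask

    class OutlierRejectionMixin:
        _estimator_type = "outlier_rejector"

        def fit_resample(self, X, y):
            # fit_predict returns +1 for inliers and -1 for outliers.
            inliers = self.fit_predict(X) == 1
            return X[safe_mask(X, inliers)], y[safe_mask(y, inliers)]

    # Hypothetical subclass, for illustration only.
    class OutlierRejectingIsolationForest(OutlierRejectionMixin, IsolationForest):
        pass

    # 100 Gaussian samples plus one obvious outlier.
    rng = np.random.RandomState(42)
    X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[10.0, 10.0]]])
    y = np.zeros(len(X))

    estimator = OutlierRejectingIsolationForest(contamination=0.05, random_state=0)
    X_free, y_free = estimator.fit_resample(X, y)

``fit_resample`` filters both arrays consistently, so ``X_free`` and ``y_free``
always keep the same number of samples.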

Pipeline implementation
.......................

To handle outlier rejectors in ``Pipeline``, we enforce the following:

* an estimator cannot implement both ``fit_resample(X, y)`` and
  ``fit_transform(X)`` / ``transform(X)``;
* ``fit_predict(X)`` (i.e., clustering methods) should not be called if an
  outlier rejector is in the pipeline.
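
The dispatch rule implied by these constraints can be sketched with a toy,
standalone fitting loop; ``fit_steps`` and ``NormRejector`` are hypothetical
helpers written for illustration, not part of the PR::

    import numpy as np

    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    class NormRejector:
        """Toy rejector: drop the 5% of samples with the largest norm."""

        def fit_resample(self, X, y):
            norms = np.linalg.norm(X, axis=1)
            keep = norms <= np.quantile(norms, 0.95)
            return X[keep], y[keep]

    def fit_steps(steps, X, y):
        # A resampler shrinks X and y together; a transformer only maps X.
        for step in steps[:-1]:
            if hasattr(step, "fit_resample"):
                X, y = step.fit_resample(X, y)
            else:
                X = step.fit_transform(X, y)
        steps[-1].fit(X, y)
        return steps[-1]

    rng = np.random.RandomState(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] > 0).astype(int)
    clf = fit_steps([StandardScaler(), NormRejector(), LogisticRegression()], X, y)

Because an estimator may expose either ``fit_resample`` or ``fit_transform``
but never both, the ``hasattr`` dispatch above is unambiguous.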

Backward compatibility
----------------------

There are no backward incompatibilities with the current API.

Discussion
----------

* https://github.com/scikit-learn/scikit-learn/pull/13269

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_