Commit 0c49bfb

SLEP005: Outlier Rejection API
1 parent ca5d9f6 commit 0c49bfb

2 files changed

+99
-0
lines changed

index.rst

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@
2626
slep002/proposal
2727
slep003/proposal
2828
slep004/proposal
29+
slep005/proposal
2930

3031
.. toctree::
3132
:maxdepth: 1

slep005/proposal.rst

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
1+
.. _slep_005:
2+
3+
=====================
4+
Outlier rejection API
5+
=====================
6+
7+
:Author: Oliver Raush ([email protected]), Guillaume Lemaitre ([email protected])
8+
:Status: Draft
9+
:Type: Standards Track
10+
:Created: 2019-03-01
11+
:Resolution: <url>
12+
13+
Abstract
14+
--------
15+
16+
We propose a new mixin ``OutlierRejectionMixin`` implementing a
17+
``fit_resample(X, y)`` method. This method will remove samples from
18+
``X`` and ``y`` to get an outlier-free dataset. This method is also
19+
handled in ``Pipeline``.
20+
21+
Detailed description
22+
--------------------
23+
24+
Fitting a machine learning model on an outlier-free dataset can be
25+
beneficial. Currently, the family of outlier detection algorithms
26+
allows detecting outliers using ``estimator.fit_predict(X, y)``. However,
27+
there is no mechanism to remove outliers without a manual step, and
28+
removal is not possible at all when a ``Pipeline`` is used.
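
The manual step mentioned above can be sketched as follows. This is only an illustration of the current workflow: the toy data and the variable names (``detector``, ``labels``, ``inliers``) are assumptions made for the example and do not come from the proposal.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data (assumed for illustration): a Gaussian cloud plus two distant points.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0], [-9.0, 9.0]]])
y = rng.randint(0, 2, size=X.shape[0])

detector = IsolationForest(random_state=0)
labels = detector.fit_predict(X)  # 1 for inliers, -1 for outliers

# Manual step: build a boolean mask and filter X and y by hand.
inliers = labels == 1
X_free, y_free = X[inliers], y[inliers]
```

Because the filtering happens outside the estimator, it cannot be expressed as a ``Pipeline`` step, which only forwards a transformed ``X``.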
29+
30+
We propose the following changes:
31+
32+
* implement an ``OutlierRejectionMixin``;
33+
* this mixin adds a method ``fit_resample(X, y)`` removing outliers
34+
from ``X`` and ``y``;
35+
* ``fit_resample`` should be handled in ``Pipeline``.
36+
37+
Implementation
38+
--------------
39+
40+
API changes are implemented in
41+
https://github.com/scikit-learn/scikit-learn/pull/13269
42+
43+
Estimator implementation
44+
........................
45+
46+
The new mixin is implemented as::
47+
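    from sklearn.utils import safe_mask

    class OutlierRejectionMixin:
        _estimator_type = "outlier_rejector"

        def fit_resample(self, X, y):
            # fit_predict labels inliers as 1 and outliers as -1
            inliers = self.fit_predict(X) == 1
            # safe_mask returns a mask, which is then used to index the data
            return X[safe_mask(X, inliers)], y[safe_mask(y, inliers)]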
53+
54+
This will be used as follows for the outlier detection algorithms::
55+
    class IsolationForest(BaseBagging, OutlierMixin, OutlierRejectionMixin):
        ...
58+
59+
One can use the new algorithm with::
60+
    from sklearn.ensemble import IsolationForest

    estimator = IsolationForest()
    X_free, y_free = estimator.fit_resample(X, y)
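
Since released versions of scikit-learn do not ship this mixin, the usage can be tried with a small stand-in. The sketch below is not the implementation from the pull request: ``RejectingIsolationForest`` is a hypothetical subclass, the toy data is made up, and the mixin indexes ``X`` and ``y`` with the mask returned by ``safe_mask``. The ``_estimator_type`` attribute is left out to keep the example minimal.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.utils import safe_mask


class OutlierRejectionMixin:
    """Sketch of the proposed mixin; not part of released scikit-learn."""

    def fit_resample(self, X, y):
        # fit_predict labels inliers as 1 and outliers as -1.
        inliers = self.fit_predict(X) == 1
        # safe_mask returns a mask usable on its first argument.
        return X[safe_mask(X, inliers)], y[safe_mask(y, inliers)]


class RejectingIsolationForest(OutlierRejectionMixin, IsolationForest):
    """Hypothetical subclass used only for this illustration."""


# Toy data (assumed): a Gaussian cloud plus two distant points.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0], [-9.0, 9.0]]])
# Using row indices as targets makes it easy to check that X and y stay aligned.
y = np.arange(X.shape[0])

estimator = RejectingIsolationForest(random_state=0)
X_free, y_free = estimator.fit_resample(X, y)
```

Here ``y_free`` holds the indices of the rows that were kept, so ``X[y_free]`` reproduces ``X_free``.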
64+
65+
Pipeline implementation
66+
.......................
67+
68+
To handle outlier rejectors in ``Pipeline``, we enforce the following:
69+
70+
* an estimator cannot implement both ``fit_resample(X, y)`` and
71+
``fit_transform(X)`` / ``transform(X)``.
72+
* ``fit_predict(X)`` (i.e., clustering methods) should not be called if an
73+
outlier rejector is in the pipeline.
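
One possible reading of the first rule is sketched below with toy classes. Everything here is hypothetical: ``check_step``, ``fit_intermediate_steps``, ``ClipRejector``, ``Centerer`` and ``Ambiguous`` are names invented for the example, and the loop is not the ``Pipeline`` code from the pull request. The sketch assumes that, at fit time, a rejector may shrink both ``X`` and ``y`` while a transformer changes ``X`` only. The second rule (about ``fit_predict``) is not covered.

```python
import numpy as np


def check_step(step):
    """Return True for a resampler; reject steps exposing both APIs."""
    resamples = hasattr(step, "fit_resample")
    transforms = hasattr(step, "fit_transform") or hasattr(step, "transform")
    if resamples and transforms:
        raise TypeError(
            f"{type(step).__name__} implements both fit_resample and "
            "fit_transform/transform"
        )
    return resamples


def fit_intermediate_steps(steps, X, y):
    """Run the intermediate steps at fit time and return the data that
    would be handed to the final estimator."""
    for step in steps:
        if check_step(step):
            # An outlier rejector may drop rows from both X and y.
            X, y = step.fit_resample(X, y)
        else:
            # A transformer changes X only; y keeps its length.
            X = step.fit_transform(X, y)
    return X, y


class ClipRejector:
    """Toy rejector: keeps rows whose first feature lies within [-3, 3]."""

    def fit_resample(self, X, y):
        keep = np.abs(X[:, 0]) <= 3
        return X[keep], y[keep]


class Centerer:
    """Toy transformer: subtracts the column means."""

    def fit_transform(self, X, y=None):
        return X - X.mean(axis=0)


class Ambiguous(ClipRejector, Centerer):
    """Violates the first rule by exposing both APIs."""


X = np.array([[0.0], [1.0], [2.0], [50.0]])
y = np.array([0, 1, 0, 1])
X_fit, y_fit = fit_intermediate_steps([ClipRejector(), Centerer()], X, y)
```

The row with value ``50.0`` is dropped before centering, so the transformer and any later step see three samples; passing ``Ambiguous()`` to ``check_step`` raises ``TypeError``.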
74+
75+
Backward compatibility
76+
----------------------
77+
78+
There are no backward incompatibilities with the current API.
79+
80+
Discussion
81+
----------
82+
83+
* https://github.com/scikit-learn/scikit-learn/pull/13269
84+
85+
References and Footnotes
86+
------------------------
87+
88+
.. [1] Each SLEP must either be explicitly labeled as placed in the public
89+
domain (see this SLEP as an example) or licensed under the `Open
90+
Publication License`_.
91+
92+
.. _Open Publication License: https://www.opencontent.org/openpub/
93+
94+
95+
Copyright
96+
---------
97+
98+
This document has been placed in the public domain. [1]_
