
Commit 8fa62b3

Simply copy sources of blog posts
Sources are made for Pelican with the MathJax plugin. They need to be adapted.

Signed-off-by: Julien Jerphanion <[email protected]>
1 parent 8b75afe

File tree

5 files changed (+713 lines, -0 lines)

_posts/sklearn-perf-context.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
Title: Performance and scikit-learn (1/4)
Date: 2021-12-16
Category: scikit-learn
Slug: sklearn-perf-context
Lang: en
Authors: Julien Jerphanion
Summary: Context: the current state of scikit-learn performance
Status: Published

## High-level overview of the scikit-learn dependencies

scikit-learn is mainly written in Python and is built on top of
some core libraries of the scientific Python ecosystem.

This ecosystem allows _high expressiveness_ and
_interactivity_: one can perform complex operations in a few
lines of code and get the results straight away.

It also allowed setting up simple conventions which make the
algorithms in the code-base easy to understand and improve
for new contributors.

It also allows delegating most of the complex operations
to well-tested third-party libraries. For instance, calls
to functions implemented in
[`numpy.linalg`](https://numpy.org/doc/stable/reference/routines.linalg.html),
[`scipy.linalg`](https://docs.scipy.org/doc/scipy/reference/linalg.html), and
[`scipy.sparse.linalg`](https://docs.scipy.org/doc/scipy/reference/sparse.linalg.html)
are delegated to
[BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms),
[LAPACK](https://www.netlib.org/lapack/),
and [ARPACK](https://www.caam.rice.edu/software/ARPACK/) interfaces.
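
For instance, in the snippet below both factorizations are thin Python wrappers: the heavy lifting happens in compiled BLAS/LAPACK routines. (A minimal sketch; the exact LAPACK driver used, e.g. `*gesdd` for the SVD, depends on the build.)

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 100))

# Both calls delegate the actual computation to compiled LAPACK
# routines; Python only handles argument checking and dispatch.
U, s, Vt = linalg.svd(A, full_matrices=False)
Q, R = np.linalg.qr(A)
```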

## Main reasons for limited performance

The PyData stack is simple but is not tailored for optimal performance
for several reasons.

### CPython internals

CPython -- the main implementation of Python -- is slow.

First, CPython has _an interpreter_: there is a cost in converting Python
instructions into an intermediate representation --
the _byte-code_ -- and in executing the instructions by interpreting their
byte-code.

Secondly, nearly every value in CPython is _boxed_ into a `PyObject`
-- implemented as a C struct. As such, simple operations
(like adding two floats) come with a non-negligible dispatch
overhead, as the interpreter has to check the types, which are unknown
in advance.
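
The byte-code and this generic dispatch can be inspected with the standard `dis` module:

```python
import dis

def add(a, b):
    return a + b

# Disassemble the function: the addition compiles to a single generic
# op-code (BINARY_ADD, or BINARY_OP on CPython >= 3.11) that must check
# the runtime types of `a` and `b` on every call.
dis.dis(add)
```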

Thirdly, for memory management, CPython relies on a global
mutex on its interpreter called the _Global Interpreter Lock_ (GIL)[ref]For more information about the GIL, see
[this reference from the Python Wiki](https://wiki.python.org/moin/GlobalInterpreterLock).[/ref].
This mechanism comes in handy for reference counting but restricts computations in
most cases to sequential execution in a single thread, removing the benefit
of using threads.

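A quick experiment makes this visible: a pure-Python CPU-bound function run in two threads takes about as long as running it twice sequentially (an illustrative sketch; timings depend on the machine):

```python
import threading
import time

def busy(n):
    # CPU-bound pure-Python loop: it holds the GIL while running.
    while n:
        n -= 1

start = time.perf_counter()
busy(10_000_000)
busy(10_000_000)
print(f"sequential : {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
threads = [threading.Thread(target=busy, args=(10_000_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# On CPython this is close to the sequential time: the GIL serializes
# the two threads even on a multi-core machine.
print(f"two threads: {time.perf_counter() - start:.2f}s")
```
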
### Memory-hierarchy suboptimal implementations

`numpy` supports high-level operations but this comes with intermediate
and dynamically-allocated arrays.

Moreover, this pattern is inefficient from a memory perspective:
during the execution, blocks of data are moved back and forth
between the RAM and the different CPU caches several times, not
making optimal use of the caches.

For instance, consider this minimalistic toy example:

```python
import numpy as np

A = np.random.rand(100_000_000)
B = np.random.rand(100_000_000)

X = np.exp(A + B)
```

The following is performed:

- a first temporary array is allocated for `A + B`
- a second array is allocated to store `np.exp(A + B)` and
the first temporary array is discarded

This temporary allocation makes the implementation suboptimal
as memory allocation on the heap is slow.

Furthermore, high-level operations on `X` come with more data
moves between the RAM and the CPU than needed to compute the
elements of `X` and hardly make use of the memory hierarchy
and the size of the caches.

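A partial mitigation, staying within numpy, is to pre-allocate the output buffer and use the `out=` parameter of ufuncs so that no temporary is created (a sketch of the idea, not what scikit-learn does):

```python
import numpy as np

A = np.random.rand(100_000_000)
B = np.random.rand(100_000_000)

# Pre-allocate a single output buffer and reuse it for both steps:
# no intermediate array is allocated for `A + B`.
X = np.empty_like(A)
np.add(A, B, out=X)  # X = A + B
np.exp(X, out=X)     # X = exp(X), computed in place
```

Even then, each ufunc call still streams the whole buffer through the caches once, so the memory-hierarchy issue is only partially addressed.
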
### No "bare-metal" data-structures

The Python ecosystem comes with a few high-level containers
such as numpy arrays and pandas DataFrames.

Contrary to other languages' standard libraries (like the one of
C++), no "bare-metal" data structures -- such as heaps, or
memory-contiguous resizable buffers (as implemented in C++ by
[`std::priority_queue`](https://en.cppreference.com/w/cpp/container/priority_queue)
and [`std::vector`](https://en.cppreference.com/w/cpp/container/vector)) --
are available to implement some algorithms efficiently
from both a computational-complexity and a technical perspective.

## Cython: combining the conciseness of Python and the speed of C

In brief, Cython transpiles a superset of Python to C code and can call code written in C or C++, which makes bypassing some of CPython's internals possible. Moreover, Cython allows using [OpenMP](https://www.openmp.org/specifications/), an API that exposes lower-level parallelism primitives for implementations written in C or Fortran[ref]For more information on Cython, see [its documentation](https://cython.readthedocs.io/en/latest/).[/ref].
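
As a minimal sketch of what this looks like (assuming a build setup with OpenMP enabled; `row_sums` is a made-up example, not scikit-learn code), a typed loop can release the GIL and distribute its iterations over OpenMP threads with `prange`:

```cython
# cython: boundscheck=False, wraparound=False
import numpy as np
from cython.parallel import prange

def row_sums(double[:, ::1] X):
    # Typed memoryviews give direct, unboxed access to the underlying buffer.
    cdef Py_ssize_t i, j
    cdef Py_ssize_t n = X.shape[0]
    cdef Py_ssize_t d = X.shape[1]
    cdef double[::1] out = np.zeros(n)
    # prange releases the GIL and maps iterations onto OpenMP threads.
    for i in prange(n, nogil=True):
        for j in range(d):
            out[i] += X[i, j]
    return np.asarray(out)
```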

In most cases, the features provided by Cython are sufficient to reach optimal implementations for many scientific algorithms for which static task scheduling -- at the level of C via OpenMP -- is the most natural and optimal approach.
Plus, its syntax makes this language expressive enough to get nearly optimal performance while keeping the instructions short and concise -- a real advantage for developers coming from Python who are looking for performance, and a relief for C or C++ developers[ref]Compared to C or C++, Cython is also less verbose and can be integrated into the Python build system more easily.[/ref].

As such, many algorithms in `scikit-learn` are implemented in Cython for performance, some of which use OpenMP when possible. This is for instance the case of `KMeans`, which was initially written in Python using numpy and was rewritten in Cython by Jérémie du Boisberranger, improving the execution time by a factor of 5 for this algorithm[ref]For more information about `KMeans`, see the original contribution,
[`scikit-learn#11950`](https://github.com/scikit-learn/scikit-learn/pull/11950), and [this blog
post](https://scikit-learn.fondation-inria.fr/implementing-a-faster-kmeans-in-scikit-learn-0-23-2/).[/ref].

In the following posts, the case of $k$-nearest neighbors search -- the base routine
for `KNeighborsClassifier`, `KNeighborsRegressor` and other scikit-learn interfaces -- is covered
and a new Cython implementation is proposed.

---

## Notes

_posts/sklearn-perf-knn.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
Title: Performance and scikit-learn (2/4)
Date: 2021-12-17
Category: scikit-learn
Slug: sklearn-perf-knn
Lang: en
Authors: Julien Jerphanion
Summary: Hardware scalability issue: the k-neighbors search example
Status: Published

## $k$-nearest neighbors search in scikit-learn

$k$-nearest neighbors search is at the base of many implementations used within scikit-learn.

For instance, it is used in Affinity Propagation, BIRCH, Mean Shift, OPTICS,
Spectral Clustering, t-SNE, KNeighborsRegressor, and KNeighborsClassifier.

Whilst many libraries implement approximate versions of $k$-nearest neighbors search to speed up
the execution time[ref]Approximate nearest neighbors search algorithms come in many different
flavors, and there's even [a benchmark suite comparing them!](https://ann-benchmarks.com/).[/ref], scikit-learn's implementation aims at returning the exact $k$-nearest neighbors.

## Computing chunks of the distance matrix

The general steps for $k$-nearest neighbors search are:

- Compute $\mathbf{D}_d(\mathbf{X}, \mathbf{Y})$, the distance matrix between the vectors of two
arrays $\mathbf{X}$ and $\mathbf{Y}$.
- Reduce rows of $\mathbf{D}_d(\mathbf{X}, \mathbf{Y})$ appropriately for the given algorithm:
for instance, the adapted reduction for $k$-nn search is to return, for each row, the indices of its $k$ smallest values.
In what follows, we call this reduction $\texttt{argkmin}$.
- Perform extra operations on the results of the reductions (here, sorting the values).

Generally, one does not compute $\mathbf{D}_d(\mathbf{X}, \mathbf{Y})$ entirely because its
space complexity is $\Theta(n_x \times n_y)$. Practically,
$\mathbf{D}_d(\mathbf{X}, \mathbf{Y})$ does not fit in RAM for realistic datasets.

Thus, in practice, one computes chunks of this matrix and reduces them directly.
This is what is done as of scikit-learn 1.0[ref][`NearestNeighbors.kneighbors`](https://github.com/scikit-learn/scikit-learn/blob/c762c407873b8d6417b1c2ff78d19d82550e48d3/sklearn/neighbors/_base.py#L650) is the interface to look for.[/ref].
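
To make this scheme concrete, here is a rough pure-numpy sketch of a chunked $\texttt{argkmin}$ (`chunked_argkmin` is a made-up helper for illustration, not scikit-learn's implementation):

```python
import numpy as np

def chunked_argkmin(X, Y, k, chunk_size=256):
    # For each row of X, return the indices of its k nearest rows of Y,
    # materializing only a (chunk_size, n_y) block of distances at a time.
    Y_sq = (Y ** 2).sum(axis=1)  # ||y||^2 terms, computed once
    out = []
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # Squared euclidean distances: ||x||^2 - 2 x.y + ||y||^2.
        dist = (chunk ** 2).sum(axis=1)[:, None] - 2 * chunk @ Y.T + Y_sq
        # argpartition selects the k smallest entries per row in linear time,
        idx = np.argpartition(dist, k - 1, axis=1)[:, :k]
        # then the "extra operations" step sorts those k values per row.
        order = np.take_along_axis(dist, idx, axis=1).argsort(axis=1)
        out.append(np.take_along_axis(idx, order, axis=1))
    return np.vstack(out)
```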

## Current issues

The current implementation relies on a general parallelization scheme using higher-level functions with
[`joblib.Parallel`](https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).
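
Schematically, the pattern resembles the following sketch, where each chunk of rows becomes one joblib task (hypothetical helper names, simplified from the actual code):

```python
import numpy as np
from joblib import Parallel, delayed

def argkmin_chunk(chunk, Y, k):
    # Same per-chunk reduction as in the sketch from the previous post.
    dist = (chunk ** 2).sum(axis=1)[:, None] - 2 * chunk @ Y.T + (Y ** 2).sum(axis=1)
    idx = np.argpartition(dist, k - 1, axis=1)[:, :k]
    order = np.take_along_axis(dist, idx, axis=1).argsort(axis=1)
    return np.take_along_axis(idx, order, axis=1)

def parallel_argkmin(X, Y, k, n_jobs=-1, chunk_size=256):
    # Parallelism is orchestrated from the Python level: one task per chunk.
    tasks = (
        delayed(argkmin_chunk)(X[i:i + chunk_size], Y, k)
        for i in range(0, X.shape[0], chunk_size)
    )
    return np.vstack(Parallel(n_jobs=n_jobs)(tasks))
```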

Technically, this is not the most efficient: working at this level with views on numpy arrays moves
large chunks of data back and forth several times between the RAM and the CPUs' caches, hardly
makes use of the caches, and allocates temporary results.

Moreover, the cost of manipulating those chunks of data in the CPython interpreter causes a
non-negligible overhead because they are associated with Python objects which are bound to the
Global Interpreter Lock for counting their references.

Hence, this does not allow getting proper _hardware scalability_: an ideal parallel
implementation of an algorithm would run close to $n$ times faster when running on
$n$ threads on $n$ CPU cores compared to sequentially on $1$ thread.
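
In other words, denoting $T_n$ the execution time on $n$ threads, the ideal speedup is linear:

$$\text{speedup}(n) = \frac{T_1}{T_n} \approx n$$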

For instance, the current implementation of $k$-nearest neighbors search of scikit-learn
cannot efficiently leverage all the available CPUs on a machine -- as shown by the figure below.

![Scalability of `kneighbors` as of scikit-learn 1.0](https://user-images.githubusercontent.com/13029839/155144242-6c041729-154b-47aa-9069-3a7d26deef5a.svg)

When using $8$ threads, it only runs about $2\times$ faster than the sequential implementation,
and adding more threads and CPUs beyond $8$ does not help to get better performance.

In the next blog post, we go over the design of a new implementation to improve the scalability of $k$-nn search.

---

## Notes
