|
| 1 | +Title: Performance and scikit-learn (1/4) |
| 2 | +Date: 2021-12-16 |
| 3 | +Category: scikit-learn |
| 4 | +Slug: sklearn-perf-context |
| 5 | +Lang: en |
| 6 | +Authors: Julien Jerphanion |
| 7 | +Summary: Context: the current state of scikit-learn performance |
| 8 | +Status: Published |
| 9 | + |
| 10 | +## High-level overview of the scikit-learn dependences |
| 11 | + |
| 12 | +scikit-learn is mainly written in Python and is built on top of |
| 13 | +some core libraries of the scientific Python ecosystem. |
| 14 | + |
| 15 | +This ecosystem allows _high expressiveness_ and |
| 16 | +_interactivity_: one can perform complex operations in a few |
| 17 | +lines of code and get the results straight away. |
| 18 | + |
| 19 | +It also allowed setting up simple conventions which makes the |
| 20 | +code-base algorithms easy to understand and improve |
| 21 | +for new contributors. |
| 22 | + |
| 23 | +It also allows delegating most of the complex operations |
| 24 | +to well-tested third-party libraries. For instance, calls |
| 25 | +to functions implemented in |
| 26 | +[`numpy.linalg`](https://numpy.org/doc/stable/reference/routines.linalg.html), |
| 27 | +[`scipy.linalg`](https://docs.scipy.org/doc/scipy/reference/linalg.html ), and |
| 28 | +[`scipy.sparse.linalg`](https://docs.scipy.org/doc/scipy/reference/sparse.linalg.html) |
| 29 | +are delegated to |
| 30 | +[BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms), |
| 31 | +[LAPACK](https://www.netlib.org/lapack/), |
| 32 | +and [ARPACK](https://www.caam.rice.edu/software/ARPACK/) interfaces. |
| 33 | + |
| 34 | +## Main reasons for limited performance |
| 35 | + |
| 36 | +The PyData stack is simple but is not tailored for optimal performance |
| 37 | +for several reasons. |
| 38 | + |
| 39 | +### CPython internals |
| 40 | + |
| 41 | +CPython -- the main implementation of Python -- is slow. |
| 42 | + |
| 43 | +First, CPython has _an interpreter_: there's a cost in converting Python |
| 44 | +instructions into another intermediate representation -- |
| 45 | +the _byte-code_ -- and executing the instructions by interpreting their |
| 46 | +byte-code. |
| 47 | + |
| 48 | +Secondly, nearly every value in CPython is _boxed_ into a `PyObject` |
| 49 | +-- implemented as a C struct. As such, simple operations |
| 50 | +(like adding two floats) come with a non-negligible dispatch |
| 51 | +overhead as the interpreter has to check the type which is unknown |
| 52 | +in advance. |
| 53 | + |
| 54 | +Thirdly, CPython for memory management relies on a global |
| 55 | +mutex on its interpreter called the _Global Interpreter Lock_[ref]For more information about the GIL, see |
| 56 | +[this reference from the Python Wiki](https://wiki.python.org/moin/GlobalInterpreterLock).[/ref]. |
| 57 | +This mechanism comes in handy but computations are restricted in |
| 58 | +most cases to sequential execution in a single thread, removing the benefit |
| 59 | +of using threads. |
| 60 | + |
| 61 | +### Memory-hierarchy suboptimal implementations |
| 62 | + |
| 63 | +`numpy` supports high-level operations but this comes with intermediate |
| 64 | +and dynamically-allocated arrays. |
| 65 | + |
| 66 | +Moreover, this pattern is inefficient from a memory perspective: |
| 67 | +during the execution, blocks of data are moved back and forth |
| 68 | +between the RAM and the different CPU caches several times, not |
| 69 | +making optimal use of the caches. |
| 70 | + |
| 71 | +For instance, based on this minimalistic toy example: |
| 72 | +```python |
| 73 | +import numpy as np |
| 74 | + |
| 75 | +A = np.random.rand(100_000_000) |
| 76 | +B = np.random.rand(100_000_000) |
| 77 | + |
| 78 | +X = np.exp(A + B) |
| 79 | +``` |
| 80 | + |
| 81 | +The following is performed: |
| 82 | + |
| 83 | + - a first temporary array is allocated for `A + B` |
| 84 | + - a second array is allocated to store `np.exp(A + B)` and |
| 85 | + the first temporary array is discarded |
| 86 | + |
| 87 | +This temporary allocation makes the implementation suboptimal |
| 88 | +as memory allocation on the heap is slow. |
| 89 | + |
| 90 | +Furthermore, high-level operations on `X` come with more data |
| 91 | +moves between the RAM and the CPU than needed to compute the |
| 92 | +elements of `X` and hardly make use of the memory hierarchy |
| 93 | +and the size of the caches. |
| 94 | + |
| 95 | +### No "bare-metal" data-structures |
| 96 | + |
| 97 | +The Python ecosystem comes with a few high-level containers |
| 98 | +such as numpy arrays, and pandas DataFrames. |
| 99 | + |
| 100 | +Contrarily to other languages' standard libraries (like the one of |
| 101 | +C++), no "bare-metal" data structures, including heaps, or |
| 102 | +memory-contiguous resizable buffers (as implemented in C++ by |
| 103 | +[`std::priority_queue`](https://en.cppreference.com/w/cpp/container/priority_queue) |
| 104 | +and [`std::vector`](https://en.cppreference.com/w/cpp/container/vector)) |
| 105 | +are available to implement some algorithms efficiently |
| 106 | +from both a computational complexity and a technical perspective. |
| 107 | + |
| 108 | +## Cython: combining the conciseness of Python and the speed of C |
| 109 | + |
| 110 | +In brief, Cython allows transpiling a superset of Python to C code and allows using code that was written in C or C++, which makes bypassing some of CPython's internals possible. Moreover, Cython allows using [OpenMP](https://www.openmp.org/specifications/), an API that allows using lower-level parallelism primitives for implementations written in C or Fortran[ref] For more information on Cython, see [its documentation](https://cython.readthedocs.io/en/latest/).[/ref]. |
| 111 | + |
| 112 | +In most cases, features provided by Cython are sufficient enough to reach optimal implementations for many scientific algorithms for which static tasks scheduling -- at the level of C via OpenMP -- is the most natural and optimal one. |
| 113 | +Plus, its syntax makes this language expressive enough to get nearly optimal performance while keeping the instructions short and concise -- which is a real advantage for developers coming from Python who are looking for performance and relief for C or C++ developers[ref]Compared to C or C++, Cython is also less verbose and can be integrated Python build system more easily.[/ref]. |
| 114 | + |
| 115 | +As such, many algorithms in `scikit-learn` are implemented in Cython performance, some of which use OpenMP when possible. This is for instance the case of `KMeans` which was initially written in Python using numpy and which was rewritten in Cython by Jérémie du Boisberranger, improving the execution time by a factor of 5 for this algorithm[ref]For more information about `KMeans`, see the original contribution, |
| 116 | +[`scikit-learn#11950`](https://github.com/scikit-learn/scikit-learn/pull/11950), and [this blog |
| 117 | +post](https://scikit-learn.fondation-inria.fr/implementing-a-faster-kmeans-in-scikit-learn-0-23-2/).[/ref]. |
| 118 | + |
| 119 | +In the following posts, the case of $k$-nearest neighbors search -- the base routine |
| 120 | +for `KNearestNeighborsClassifier`, `KNearestNeighborsRegressor` and other scikit-learn interfaces -- is covered |
| 121 | +and a new Cython implementation is proposed. |
| 122 | + |
| 123 | +--- |
| 124 | + |
| 125 | +## Notes |
| 126 | + |
0 commit comments