
Commit 610e437

Author: Github Actions (committed)

eddiebergman: Add: Doc for dataset_compression
1 parent 9bf796f commit 610e437

File tree: 86 files changed, +3128 / -3890 lines


development/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 8a26f7fbaa1576935d6b4916c5b79de9
+config: 19b39b196a4ce26d6f98b3eb2c061df5
 tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary image files changed (previews not shown): -4.2 KB, 2.29 KB, 2.55 KB, 662 Bytes, 966 Bytes

development/_modules/autosklearn/estimators.html

Lines changed: 67 additions & 87 deletions
@@ -63,6 +63,7 @@
 <li><a href="../../index.html">Start</a></li>
 <li><a href="../../releases.html">Releases</a></li>
 <li><a href="../../installation.html">Installation</a></li>
+<li><a href="../../manual.html">Manual</a></li>
 <li><a href="../../examples/index.html">Examples</a></li>
 <li><a href="../../api.html">API</a></li>
 <li><a href="../../extending.html">Extending</a></li>
@@ -268,58 +269,39 @@ Source code for autosklearn.estimators
         'feature_preprocessor': ["no_preprocessing"]
         }

-    resampling_strategy : Union[str, BaseCrossValidator, _RepeatedSplits, BaseShuffleSplit] = "holdout"
+    resampling_strategy : str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = "holdout"
         How to handle overfitting; might need to use ``resampling_strategy_arguments``
         if using a ``"cv"`` based method or a Splitter object.

+        * **Options**
+            * ``"holdout"`` - Use a 67:33 (train:test) split
+            * ``"cv"``: perform cross validation, requires "folds" in ``resampling_strategy_arguments``
+            * ``"holdout-iterative-fit"`` - Same as "holdout" but iterative fit where possible
+            * ``"cv-iterative-fit"``: Same as "cv" but iterative fit where possible
+            * ``"partial-cv"``: Same as "cv" but uses intensification.
+            * ``BaseCrossValidator`` - any BaseCrossValidator subclass (found in scikit-learn model_selection module)
+            * ``_RepeatedSplits`` - any _RepeatedSplits subclass (found in scikit-learn model_selection module)
+            * ``BaseShuffleSplit`` - any BaseShuffleSplit subclass (found in scikit-learn model_selection module)
+
         If using a Splitter object that relies on the dataset retaining its current
         size and order, you will need to look at the ``dataset_compression`` argument
         and ensure that ``"subsample"`` is not included in the applied compression
         ``"methods"`` or disable it entirely with ``False``.

-        **Options**
-
-        * ``"holdout"``:
-            67:33 (train:test) split
-        * ``"holdout-iterative-fit"``:
-            67:33 (train:test) split, iterative fit where possible
-        * ``"cv"``:
-            cross-validation,
-            requires ``"folds"`` in ``resampling_strategy_arguments``
-        * ``"cv-iterative-fit"``:
-            cross-validation,
-            calls iterative fit where possible,
-            requires ``"folds"`` in ``resampling_strategy_arguments``
-        * 'partial-cv':
-            cross-validation with intensification,
-            requires ``"folds"`` in ``resampling_strategy_arguments``
-        * ``BaseCrossValidator`` subclass:
-            any BaseCrossValidator subclass (found in scikit-learn model_selection module)
-        * ``_RepeatedSplits`` subclass:
-            any _RepeatedSplits subclass (found in scikit-learn model_selection module)
-        * ``BaseShuffleSplit`` subclass:
-            any BaseShuffleSplit subclass (found in scikit-learn model_selection module)
-
-    resampling_strategy_arguments : dict, optional if 'holdout' (train_size default=0.67)
-        Additional arguments for resampling_strategy:
-
-        * ``train_size`` should be between 0.0 and 1.0 and represent the
-          proportion of the dataset to include in the train split.
-        * ``shuffle`` determines whether the data is shuffled prior to
-          splitting it into train and validation.
-
-        Available arguments:
-
-        * 'holdout': {'train_size': float}
-        * 'holdout-iterative-fit': {'train_size': float}
-        * 'cv': {'folds': int}
-        * 'cv-iterative-fit': {'folds': int}
-        * 'partial-cv': {'folds': int, 'shuffle': bool}
-        * BaseCrossValidator or _RepeatedSplits or BaseShuffleSplit object: all arguments
-          required by chosen class as specified in scikit-learn documentation.
-          If arguments are not provided, scikit-learn defaults are used.
-          If no defaults are available, an exception is raised.
-          Refer to the 'n_splits' argument as 'folds'.
+    resampling_strategy_arguments : Optional[Dict]
+        Additional arguments for ``resampling_strategy``; this is required if
+        using a ``cv`` based strategy:
+
+        .. code-block:: python
+
+            {
+                "train_size": 0.67,  # The size of the training set
+                "shuffle": True,     # Whether to shuffle before splitting data
+                "folds": 5           # Used in 'cv' based resampling strategies
+            }
+
+        If using a custom splitter class which takes ``n_splits``, such as
+        `PredefinedSplit <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn-model-selection-kfold>`_, the value of ``"folds"`` will be used.

     tmp_folder : string, optional (None)
         folder to store configuration output and log files, if ``None``
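In practice the two parameters documented in the hunk above are passed together to the estimator. The following is a minimal sketch assuming a standard auto-sklearn installation; the time budget and fold count are illustrative, and the commented splitter-object variant is an assumption based on the options list rather than something taken from this commit.

    # Sketch: a "cv" resampling strategy with the documented arguments.
    from autosklearn.classification import AutoSklearnClassifier

    automl = AutoSklearnClassifier(
        time_left_for_this_task=120,                 # illustrative search budget in seconds
        resampling_strategy="cv",                    # cross-validation, per the docstring
        resampling_strategy_arguments={"folds": 5},  # required for "cv" based strategies
    )
    # automl.fit(X_train, y_train) would then evaluate each configuration with 5-fold CV.

    # A scikit-learn splitter object should also be accepted, per the options above:
    # from sklearn.model_selection import KFold
    # automl = AutoSklearnClassifier(resampling_strategy=KFold(n_splits=5))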
@@ -331,12 +313,12 @@ Source code for autosklearn.estimators (trailing whitespace stripped)

     n_jobs : int, optional, experimental
         The number of jobs to run in parallel for ``fit()``. ``-1`` means
-        using all processors. 
-        
-        **Important notes**: 
-        
-        * By default, Auto-sklearn uses one core. 
-        * Ensemble building is not affected by ``n_jobs`` but can be controlled by the number 
+        using all processors.
+
+        **Important notes**:
+
+        * By default, Auto-sklearn uses one core.
+        * Ensemble building is not affected by ``n_jobs`` but can be controlled by the number
           of models in the ensemble.
         * ``predict()`` is not affected by ``n_jobs`` (in contrast to most scikit-learn models)
         * If ``dask_client`` is ``None``, a new dask client is created.
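A short sketch of the ``n_jobs`` behaviour described above, again assuming a standard installation; the memory limit shown is an illustrative value, not a recommendation.

    # Sketch: parallel fit(); predict() is not parallelised by n_jobs.
    from autosklearn.classification import AutoSklearnClassifier

    automl = AutoSklearnClassifier(
        n_jobs=-1,          # use all processors for fit(); the default is a single core
        memory_limit=3072,  # per-job memory limit in MB (illustrative value)
    )
    # If dask_client is None (the default), auto-sklearn creates a dask client internally.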
@@ -400,16 +382,14 @@ Source code for autosklearn.estimators

     dataset_compression: Union[bool, Mapping[str, Any]] = True
         We compress datasets so that they fit into some predefined amount of memory.
-        Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.
+        Currently this does not apply to dataframes or sparse arrays, only to raw
+        numpy arrays.

-        **NOTE**
-
-        If using a custom ``resampling_strategy`` that relies on specific
+        **NOTE** - If using a custom ``resampling_strategy`` that relies on specific
         size or ordering of data, this must be disabled to preserve these properties.

-        You can disable this entirely by passing ``False``.
-
-        Default configuration when left as ``True``:
+        You can disable this entirely by passing ``False``, or leave it as the default
+        ``True`` for the configuration below.

         .. code-block:: python

@@ -423,36 +403,36 @@ Source code for autosklearn.estimators

         The available options are described here:

-        **memory_allocation**
-
-        By default, we attempt to fit the dataset into ``0.1 * memory_limit``. This
-        float value can be set with ``"memory_allocation": 0.1``. We also allow for
-        specifying absolute memory in MB, e.g. 10MB is ``"memory_allocation": 10``.
-
-        The memory used by the dataset is checked after each reduction method is
-        performed. If the dataset fits into the allocated memory, any further methods
-        listed in ``"methods"`` will not be performed.
-
-        For example, if ``methods: ["precision", "subsample"]`` and the
-        ``"precision"`` reduction step was enough to make the dataset fit into memory,
-        then the ``"subsample"`` reduction step will not be performed.
-
-        **methods**
-
-        We currently provide the following methods for reducing the dataset size.
-        These can be provided in a list and are performed in the order as given.
-
-        * ``"precision"`` - We reduce floating point precision as follows:
-            * ``np.float128 -> np.float64``
-            * ``np.float96 -> np.float64``
-            * ``np.float64 -> np.float32``
-
-        * ``subsample`` - We subsample data such that it **fits directly into the
-          memory allocation** ``memory_allocation * memory_limit``. Therefore, this
-          should likely be the last method listed in ``"methods"``.
-          Subsampling takes into account classification labels and stratifies
-          accordingly. We guarantee that at least one occurrence of each label is
-          included in the sampled set.
+        * **memory_allocation**
+            By default, we attempt to fit the dataset into ``0.1 * memory_limit``.
+            This float value can be set with ``"memory_allocation": 0.1``.
+            We also allow for specifying absolute memory in MB, e.g. 10MB is
+            ``"memory_allocation": 10``.
+
+            The memory used by the dataset is checked after each reduction method is
+            performed. If the dataset fits into the allocated memory, any further
+            methods listed in ``"methods"`` will not be performed.
+
+            For example, if ``methods: ["precision", "subsample"]`` and the
+            ``"precision"`` reduction step was enough to make the dataset fit into
+            memory, then the ``"subsample"`` reduction step will not be performed.
+
+        * **methods**
+            We provide the following methods for reducing the dataset size.
+            These can be provided in a list and are performed in the order as given.
+
+            * ``"precision"`` - We reduce floating point precision as follows:
+                * ``np.float128 -> np.float64``
+                * ``np.float96 -> np.float64``
+                * ``np.float64 -> np.float32``
+
+            * ``subsample`` - We subsample data such that it **fits directly into
+              the memory allocation** ``memory_allocation * memory_limit``.
+              Therefore, this should likely be the last method listed in
+              ``"methods"``.
+              Subsampling takes into account classification labels and stratifies
+              accordingly. We guarantee that at least one occurrence of each
+              label is included in the sampled set.

     Attributes
     ----------
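To make the ``dataset_compression`` options above concrete, here is a hedged sketch: the dictionary mirrors the defaults described in the docstring, disabling compression is shown for the custom-splitter case, and the final numpy lines only illustrate what the ``"precision"`` step amounts to; none of this is auto-sklearn's internal code.

    import numpy as np
    from autosklearn.classification import AutoSklearnClassifier

    # Roughly equivalent to leaving dataset_compression=True, per the docstring above:
    automl = AutoSklearnClassifier(
        memory_limit=3072,  # MB; compression targets a fraction of this (illustrative value)
        dataset_compression={
            "memory_allocation": 0.1,               # fit the data into 0.1 * memory_limit
            "methods": ["precision", "subsample"],  # applied in order until the data fits
        },
    )

    # Disable compression entirely, e.g. when a custom splitter relies on the
    # dataset keeping its original size and ordering:
    automl_raw = AutoSklearnClassifier(dataset_compression=False)

    # What the "precision" step amounts to on a raw numpy array:
    X = np.random.rand(10_000, 20)    # float64 by default
    X_small = X.astype(np.float32)    # float64 -> float32 halves the memory
    print(X.nbytes, X_small.nbytes)   # 1600000 vs 800000 bytes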

development/_modules/autosklearn/experimental/askl2.html

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@
 <li><a href="../../../index.html">Start</a></li>
 <li><a href="../../../releases.html">Releases</a></li>
 <li><a href="../../../installation.html">Installation</a></li>
+<li><a href="../../../manual.html">Manual</a></li>
 <li><a href="../../../examples/index.html">Examples</a></li>
 <li><a href="../../../api.html">API</a></li>
 <li><a href="../../../extending.html">Extending</a></li>
