SageMaker Scikit-Learn Extension is a Python module for machine learning built on top of scikit-learn.
This project contains standalone scikit-learn estimators and additional tools to support SageMaker Autopilot. Many of the additional estimators are based on existing scikit-learn estimators.
To install,
    # install from pip
    pip install sagemaker-scikit-learn-extension
To use the I/O functionality in the sagemaker_sklearn_extension.externals module, you will also need to install version 0.7 of the mlio package via conda. The mlio package is only available through conda at the moment.
To install mlio,
    # install mlio
    conda install -c mlio -c conda-forge mlio-py==0.7
For more information about mlio, see https://github.com/awslabs/ml-io.
You can also install from source by cloning this repository and running a pip install command in the root directory of the repository:
    # install from source
    git clone https://github.com/aws/sagemaker-scikit-learn-extension.git
    cd sagemaker-scikit-learn-extension
    pip install -e .
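After any of the install routes above, it can help to confirm which packages Python can actually see. The following is a minimal sketch using only the standard library. The helper name check_installed is hypothetical, and the import name mlio for the mlio-py conda package is an assumption; adjust it if your environment differs.

```python
import importlib.util


def check_installed(module_names=("sagemaker_sklearn_extension", "mlio")):
    """Map each top-level module name to True if it is found on the import path.

    Only top-level names are checked, so nothing is actually imported.
    The default names are assumptions about how the packages are exposed.
    """
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


if __name__ == "__main__":
    for name, found in check_installed().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If mlio is reported as missing, only the I/O helpers in sagemaker_sklearn_extension.externals should be affected, since that is the part the text above says depends on it.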
SageMaker scikit-learn extension supports Unix/Linux and Mac.
SageMaker scikit-learn extension is tested on:
- Python 3.7
This library is licensed under the Apache 2.0 License.
We welcome contributions from developers of all experience levels.
The SageMaker scikit-learn extension is meant to be a repository for scikit-learn estimators that don't meet scikit-learn's stringent inclusion criteria.
We recommend using conda for development and testing.
To download conda, go to the conda installation guide.
SageMaker scikit-learn extension contains an extensive suite of unit tests.
You can install the libraries needed to run the tests by running:
    pip install --upgrade .[test]
or, for Zsh users:
    pip install --upgrade .\[test\]
tox uses pytest to run the unit tests in a Python 3.7 interpreter. It also runs flake8 and pylint for style checks.
conda is needed because of the dependency on mlio 0.7.
To run the tests with tox, run:
    tox
To use sagemaker-scikit-learn-extension on SageMaker, you can build the sagemaker-scikit-learn-extension-container.
sagemaker_sklearn_extension.decomposition
- RobustPCA: dimension reduction for dense and sparse inputs
sagemaker_sklearn_extension.externals
- AutoMLTransformer: utility class encapsulating feature and target transformation functionality used in SageMaker Autopilot
- Header: utility class to manage the header and target columns in tabular data
- read_csv_data: reads comma separated data and returns a numpy array (uses mlio)
sagemaker_sklearn_extension.feature_extraction.date_time
- DateTimeVectorizer: convert datetime objects or strings into numeric features
sagemaker_sklearn_extension.feature_extraction.sequences
- TSFlattener: convert strings of sequences into numeric features
- TSFreshFeatureExtractor: compute row-wise time series features from a numpy array (uses tsfresh)
sagemaker_sklearn_extension.feature_extraction.text
- MultiColumnTfidfVectorizer: convert collections of raw documents to a matrix of TF-IDF features
sagemaker_sklearn_extension.impute
- RobustImputer: imputer for missing values with customizable mask_function and multi-column constant imputation
- RobustMissingIndicator: binary indicator for missing values with customizable mask_function
sagemaker_sklearn_extension.preprocessing
- BaseExtremeValuesTransformer: customizable transformer for columns that contain "extreme" values (columns that are heavy tailed)
- LogExtremeValuesTransformer: stateful log transformer for columns that contain "extreme" values (columns that are heavy tailed)
- NALabelEncoder: encoder for transforming labels to NA values
- QuadraticFeatures: generate and add quadratic features to feature matrix
- QuantileExtremeValuesTransformer: stateful quantiles transformer for columns that contain "extreme" values (columns that are heavy tailed)
- ThresholdOneHotEncoder: encode categorical integer features as a one-hot numeric array, with optional restrictions on feature encoding
- RemoveConstantColumnsTransformer: removes constant columns
- RobustLabelEncoder: encode labels for seen and unseen labels
- RobustStandardScaler: standardization for dense and sparse inputs
- WOEEncoder: weight of evidence supervised encoder
- SimilarityEncoder: encode categorical values based on their descriptive string
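The WOEEncoder entry refers to weight-of-evidence encoding. To illustrate the general idea only, here is a standard-library sketch of the textbook definition for a binary target: each category is mapped to the log of the ratio between its share of positive rows and its share of negative rows. The function name weight_of_evidence and the smoothing parameter are hypothetical and not part of this library's API; the actual WOEEncoder implementation and its options may differ.

```python
import math
from collections import Counter


def weight_of_evidence(categories, targets, smoothing=0.5):
    """Return {category: WOE} for a binary (0/1) target.

    WOE(c) = ln( P(c | y=1) / P(c | y=0) ).
    `smoothing` (an illustrative assumption) adds pseudo-counts so that
    categories seen with only one class do not produce log(0) or division by zero.
    """
    pos = Counter(c for c, t in zip(categories, targets) if t == 1)
    neg = Counter(c for c, t in zip(categories, targets) if t == 0)
    levels = set(pos) | set(neg)
    total_pos = sum(pos.values()) + smoothing * len(levels)
    total_neg = sum(neg.values()) + smoothing * len(levels)

    woe = {}
    for c in levels:
        share_pos = (pos[c] + smoothing) / total_pos  # smoothed P(c | y=1)
        share_neg = (neg[c] + smoothing) / total_neg  # smoothed P(c | y=0)
        woe[c] = math.log(share_pos / share_neg)
    return woe


if __name__ == "__main__":
    print(weight_of_evidence(["a", "a", "b", "b"], [1, 1, 0, 0]))
```

Categories that occur mostly with the positive class get a positive value, those that occur mostly with the negative class get a negative value, and a category spread evenly across both classes lands at zero.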