.. _related_projects:
=====================================
Related Projects
=====================================
Projects implementing the scikit-learn estimator API are encouraged to use
the `scikit-learn-contrib template `_
which facilitates best practices for testing and documenting estimators.
The `scikit-learn-contrib GitHub organisation `_
also accepts high-quality contributions of repositories conforming to this
template.
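
As a rough illustration of what "implementing the scikit-learn estimator API"
involves, here is a minimal sketch in plain Python. It is not taken from the
template: ``MajorityClassifier`` and its ``tie_break`` parameter are
hypothetical names invented for this example, and it deliberately avoids
importing scikit-learn. It only shows the usual conventions: ``__init__``
stores hyperparameters, ``fit`` returns ``self``, learned attributes end in an
underscore, and ``get_params``/``set_params`` expose the hyperparameters.

```python
from collections import Counter


class MajorityClassifier:
    """Toy classifier (hypothetical) that predicts the most frequent training label."""

    def __init__(self, tie_break="first"):
        # Convention: __init__ only stores hyperparameters; no learning happens here.
        self.tie_break = tie_break

    def get_params(self, deep=True):
        # Expose hyperparameters so search and cloning tools can read them.
        return {"tie_break": self.tie_break}

    def set_params(self, **params):
        # Update hyperparameters by name and return self to allow chaining.
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of samples")
        counts = Counter(y)
        top = max(counts.values())
        # Counter keeps first-seen order, so ties can be broken by position.
        candidates = [label for label in counts if counts[label] == top]
        # Convention: attributes learned from data end with an underscore.
        self.majority_ = candidates[0] if self.tie_break == "first" else candidates[-1]
        self.classes_ = sorted(counts)
        return self

    def predict(self, X):
        if not hasattr(self, "majority_"):
            raise RuntimeError("Call fit before predict")
        return [self.majority_ for _ in X]


clf = MajorityClassifier().fit([[0], [1], [2]], ["a", "b", "b"])
predictions = clf.predict([[5], [6]])
```

In a real project one would normally subclass ``sklearn.base.BaseEstimator``,
which supplies ``get_params`` and ``set_params``, instead of writing them by
hand as in this sketch.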
Below is a list of sister projects, extensions, and domain-specific packages.
Interoperability and framework enhancements
-------------------------------------------
These tools adapt scikit-learn for use with other technologies or otherwise
enhance the functionality of scikit-learn's estimators.
**Data formats**
- `sklearn_pandas `_ bridges
scikit-learn pipelines and pandas data frames with dedicated transformers.
- `sklearn_xarray `_ provides
compatibility of scikit-learn estimators with xarray data structures.
**Auto-ML**
- `auto_ml `_
Automated machine learning for production and analytics, built on scikit-learn
and related projects. Trains a pipeline with all the standard machine learning
steps. Tuned for prediction speed and ease of transfer to production environments.
- `auto-sklearn `_
An automated machine learning toolkit and a drop-in replacement for a
scikit-learn estimator.
- `TPOT `_
An automated machine learning toolkit that optimizes a series of scikit-learn
operators to design a machine learning pipeline, including data and feature
preprocessors as well as the estimators. Works as a drop-in replacement for a
scikit-learn estimator.
- `scikit-optimize `_
A library to minimize (very) expensive and noisy black-box functions. It
implements several methods for sequential model-based optimization, and
includes a replacement for ``GridSearchCV`` or ``RandomizedSearchCV`` to do
cross-validated parameter search using any of these strategies.
**Experimentation frameworks**
- `REP `_ Environment for conducting data-driven
research in a consistent and reproducible way
- `ML Frontend `_ provides
dataset management and SVM fitting/prediction through
`web-based `_
and `programmatic `_
interfaces.
- `Scikit-Learn Laboratory
`_ A command-line
wrapper around scikit-learn that makes it easy to run machine learning
experiments with multiple learners and large feature sets.
- `Xcessiv `_ is a notebook-like
application for quick, scalable, and automated hyperparameter tuning
and stacked ensembling. Provides a framework for keeping track of
model-hyperparameter combinations.
**Model inspection and visualisation**
- `eli5 `_ A library for
debugging/inspecting machine learning models and explaining their
predictions.
- `mlxtend `_ Includes model visualization
utilities.
- `scikit-plot `_ A visualization library
for quick and easy generation of common plots in data analysis and machine learning.
- `yellowbrick `_ A suite of
custom matplotlib visualizers for scikit-learn estimators to support visual feature
analysis, model selection, evaluation, and diagnostics.
**Model export for production**
- `onnxmltools `_ Serializes many
Scikit-learn pipelines to `ONNX `_ for interchange and
prediction.
- `sklearn2pmml `_
Serialization of a wide variety of scikit-learn estimators and transformers
into PMML with the help of `JPMML-SkLearn `_
library.
- `sklearn-porter `_
Transpile trained scikit-learn models to C, Java, JavaScript and others.
- `sklearn-compiledtrees `_
Generate a C++ implementation of the predict function for decision trees (and
ensembles) trained by sklearn. Useful for latency-sensitive production
environments.
Other estimators and tasks
--------------------------
Not everything belongs in, or is mature enough for, the central scikit-learn
project. The following are projects providing interfaces similar to
scikit-learn for additional learning algorithms, infrastructures
and tasks.
**Structured learning**
- `Seqlearn `_ Sequence classification
using HMMs or structured perceptron.
- `HMMLearn `_ Implementation of hidden
Markov models that was previously part of scikit-learn.
- `PyStruct `_ General conditional random fields
and structured prediction.
- `pomegranate `_ Probabilistic modelling
for Python, with an emphasis on hidden Markov models.
- `sklearn-crfsuite `_
Linear-chain conditional random fields
(`CRFsuite `_ wrapper with
sklearn-like API).
**Deep neural networks etc.**
- `pylearn2 `_ A deep learning and
neural network library built on Theano with a scikit-learn-like interface.
- `sklearn_theano `_ scikit-learn compatible
estimators, transformers, and datasets which use Theano internally
- `nolearn `_ A number of wrappers and
abstractions around existing neural network libraries
- `keras `_ Deep Learning library capable of
running on top of either TensorFlow or Theano.
- `lasagne `_ A lightweight library to
build and train neural networks in Theano.
- `skorch `_ A scikit-learn compatible
neural network library that wraps PyTorch.
**Broad scope**
- `mlxtend `_ Includes a number of additional
estimators as well as model visualization utilities.
- `sparkit-learn `_ Scikit-learn
API and functionality for PySpark's distributed modelling.
**Other regression and classification**
- `xgboost `_ Optimised gradient boosted decision
tree library.
- `ML-Ensemble `_ Generalized
ensemble learning (stacking, blending, subsemble, deep ensembles,
etc.).
- `lightning `_ Fast
state-of-the-art linear model solvers (SDCA, AdaGrad, SVRG, SAG, etc.).
- `py-earth `_ Multivariate
adaptive regression splines
- `Kernel Regression `_
Implementation of Nadaraya-Watson kernel regression with automatic bandwidth
selection
- `gplearn `_ Genetic Programming
for symbolic regression tasks.
- `multiisotonic `_ Isotonic
regression on multidimensional features.
- `scikit-multilearn `_ Multi-label classification with a
focus on label space manipulation.
- `seglearn `_ Time series and sequence
learning using sliding window segmentation.
**Decomposition and clustering**
- `lda `_: Fast implementation of latent
Dirichlet allocation in Cython which uses `Gibbs sampling
`_ to sample from the true
posterior distribution. (scikit-learn's
:class:`sklearn.decomposition.LatentDirichletAllocation` implementation uses
`variational inference
`_ to fit
a tractable approximation of a topic model's posterior distribution.)
- `Sparse Filtering `_
Unsupervised feature learning based on sparse-filtering
- `kmodes `_ k-modes clustering algorithm for
categorical data, and several of its variations.
- `hdbscan `_ HDBSCAN and Robust Single
Linkage clustering algorithms for robust variable density clustering.
- `spherecluster `_ Spherical
K-means and mixture of von Mises Fisher clustering routines for data on the
unit hypersphere.
**Pre-processing**
- `categorical-encoding
`_ A
library of sklearn compatible categorical variable encoders.
- `imbalanced-learn
`_ Various
methods to under- and over-sample datasets.
Statistical learning with Python
--------------------------------
Other packages useful for data analysis and machine learning.
- `Pandas `_ Tools for working with heterogeneous and
columnar data, relational queries, time series and basic statistics.
- `theano `_ A CPU/GPU array
processing framework geared towards deep learning research.
- `statsmodels `_ Estimating and analysing
statistical models. More focused on statistical tests and less on prediction
than scikit-learn.
- `PyMC `_ Bayesian statistical models and
fitting algorithms.
- `Sacred `_ Tool to help you configure,
organize, log and reproduce experiments
- `Seaborn `_ Visualization library based on
matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
- `Deep Learning `_ A curated list of deep learning
software libraries.
Domain specific packages
~~~~~~~~~~~~~~~~~~~~~~~~
- `scikit-image `_ Image processing and computer
vision in Python.
- `Natural language toolkit (nltk) `_ Natural language
processing and some machine learning.
- `gensim `_ A library for topic modelling,
document indexing and similarity retrieval
- `NiLearn `_ Machine learning for neuro-imaging.
- `AstroML `_ Machine learning for astronomy.
- `MSMBuilder `_ Machine learning for protein
conformational dynamics time series.
- `scikit-surprise `_ A scikit for building and
evaluating recommender systems.
Snippets and tidbits
---------------------
The `wiki `_ has more!