Purpose of this document
This document lists general directions that core contributors would like to see developed in scikit-learn. The fact that an item is listed here is in no way a promise that it will happen, as resources are limited. Rather, it is an indication that help is welcome on this topic.
Statement of purpose: Scikit-learn in 2018
Eleven years after the inception of Scikit-learn, much has changed in the world of machine learning. Key changes include:
Computational tools: The exploitation of GPUs, distributed programming frameworks like Scala/Spark, etc.
High-level Python libraries for experimentation, processing and data management: Jupyter notebook, Cython, Pandas, Dask, Numba…
Changes in the focus of machine learning research: artificial intelligence applications (where input structure is key) with deep learning, representation learning, reinforcement learning, domain transfer, etc.
A more subtle change over the last decade is that, due to changing interests in ML, PhD students in machine learning are more likely to contribute to PyTorch, Dask, etc. than to Scikit-learn, so our contributor pool is very different to a decade ago.
Scikit-learn remains very popular in practice for trying out canonical machine learning techniques, particularly for applications in experimental science and in data science. A lot of what we provide is now very mature. But it can be costly to maintain, and we cannot therefore include arbitrary new implementations. Yet Scikit-learn is also essential in defining an API framework for the development of interoperable machine learning components external to the core library.
Thus our main goals in this era are to:
continue maintaining a high-quality, well-documented collection of canonical tools for data processing and machine learning within the current scope (i.e. rectangular data largely invariant to column and row order; predicting targets with simple structure)
improve the ease for users to develop and publish external components
improve interoperability with modern data science tools (e.g. Pandas, Dask) and infrastructures (e.g. distributed processing)
Many of the more fine-grained goals can be found under the API tag on the issue tracker.
Architectural / general goals
The list is numbered not as an indication of the order of priority, but to make referring to specific points easier. Please add new entries only at the bottom. Note that the crossed out entries are already done, and we try to keep the document up to date as we work on these issues.
Improved handling of Pandas DataFrames
Improved handling of categorical features
In dataset loaders #13902
Handling mixtures of categorical and continuous variables
Improved handling of missing data
More didactic documentation
More and more options have been added to scikit-learn. As a result, the documentation is crowded, which makes it hard for beginners to get the big picture. Some work could be done on prioritizing the information.
Passing around information that is not (X, y): Sample properties
Passing around information that is not (X, y): Feature properties
Passing around information that is not (X, y): Target information
Make it easier for external users to write Scikit-learn-compatible components
Support resampling and sample reduction
Better interfaces for interactive development
Improved tools for model diagnostics and basic inference
Better tools for selecting hyperparameters with transductive estimators
Grid search and cross validation are not applicable to most clustering tasks. Stability-based selection is more relevant.
Better support for manual and automatic pipeline building
Improved tracking of fitting
Accept data which complies with
A way forward for more out of core
Dask enables easy out-of-core computation. While the Dask model probably cannot be adapted to all machine-learning algorithms, most machine learning is on smaller data than ETL, hence we can maybe adapt to very large scale while supporting only a fraction of the patterns.
Support for working with pre-trained models
Backwards-compatible de/serialization of some estimators
Currently serialization (with pickle) breaks across versions. While we may not be able to get around other limitations of pickle regarding security etc., it would be great to offer cross-version safety from version 1.0. Note: Gael and Olivier think that this can cause a heavy maintenance burden and we should manage the trade-offs. A possible alternative is presented in the following point.
Documentation and tooling for model lifecycle management
Document good practices for model deployments and lifecycle. Before deploying a model, snapshot: the code versions (numpy, scipy, scikit-learn, custom code repo); the training script and an alias for retrieving the historical training data; a copy of a small validation set; and the predictions (predicted probabilities for classifiers) on that validation set.
Documentation and tools to make it easy to manage upgrades of scikit-learn versions:
Try to load the old pickle; if it works, use the validation set prediction snapshot to check that the serialized model still behaves the same;
If joblib.load / pickle.load does not work, use the version-controlled training script + historical training set to retrain the model and use the validation set prediction snapshot to assert that it is possible to recover the previous predictive performance: if this is not the case there is probably a bug in scikit-learn that needs to be reported.
Everything in Scikit-learn should probably conform to our API contract. We are still in the process of making decisions on some of these related issues.
(Optional) Improve the scikit-learn common test suite to make sure that models (at least the frequently used ones) have stable predictions across versions (to be discussed);
Extend documentation to mention how to deploy models in Python-free environments, for instance with ONNX, and use the above best practices to assess predictive consistency between scikit-learn and ONNX prediction functions on the validation set.
Document good practices to detect temporal distribution drift for deployed models and good practices for re-training on fresh data without causing catastrophic predictive performance regressions.
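The snapshot-and-check workflow described under model lifecycle management could be sketched roughly as follows. This is only an illustration under stated assumptions: the `snapshot` dictionary layout and the `check_after_upgrade` helper are hypothetical names, not part of scikit-learn, and a real deployment would write the snapshot to disk and run the check in the upgraded environment rather than in the same process.

```python
import pickle

import numpy as np
import scipy
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the historical training set and a small validation set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train = X[:250], y[:250]
X_val = X[250:]


def train():
    """Stand-in for the version-controlled training script."""
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)


model = train()

# Hypothetical snapshot taken before deployment: library versions,
# the validation inputs, the predicted probabilities, and the pickled model.
snapshot = {
    "versions": {
        "numpy": np.__version__,
        "scipy": scipy.__version__,
        "scikit-learn": sklearn.__version__,
    },
    "X_val": X_val,
    "proba_val": model.predict_proba(X_val),
    "model_bytes": pickle.dumps(model),
}


def check_after_upgrade(snapshot, retrain, atol=1e-6):
    """Try the old pickle first; fall back to retraining if loading fails.

    Returns where the model came from and whether its predicted
    probabilities match the snapshot within `atol`.
    """
    try:
        candidate = pickle.loads(snapshot["model_bytes"])
        source = "pickle"
    except Exception:
        candidate = retrain()
        source = "retrained"
    same = np.allclose(
        candidate.predict_proba(snapshot["X_val"]),
        snapshot["proba_val"],
        atol=atol,
    )
    return source, same


source, same = check_after_upgrade(snapshot, train)
print(source, same)
```

Comparing predicted probabilities rather than hard labels makes small numerical changes visible; the tolerance `atol` is an arbitrary choice here and would need to be tuned per model.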
a stacking implementation, #11047
kmeans variants for non-Euclidean distances, if we can show these have benefits beyond hierarchical clustering.
multi-metric scoring is slow #9326
perhaps we want to be able to get back more than multiple metrics
the handling of random states in CV splitters is a poor design and contradicts the validation of similar parameters in estimators, SLEP011
exploit warm-starting and path algorithms so the benefits of EstimatorCV objects can be accessed via GridSearchCV and used in Pipelines. #1626
Cross-validation should be able to be replaced by OOB estimates whenever a cross-validation iterator is used.
Redundant computations in pipelines should be avoided (related to point above) cf dask-ml
Ability to substitute a custom/approximate/precomputed nearest neighbors implementation for ours in all/most contexts where nearest neighbors are used for learning. #10463
Performance issues with
see “Everything in Scikit-learn should conform to our API contract” above
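To make the item on warm-starting and path algorithms more concrete, the sketch below contrasts a dedicated cross-validated estimator, `LassoCV`, with a generic `GridSearchCV` over `Lasso` on the same grid of `alpha` values and the same folds. The synthetic dataset, the grid and the fold settings are arbitrary choices for illustration only. `LassoCV` computes the whole regularization path on each fold, while the grid search refits `Lasso` independently for every candidate.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression problem (illustrative only).
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=1.0, random_state=0
)
alphas = np.logspace(-3, 1, 20)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Dedicated CV estimator: one regularization path per fold.
lasso_cv = LassoCV(alphas=alphas, cv=cv).fit(X, y)

# Generic search: an independent fit per (alpha, fold) pair.
grid = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid={"alpha": alphas},
    cv=cv,
    scoring="neg_mean_squared_error",
).fit(X, y)

print("LassoCV alpha:", lasso_cv.alpha_)
print("GridSearchCV alpha:", grid.best_params_["alpha"])
```

Both searches minimize mean squared error averaged over folds, so they would usually be expected to choose the same or a neighbouring `alpha`, although small numerical differences between the solvers' convergence can change the choice.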