Version 0.15.2
September 4, 2014
Bug fixes
- Fixed handling of the p parameter of the Minkowski distance that was previously ignored in nearest neighbors models. By Nikolay Mayorov.
- Fixed duplicated alphas in linear_model.LassoLars with early stopping on 32-bit Python. By Olivier Grisel and Fabian Pedregosa.
- Fixed the build under Windows when scikit-learn is built with MSVC while NumPy is built with MinGW. By Olivier Grisel and Federico Vaggi.
- Fixed an array index overflow bug in the coordinate descent solver. By Gael Varoquaux.
- Better handling of numpy 1.9 deprecation warnings. By Gael Varoquaux.
- Removed unnecessary data copy in cluster.KMeans. By Gael Varoquaux.
- Explicitly close open files to avoid ResourceWarnings under Python 3. By Calvin Giles.
- The transform of discriminant_analysis.LinearDiscriminantAnalysis now projects the input on the most discriminant directions. By Martin Billinger.
- Fixed potential overflow in _tree.safe_realloc. By Lars Buitinck.
- Performance optimization in isotonic.IsotonicRegression. By Robert Bradshaw.
- nose is no longer a runtime dependency to import sklearn, only for running the tests. By Joel Nothman.
- Many documentation and website fixes by Joel Nothman, Lars Buitinck, Matt Pico, and others.
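To see why the first fix in this list matters, the sketch below computes the Minkowski distance for two values of p. It is a plain-Python illustration, not scikit-learn's implementation; the function name minkowski_distance and the sample points are assumptions made for this example.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length sequences.

    Hypothetical helper for illustration only; it is not part of scikit-learn.
    """
    if p < 1:
        raise ValueError("p must be >= 1 for a valid metric")
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # Sum the p-th powers of the coordinate differences, then take the p-th root.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


origin, point = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski_distance(origin, point, p=1)  # |3| + |4|
euclidean = minkowski_distance(origin, point, p=2)  # sqrt(3**2 + 4**2)
```

If p were silently ignored and the default used instead, both calls would return the same value, so a neighbor search configured with p=1 would rank points by the wrong metric.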
Version 0.15.1
August 1, 2014
Bug fixes
- Made cross_validation.cross_val_score use cross_validation.KFold instead of cross_validation.StratifiedKFold on multi-output classification problems. By Nikolay Mayorov.
- Support unseen labels in preprocessing.LabelBinarizer to restore the default behavior of 0.14.1 for backward compatibility. By Hamzeh Alsalhi.
- Fixed the cluster.KMeans stopping criterion that prevented early convergence detection. By Edward Raff and Gael Varoquaux.
- Fixed the behavior of multiclass.OneVsOneClassifier in case of ties at the per-class vote level by computing the correct per-class sum of prediction scores. By Andreas Müller.
- Made cross_validation.cross_val_score and grid_search.GridSearchCV accept Python lists as input data. This is especially useful for cross-validation and model selection of text processing pipelines. By Andreas Müller.
- Fixed data input checks of most estimators to accept input data that implements the NumPy __array__ protocol. This is the case for pandas.Series and pandas.DataFrame in recent versions of pandas. By Gael Varoquaux.
- Fixed a regression for linear_model.SGDClassifier with class_weight="auto" on data with non-contiguous labels. By Olivier Grisel.
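The one-vs-one tie-breaking fix in this list can be pictured with a small plain-Python sketch. It assumes a simplified setting in which every pair of classes yields one signed score; the function ovo_predict and its input layout are hypothetical and do not reproduce the actual OneVsOneClassifier code.

```python
def ovo_predict(pairwise_scores, n_classes):
    """Pick a class from one-vs-one scores, breaking vote ties by summed scores.

    pairwise_scores maps a pair (i, j) with i < j to a signed score:
    a positive score favours class j, any other score favours class i.
    Hypothetical helper for illustration only.
    """
    votes = [0] * n_classes
    score_sums = [0.0] * n_classes
    for (i, j), score in pairwise_scores.items():
        winner = j if score > 0 else i
        votes[winner] += 1
        # Accumulate the signed score for both classes of the pair.
        score_sums[j] += score
        score_sums[i] -= score
    # Compare by vote count first, then by the per-class sum of scores.
    return max(range(n_classes), key=lambda c: (votes[c], score_sums[c]))


# Every class wins exactly one pairwise contest, so the votes are tied.
tied = {(0, 1): -0.2, (0, 2): 0.9, (1, 2): -0.1}
prediction = ovo_predict(tied, n_classes=3)
```

Here class 2 is selected because its summed score is the largest, even though all three classes have one vote each.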
Version 0.15
July 15, 2014
Highlights
- Many speed and memory improvements all across the code
- Huge speed and memory improvements to random forests (and extra trees) that also benefit better from parallel computing.
- Incremental fit to BernoulliRBM.
- Added cluster.AgglomerativeClustering for hierarchical agglomerative clustering with average linkage, complete linkage and ward strategies.
- Added linear_model.RANSACRegressor for robust regression models.
- Added dimensionality reduction with manifold.TSNE which can be used to visualize high-dimensional data.
Changelog
New features
- Added ensemble.BaggingClassifier and ensemble.BaggingRegressor meta-estimators for ensembling any kind of base estimator. See the Bagging section of the user guide for details and examples. By Gilles Louppe.
- New unsupervised feature selection algorithm feature_selection.VarianceThreshold, by Lars Buitinck.
- Added linear_model.RANSACRegressor meta-estimator for the robust fitting of regression models. By Johannes Schönberger.
- Added cluster.AgglomerativeClustering for hierarchical agglomerative clustering with average linkage, complete linkage and ward strategies, by Nelle Varoquaux and Gael Varoquaux.
- Shorthand constructors pipeline.make_pipeline and pipeline.make_union were added by Lars Buitinck.
- Shuffle option for cross_validation.StratifiedKFold. By Jeffrey Blackburne.
- Incremental learning (partial_fit) for Gaussian Naive Bayes by Imran Haque.
- Added partial_fit to BernoulliRBM. By Danny Sullivan.
- Added learning_curve utility to chart performance with respect to training size. See Plotting Learning Curves. By Alexander Fabisch.
- Add positive option in LassoCV and ElasticNetCV. By Brian Wignall and Alexandre Gramfort.
- Added linear_model.MultiTaskElasticNetCV and linear_model.MultiTaskLassoCV. By Manoj Kumar.
- Added manifold.TSNE. By Alexander Fabisch.
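As a rough picture of what variance-based feature selection such as feature_selection.VarianceThreshold does, the sketch below drops the columns whose variance does not exceed a threshold. It is a standard-library illustration under that assumption; select_by_variance is a hypothetical name, not the library's API.

```python
from statistics import pvariance


def select_by_variance(rows, threshold=0.0):
    """Keep only the columns whose population variance exceeds threshold.

    Hypothetical helper for illustration only; returns the kept column
    indices together with the reduced rows.
    """
    columns = list(zip(*rows))
    keep = [idx for idx, col in enumerate(columns) if pvariance(col) > threshold]
    if not keep:
        raise ValueError("no feature meets the variance threshold")
    return keep, [[row[idx] for idx in keep] for row in rows]


data = [
    [0, 2, 0],
    [0, 1, 4],
    [0, 1, 1],
]
kept_columns, reduced = select_by_variance(data)
```

The first column is constant, so it carries no information for a model and is removed without looking at any target values, which is what makes the selection unsupervised.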
Enhancements
- Add sparse input support to ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor meta-estimators. By Hamzeh Alsalhi.
- Memory improvements of decision trees, by Arnaud Joly.
- Decision trees can now be built in best-first manner by using max_leaf_nodes as the stopping criterion. Refactored the tree code to use either a stack or a priority queue for tree building. By Peter Prettenhofer and Gilles Louppe.
- Decision trees can now be fitted on fortran- and c-style arrays, and non-contiguous arrays without the need to make a copy. If the input array has a different dtype than np.float32, a fortran-style copy will be made since fortran-style memory layout has speed advantages. By Peter Prettenhofer and Gilles Louppe.
- Speed improvement of regression trees by optimizing the computation of the mean square error criterion. This led to speed improvement of the tree, forest and gradient boosting tree modules. By Arnaud Joly.
- The img_to_graph and grid_to_graph functions in sklearn.feature_extraction.image now return np.ndarray instead of np.matrix when return_as=np.ndarray. See the Notes section for more information on compatibility.
- Changed the internal storage of decision trees to use a struct array. This fixed some small bugs, while improving code and providing a small speed gain. By Joel Nothman.
- Reduce memory usage and overhead when fitting and predicting with forests of randomized trees in parallel with n_jobs != 1 by leveraging new threading backend of joblib 0.8 and releasing the GIL in the tree fitting Cython code. By Olivier Grisel and Gilles Louppe.
- Speed improvement of the sklearn.ensemble.gradient_boosting module. By Gilles Louppe and Peter Prettenhofer.
- Various enhancements to the sklearn.ensemble.gradient_boosting module: a warm_start argument to fit additional trees, a max_leaf_nodes argument to fit GBM style trees, a monitor fit argument to inspect the estimator during training, and refactoring of the verbose code. By Peter Prettenhofer.
- Faster sklearn.ensemble.ExtraTrees by caching feature values. By Arnaud Joly.
- Faster depth-based tree building algorithms such as decision tree, random forest, extra trees or gradient tree boosting (with depth-based growing strategy) by avoiding trying to split on found constant features in the sample subset. By Arnaud Joly.
- Add min_weight_fraction_leaf pre-pruning parameter to tree-based methods: the minimum weighted fraction of the input samples required to be at a leaf node. By Noel Dawe.
- Added metrics.pairwise_distances_argmin_min, by Philippe Gervais.
- Added predict method to cluster.AffinityPropagation and cluster.MeanShift, by Mathieu Blondel.
- Vector and matrix multiplications have been optimised throughout the library by Denis Engemann and Alexandre Gramfort. In particular, they should take less memory with older NumPy versions (prior to 1.7.2).
- Precision-recall and ROC examples now use train_test_split, and have more explanation of why these metrics are useful. By Kyle Kastner.
- The training algorithm for decomposition.NMF is faster for sparse matrices and has much lower memory complexity, meaning it will scale up gracefully to large datasets. By Lars Buitinck.
- Added svd_method option, with default value "randomized", to decomposition.FactorAnalysis to save memory and significantly speed up computation. By Denis Engemann and Alexandre Gramfort.
- Changed cross_validation.StratifiedKFold to try and preserve as much of the original ordering of samples as possible so as not to hide overfitting on datasets with a non-negligible level of samples dependency. By Daniel Nouri and Olivier Grisel.
- Add multi-output support to gaussian_process.GaussianProcess by John Novak.
- Support for precomputed distance matrices in nearest neighbor estimators by Robert Layton and Joel Nothman.
- Norm computations optimized for NumPy 1.6 and later versions by Lars Buitinck. In particular, the k-means algorithm no longer needs a temporary data structure the size of its input.
- dummy.DummyClassifier can now be used to predict a constant output value. By Manoj Kumar.
- dummy.DummyRegressor now has a strategy parameter that allows predicting the mean, the median of the training set, or a constant output value. By Maheshakya Wijewardena.
- Multi-label classification output in multilabel indicator format is now supported by metrics.roc_auc_score and metrics.average_precision_score by Arnaud Joly.
- Significant performance improvements (more than 100x speedup for large problems) in isotonic.IsotonicRegression by Andrew Tulloch.
- Speed and memory usage improvements to the SGD algorithm for linear models: it now uses threads, not separate processes, when n_jobs>1. By Lars Buitinck.
- Grid search and cross validation allow NaNs in the input arrays so that preprocessors such as preprocessing.Imputer can be trained within the cross validation loop, avoiding potentially skewed results.
- Ridge regression can now deal with sample weights in feature space (only sample space until then). Both solutions are provided by the Cholesky solver. By Michael Eickenberg.
- Several classification and regression metrics now support weighted samples with the new sample_weight argument: metrics.accuracy_score, metrics.zero_one_loss, metrics.precision_score, metrics.average_precision_score, metrics.f1_score, metrics.fbeta_score, metrics.recall_score, metrics.roc_auc_score, metrics.explained_variance_score, metrics.mean_squared_error, metrics.mean_absolute_error, metrics.r2_score. By Noel Dawe.
- Speed up of the sample generator datasets.make_multilabel_classification. By Joel Nothman.
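The effect of a sample_weight argument on a metric can be sketched with a weighted accuracy written in plain Python. This is an illustration of the idea only, assuming accuracy is the weighted fraction of correct predictions; weighted_accuracy is a hypothetical helper and not the code behind metrics.accuracy_score.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of correct predictions, each sample counted by its weight.

    Hypothetical helper for illustration only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if sample_weight is None:
        # Without weights every sample counts equally.
        sample_weight = [1.0] * len(y_true)
    total = sum(sample_weight)
    if total <= 0:
        raise ValueError("sample weights must sum to a positive value")
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / total


y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
plain = weighted_accuracy(y_true, y_pred)
weighted = weighted_accuracy(y_true, y_pred, sample_weight=[1, 1, 2, 1])
```

The single misclassified sample carries weight 2 in the second call, so the weighted score is lower than the unweighted one.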
Documentation improvements
- The Working With Text Data tutorial has now been integrated into the tutorial section of the main documentation. It includes exercises and skeletons for tutorial presentation. Original tutorial created by several authors including Olivier Grisel, Lars Buitinck and many others. Tutorial integration into the scikit-learn documentation by Jaques Grobler.
- Added Computational Performance documentation. Discussion and examples of prediction latency / throughput and different factors that have influence over speed. Additional tips for building faster models and choosing a relevant compromise between speed and predictive power. By Eustache Diemert.
Bug fixes
- Fixed bug in decomposition.MiniBatchDictionaryLearning: partial_fit was not working properly.
- Fixed bug in linear_model.stochastic_gradient: l1_ratio was used as (1.0 - l1_ratio).
- Fixed bug in multiclass.OneVsOneClassifier with string labels.
- Fixed a bug in LassoCV and ElasticNetCV: they would not pre-compute the Gram matrix with precompute=True or precompute="auto" and n_samples > n_features. By Manoj Kumar.
- Fixed incorrect estimation of the degrees of freedom in feature_selection.f_regression when variates are not centered. By Virgile Fritsch.
- Fixed a race condition in parallel processing with pre_dispatch != "all" (for instance, in cross_val_score). By Olivier Grisel.
- Raise error in cluster.FeatureAgglomeration and cluster.WardAgglomeration when no samples are given, rather than returning meaningless clustering.
- Fixed bug in gradient_boosting.GradientBoostingRegressor with loss='huber': gamma might have not been initialized.
- Fixed feature importances as computed with a forest of randomized trees when fit with sample_weight != None and/or with bootstrap=True. By Gilles Louppe.
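The l1_ratio fix in this list is easier to appreciate with numbers. The sketch below assumes the usual elastic net convention in which l1_ratio=1 means a pure L1 penalty and l1_ratio=0 a pure (halved, squared) L2 penalty; both function names are hypothetical and the code is not taken from the SGD module.

```python
def elastic_net_penalty(weights, l1_ratio):
    """Elastic net penalty mixing L1 and squared L2 terms.

    Assumed convention: l1_ratio=1.0 is pure L1, l1_ratio=0.0 is pure L2.
    Hypothetical helper for illustration only.
    """
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must lie in [0, 1]")
    l1 = sum(abs(w) for w in weights)
    l2 = 0.5 * sum(w * w for w in weights)
    return l1_ratio * l1 + (1.0 - l1_ratio) * l2


def swapped_penalty(weights, l1_ratio):
    """Same penalty with the mixing parameter inverted, as in the bug described above."""
    return elastic_net_penalty(weights, 1.0 - l1_ratio)


w = [1.0, -2.0]
intended = elastic_net_penalty(w, l1_ratio=1.0)  # pure L1: |1| + |-2|
inverted = swapped_penalty(w, l1_ratio=1.0)      # behaves like pure L2 instead
```

With the parameter inverted, a user asking for a sparsity-inducing L1 penalty would in effect get an L2 penalty, which changes which coefficients are driven to zero.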
API changes summary
- sklearn.hmm is deprecated. Its removal is planned for the 0.17 release.
- Use of covariance.EllipticEnvelop has now been removed after deprecation. Please use covariance.EllipticEnvelope instead.
- cluster.Ward is deprecated. Use cluster.AgglomerativeClustering instead.
- cluster.WardClustering is deprecated. Use cluster.AgglomerativeClustering instead.
- cross_validation.Bootstrap is deprecated. cross_validation.KFold or cross_validation.ShuffleSplit are recommended instead.
- Direct support for the sequence of sequences (or list of lists) multilabel format is deprecated. To convert to and from the supported binary indicator matrix format, use MultiLabelBinarizer. By Joel Nothman.
- Add score method to PCA following the model of probabilistic PCA and deprecate ProbabilisticPCA model whose score implementation is not correct. The computation now also exploits the matrix inversion lemma for faster computation. By Alexandre Gramfort.
- The score method of FactorAnalysis now returns the average log-likelihood of the samples. Use score_samples to get log-likelihood of each sample. By Alexandre Gramfort.
- Generating boolean masks (the setting indices=False) from cross-validation generators is deprecated. Support for masks will be removed in 0.17. The generators have produced arrays of indices by default since 0.10. By Joel Nothman.
- 1-d arrays containing strings with dtype=object (as used in Pandas) are now considered valid classification targets. This fixes a regression from version 0.13 in some classifiers. By Joel Nothman.
- Fix wrong explained_variance_ratio_ attribute in RandomizedPCA. By Alexandre Gramfort.
- Fit alphas for each l1_ratio instead of mean_l1_ratio in linear_model.ElasticNetCV and linear_model.LassoCV. This changes the shape of alphas_ from (n_alphas,) to (n_l1_ratio, n_alphas) if the l1_ratio provided is a 1-D array-like object of length greater than one. By Manoj Kumar.
- Fix linear_model.ElasticNetCV and linear_model.LassoCV when fitting intercept and input data is sparse. The automatic grid of alphas was not computed correctly and the scaling with normalize was wrong. By Manoj Kumar.
- Fix wrong maximal number of features drawn (max_features) at each split for decision trees, random forests and gradient tree boosting. Previously, the count for the number of drawn features started only after one non-constant feature in the split. This bug fix will affect computational and generalization performance of those algorithms in the presence of constant features. To get back previous generalization performance, you should modify the value of max_features. By Arnaud Joly.
- Fix wrong maximal number of features drawn (max_features) at each split for ensemble.ExtraTreesClassifier and ensemble.ExtraTreesRegressor. Previously, only non-constant features in the split were counted as drawn. Now constant features are counted as drawn. Furthermore, at least one feature must be non-constant in order to make a valid split. This bug fix will affect computational and generalization performance of extra trees in the presence of constant features. To get back previous generalization performance, you should modify the value of max_features. By Arnaud Joly.
- Fix utils.compute_class_weight when class_weight=="auto". Previously it was broken for input of non-integer dtype and the weighted array that was returned was wrong. By Manoj Kumar.
- Fix cross_validation.Bootstrap to raise ValueError when n_train + n_test > n. By Ronald Phlypo.
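The two multilabel formats mentioned in this list, a list of label lists and a binary indicator matrix, can be related with a short plain-Python sketch. The helpers to_indicator and from_indicator are hypothetical names chosen for this example; they only mimic the kind of conversion that MultiLabelBinarizer provides and are not its implementation.

```python
def to_indicator(label_sets):
    """Convert a list of label lists into (sorted classes, 0/1 indicator matrix).

    Hypothetical helper for illustration only.
    """
    classes = sorted({label for labels in label_sets for label in labels})
    column = {label: idx for idx, label in enumerate(classes)}
    matrix = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        matrix.append(row)
    return classes, matrix


def from_indicator(classes, matrix):
    """Convert an indicator matrix back into a list of label lists."""
    return [[c for c, flag in zip(classes, row) if flag] for row in matrix]


samples = [["b", "a"], ["c"], []]
classes, indicator = to_indicator(samples)
restored = from_indicator(classes, indicator)
```

The indicator matrix has one row per sample and one column per class, so a sample without any label is simply an all-zero row, a case the list-of-lists format represents with an empty list.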
People
List of contributors for release 0.15 by number of commits.
- 312 Olivier Grisel
- 275 Lars Buitinck
- 221 Gael Varoquaux
- 148 Arnaud Joly
- 134 Johannes Schönberger
- 119 Gilles Louppe
- 113 Joel Nothman
- 111 Alexandre Gramfort
- 95 Jaques Grobler
- 89 Denis Engemann
- 83 Peter Prettenhofer
- 83 Alexander Fabisch
- 62 Mathieu Blondel
- 60 Eustache Diemert
- 60 Nelle Varoquaux
- 49 Michael Bommarito
- 45 Manoj-Kumar-S
- 28 Kyle Kastner
- 26 Andreas Mueller
- 22 Noel Dawe
- 21 Maheshakya Wijewardena
- 21 Brooke Osborn
- 21 Hamzeh Alsalhi
- 21 Jake VanderPlas
- 21 Philippe Gervais
- 19 Bala Subrahmanyam Varanasi
- 12 Ronald Phlypo
- 10 Mikhail Korobov
- 8 Thomas Unterthiner
- 8 Jeffrey Blackburne
- 8 eltermann
- 8 bwignall
- 7 Ankit Agrawal
- 7 CJ Carey
- 6 Daniel Nouri
- 6 Chen Liu
- 6 Michael Eickenberg
- 6 ugurthemaster
- 5 Aaron Schumacher
- 5 Baptiste Lagarde
- 5 Rajat Khanduja
- 5 Robert McGibbon
- 5 Sergio Pascual
- 4 Alexis Metaireau
- 4 Ignacio Rossi
- 4 Virgile Fritsch
- 4 Sebastian Säger
- 4 Ilambharathi Kanniah
- 4 sdenton4
- 4 Robert Layton
- 4 Alyssa
- 4 Amos Waterland
- 3 Andrew Tulloch
- 3 murad
- 3 Steven Maude
- 3 Karol Pysniak
- 3 Jacques Kvam
- 3 cgohlke
- 3 cjlin
- 3 Michael Becker
- 3 hamzeh
- 3 Eric Jacobsen
- 3 john collins
- 3 kaushik94
- 3 Erwin Marsi
- 2 csytracy
- 2 LK
- 2 Vlad Niculae
- 2 Laurent Direr
- 2 Erik Shilts
- 2 Raul Garreta
- 2 Yoshiki Vázquez Baeza
- 2 Yung Siang Liau
- 2 abhishek thakur
- 2 James Yu
- 2 Rohit Sivaprasad
- 2 Roland Szabo
- 2 amormachine
- 2 Alexis Mignon
- 2 Oscar Carlsson
- 2 Nantas Nardelli
- 2 jess010
- 2 kowalski87
- 2 Andrew Clegg
- 2 Federico Vaggi
- 2 Simon Frid
- 2 Félix-Antoine Fortin
- 1 Ralf Gommers
- 1 t-aft
- 1 Ronan Amicel
- 1 Rupesh Kumar Srivastava
- 1 Ryan Wang
- 1 Samuel Charron
- 1 Samuel St-Jean
- 1 Fabian Pedregosa
- 1 Skipper Seabold
- 1 Stefan Walk
- 1 Stefan van der Walt
- 1 Stephan Hoyer
- 1 Allen Riddell
- 1 Valentin Haenel
- 1 Vijay Ramesh
- 1 Will Myers
- 1 Yaroslav Halchenko
- 1 Yoni Ben-Meshulam
- 1 Yury V. Zaytsev
- 1 adrinjalali
- 1 ai8rahim
- 1 alemagnani
- 1 alex
- 1 benjamin wilson
- 1 chalmerlowe
- 1 dzikie drożdże
- 1 jamestwebber
- 1 matrixorz
- 1 popo
- 1 samuela
- 1 François Boulogne
- 1 Alexander Measure
- 1 Ethan White
- 1 Guilherme Trein
- 1 Hendrik Heuer
- 1 IvicaJovic
- 1 Jan Hendrik Metzen
- 1 Jean Michel Rouly
- 1 Eduardo Ariño de la Rubia
- 1 Jelle Zijlstra
- 1 Eddy L O Jansson
- 1 Denis
- 1 John
- 1 John Schmidt
- 1 Jorge Cañardo Alastuey
- 1 Joseph Perla
- 1 Joshua Vredevoogd
- 1 José Ricardo
- 1 Julien Miotte
- 1 Kemal Eren
- 1 Kenta Sato
- 1 David Cournapeau
- 1 Kyle Kelley
- 1 Daniele Medri
- 1 Laurent Luce
- 1 Laurent Pierron
- 1 Luis Pedro Coelho
- 1 DanielWeitzenfeld
- 1 Craig Thompson
- 1 Chyi-Kwei Yau
- 1 Matthew Brett
- 1 Matthias Feurer
- 1 Max Linke
- 1 Chris Filo Gorgolewski
- 1 Charles Earl
- 1 Michael Hanke
- 1 Michele Orrù
- 1 Bryan Lunt
- 1 Brian Kearns
- 1 Paul Butler
- 1 Paweł Mandera
- 1 Peter
- 1 Andrew Ash
- 1 Pietro Zambelli
- 1 staubda