Version 0.14#

August 7, 2013

Changelog#

Missing values with sparse and dense matrices can be imputed with the transformer preprocessing.Imputer by Nicolas Trésegnie.
The core implementation of decision trees has been rewritten from scratch, allowing for faster tree induction and lower memory consumption in all tree-based estimators. By Gilles Louppe.
Added ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor, by Noel Dawe and Gilles Louppe. See the AdaBoost section of the user guide for details and examples.
Added grid_search.RandomizedSearchCV and grid_search.ParameterSampler for randomized hyperparameter optimization. By Andreas Müller.
Added biclustering algorithms (sklearn.cluster.bicluster.SpectralCoclustering and sklearn.cluster.bicluster.SpectralBiclustering), data generation methods (sklearn.datasets.make_biclusters and sklearn.datasets.make_checkerboard), and scoring metrics (sklearn.metrics.consensus_score). By Kemal Eren.
Added Restricted Boltzmann Machines (neural_network.BernoulliRBM). By Yann Dauphin.
Python 3 support by Justin Vincent, Lars Buitinck, Subhodeep Moitra and Olivier Grisel. All tests now pass under Python 3.3.
Ability to pass one penalty (alpha value) per target in linear_model.Ridge, by @eickenberg and Mathieu Blondel.
Fixed sklearn.linear_model.stochastic_gradient.py L2 regularization issue (minor practical significance). By Norbert Crombach and Mathieu Blondel .
Added an interactive version of Andreas Müller’s Machine Learning Cheat Sheet (for scikit-learn) to the documentation. See Choosing the right estimator. By Jaques Grobler.
grid_search.GridSearchCV and cross_validation.cross_val_score now support the use of advanced scoring functions such as area under the ROC curve and f-beta scores. See The scoring parameter: defining model evaluation rules for details. By Andreas Müller and Lars Buitinck. Passing a function from sklearn.metrics as score_func is deprecated.
Multi-label classification output is now supported by metrics.accuracy_score, metrics.zero_one_loss, metrics.f1_score, metrics.fbeta_score, metrics.classification_report, metrics.precision_score and metrics.recall_score by Arnaud Joly.
Two new metrics metrics.hamming_loss and metrics.jaccard_similarity_score are added with multi-label support by Arnaud Joly.
Speed and memory usage improvements in feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer, by Jochen Wersdörfer and Roman Sinayev.
The min_df parameter in feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer, which used to be 2, has been reset to 1 to avoid unpleasant surprises (empty vocabularies) for novice users who try it out on tiny document collections. A value of at least 2 is still recommended for practical use.
svm.LinearSVC, linear_model.SGDClassifier and linear_model.SGDRegressor now have a sparsify method that converts their coef_ into a sparse matrix, meaning stored models trained using these estimators can be made much more compact.
linear_model.SGDClassifier now produces multiclass probability estimates when trained under log loss or modified Huber loss.
Hyperlinks to documentation in example code on the website by Martin Luessi.
Fixed bug in preprocessing.MinMaxScaler causing incorrect scaling of the features for non-default feature_range settings. By Andreas Müller.
max_features in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor and all derived ensemble estimators now support percentage values. By Gilles Louppe.
Performance improvements in isotonic.IsotonicRegression by Nelle Varoquaux.
metrics.accuracy_score has an option normalize to return the fraction or the number of correctly classified samples by Arnaud Joly.
Added metrics.log_loss that computes log loss, aka cross-entropy loss. By Jochen Wersdörfer and Lars Buitinck.
A bug that caused ensemble.AdaBoostClassifier’s to output incorrect probabilities has been fixed.
Feature selectors now share a mixin providing consistent transform, inverse_transform and get_support methods. By Joel Nothman.
A fitted grid_search.GridSearchCV or grid_search.RandomizedSearchCV can now generally be pickled. By Joel Nothman.
Refactored and vectorized implementation of metrics.roc_curve and metrics.precision_recall_curve. By Joel Nothman.
The new estimator sklearn.decomposition.TruncatedSVD performs dimensionality reduction using SVD on sparse matrices, and can be used for latent semantic analysis (LSA). By Lars Buitinck.
Added self-contained example of out-of-core learning on text data Out-of-core classification of text documents. By Eustache Diemert.
The default number of components for sklearn.decomposition.RandomizedPCA is now correctly documented to be n_features. This was the default behavior, so programs using it will continue to work as they did.
sklearn.cluster.KMeans now fits several orders of magnitude faster on sparse data (the speedup depends on the sparsity). By Lars Buitinck.
Reduce memory footprint of FastICA by Denis Engemann and Alexandre Gramfort.
Verbose output in sklearn.ensemble.gradient_boosting now uses a column format and prints progress in decreasing frequency. It also shows the remaining time. By Peter Prettenhofer.
sklearn.ensemble.gradient_boosting provides out-of-bag improvement oob_improvement_ rather than the OOB score for model selection. An example that shows how to use OOB estimates to select the number of trees was added. By Peter Prettenhofer.
Most metrics now support string labels for multiclass classification by Arnaud Joly and Lars Buitinck.
New OrthogonalMatchingPursuitCV class by Alexandre Gramfort and Vlad Niculae.
Fixed a bug in sklearn.covariance.GraphLassoCV: the ‘alphas’ parameter now works as expected when given a list of values. By Philippe Gervais.
Fixed an important bug in sklearn.covariance.GraphLassoCV that prevented all folds provided by a CV object to be used (only the first 3 were used). When providing a CV object, execution time may thus increase significantly compared to the previous version (bug results are correct now). By Philippe Gervais.
cross_validation.cross_val_score and the grid_search module is now tested with multi-output data by Arnaud Joly.
datasets.make_multilabel_classification can now return the output in label indicator multilabel format by Arnaud Joly.
K-nearest neighbors, neighbors.KNeighborsRegressor and neighbors.RadiusNeighborsRegressor, and radius neighbors, neighbors.RadiusNeighborsRegressor and neighbors.RadiusNeighborsClassifier support multioutput data by Arnaud Joly.
Random state in LibSVM-based estimators (svm.SVC, svm.NuSVC, svm.OneClassSVM, svm.SVR, svm.NuSVR) can now be controlled. This is useful to ensure consistency in the probability estimates for the classifiers trained with probability=True. By Vlad Niculae.
Out-of-core learning support for discrete naive Bayes classifiers sklearn.naive_bayes.MultinomialNB and sklearn.naive_bayes.BernoulliNB by adding the partial_fit method by Olivier Grisel.
New website design and navigation by Gilles Louppe, Nelle Varoquaux, Vincent Michel and Andreas Müller.
Improved documentation on multi-class, multi-label and multi-output classification by Yannick Schwartz and Arnaud Joly.
Better input and error handling in the sklearn.metrics module by Arnaud Joly and Joel Nothman.
Speed optimization of the hmm module by Mikhail Korobov
Significant speed improvements for sklearn.cluster.DBSCAN by cleverless

API changes summary#

The auc_score was renamed metrics.roc_auc_score.
Testing scikit-learn with sklearn.test() is deprecated. Use nosetests sklearn from the command line.
Feature importances in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor and all derived ensemble estimators are now computed on the fly when accessing the feature_importances_ attribute. Setting compute_importances=True is no longer required. By Gilles Louppe.
linear_model.lasso_path and linear_model.enet_path can return its results in the same format as that of linear_model.lars_path. This is done by setting the return_models parameter to False. By Jaques Grobler and Alexandre Gramfort
grid_search.IterGrid was renamed to grid_search.ParameterGrid.
Fixed bug in KFold causing imperfect class balance in some cases. By Alexandre Gramfort and Tadej Janež.
sklearn.neighbors.BallTree has been refactored, and a sklearn.neighbors.KDTree has been added which shares the same interface. The Ball Tree now works with a wide variety of distance metrics. Both classes have many new methods, including single-tree and dual-tree queries, breadth-first and depth-first searching, and more advanced queries such as kernel density estimation and 2-point correlation functions. By Jake Vanderplas
Support for scipy.spatial.cKDTree within neighbors queries has been removed, and the functionality replaced with the new sklearn.neighbors.KDTree class.
sklearn.neighbors.KernelDensity has been added, which performs efficient kernel density estimation with a variety of kernels.
sklearn.decomposition.KernelPCA now always returns output with n_components components, unless the new parameter remove_zero_eig is set to True. This new behavior is consistent with the way kernel PCA was always documented; previously, the removal of components with zero eigenvalues was tacitly performed on all data.
gcv_mode="auto" no longer tries to perform SVD on a densified sparse matrix in sklearn.linear_model.RidgeCV.
Sparse matrix support in sklearn.decomposition.RandomizedPCA is now deprecated in favor of the new TruncatedSVD.
cross_validation.KFold and cross_validation.StratifiedKFold now enforce n_folds >= 2 otherwise a ValueError is raised. By Olivier Grisel.
datasets.load_files’s charset and charset_errors parameters were renamed encoding and decode_errors.
Attribute oob_score_ in sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier is deprecated and has been replaced by oob_improvement_ .
Attributes in OrthogonalMatchingPursuit have been deprecated (copy_X, Gram, …) and precompute_gram renamed precompute for consistency. See #2224.
sklearn.preprocessing.StandardScaler now converts integer input to float, and raises a warning. Previously it rounded for dense integer input.
sklearn.multiclass.OneVsRestClassifier now has a decision_function method. This will return the distance of each sample from the decision boundary for each class, as long as the underlying estimators implement the decision_function method. By Kyle Kastner.
Better input validation, warning on unexpected shapes for y.

People#

List of contributors for release 0.14 by number of commits.

277 Gilles Louppe
245 Lars Buitinck
187 Andreas Mueller
124 Arnaud Joly
112 Jaques Grobler
109 Gael Varoquaux
107 Olivier Grisel
102 Noel Dawe
99 Kemal Eren
79 Joel Nothman
75 Jake VanderPlas
73 Nelle Varoquaux
71 Vlad Niculae
65 Peter Prettenhofer
64 Alexandre Gramfort
54 Mathieu Blondel
38 Nicolas Trésegnie
35 eustache
27 Denis Engemann
25 Yann N. Dauphin
19 Justin Vincent
17 Robert Layton
15 Doug Coleman
14 Michael Eickenberg
13 Robert Marchman
11 Fabian Pedregosa
11 Philippe Gervais
10 Jim Holmström
10 Tadej Janež
10 syhw
9 Mikhail Korobov
9 Steven De Gryze
8 sergeyf
7 Ben Root
7 Hrishikesh Huilgolkar
6 Kyle Kastner
6 Martin Luessi
6 Rob Speer
5 Federico Vaggi
5 Raul Garreta
5 Rob Zinkov
4 Ken Geis
3 A. Flaxman
3 Denton Cockburn
3 Dougal Sutherland
3 Ian Ozsvald
3 Johannes Schönberger
3 Robert McGibbon
3 Roman Sinayev
3 Szabo Roland
2 Diego Molla
2 Imran Haque
2 Jochen Wersdörfer
2 Sergey Karayev
2 Yannick Schwartz
2 jamestwebber
1 Abhijeet Kolhe
1 Alexander Fabisch
1 Bastiaan van den Berg
1 Benjamin Peterson
1 Daniel Velkov
1 Fazlul Shahriar
1 Felix Brockherde
1 Félix-Antoine Fortin
1 Harikrishnan S
1 Jack Hale
1 JakeMick
1 James McDermott
1 John Benediktsson
1 John Zwinck
1 Joshua Vredevoogd
1 Justin Pati
1 Kevin Hughes
1 Kyle Kelley
1 Matthias Ekman
1 Miroslav Shubernetskiy
1 Naoki Orii
1 Norbert Crombach
1 Rafael Cunha de Almeida
1 Rolando Espinoza La fuente
1 Seamus Abshere
1 Sergey Feldman
1 Sergio Medina
1 Stefano Lattarini
1 Steve Koch
1 Sturla Molden
1 Thomas Jarosch
1 Yaroslav Halchenko