Version 0.14
August 7, 2013
Changelog
Missing values with sparse and dense matrices can be imputed with the transformer preprocessing.Imputer by Nicolas Trésegnie.
The core implementation of decision trees has been rewritten from scratch, allowing for faster tree induction and lower memory consumption in all tree-based estimators. By Gilles Louppe.
Added ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor, by Noel Dawe and Gilles Louppe. See the AdaBoost section of the user guide for details and examples.
Added grid_search.RandomizedSearchCV and grid_search.ParameterSampler for randomized hyperparameter optimization. By Andreas Müller.
Added biclustering algorithms (sklearn.cluster.bicluster.SpectralCoclustering and sklearn.cluster.bicluster.SpectralBiclustering), data generation methods (sklearn.datasets.make_biclusters and sklearn.datasets.make_checkerboard), and scoring metrics (sklearn.metrics.consensus_score). By Kemal Eren.
Added Restricted Boltzmann Machines (neural_network.BernoulliRBM). By Yann Dauphin.
Python 3 support by Justin Vincent, Lars Buitinck, Subhodeep Moitra and Olivier Grisel. All tests now pass under Python 3.3.
Ability to pass one penalty (alpha value) per target in linear_model.Ridge, by @eickenberg and Mathieu Blondel.
Fixed sklearn.linear_model.stochastic_gradient.py L2 regularization issue (minor practical significance). By Norbert Crombach and Mathieu Blondel.
Added an interactive version of Andreas Müller’s Machine Learning Cheat Sheet (for scikit-learn) to the documentation. See Choosing the right estimator. By Jaques Grobler.
grid_search.GridSearchCV and cross_validation.cross_val_score now support the use of advanced scoring functions such as area under the ROC curve and f-beta scores. See The scoring parameter: defining model evaluation rules for details. By Andreas Müller and Lars Buitinck. Passing a function from sklearn.metrics as score_func is deprecated.
Multi-label classification output is now supported by metrics.accuracy_score, metrics.zero_one_loss, metrics.f1_score, metrics.fbeta_score, metrics.classification_report, metrics.precision_score and metrics.recall_score by Arnaud Joly.
Two new metrics, metrics.hamming_loss and metrics.jaccard_similarity_score, are added with multi-label support by Arnaud Joly.
Speed and memory usage improvements in feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer, by Jochen Wersdörfer and Roman Sinayev.
The min_df parameter in feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer, which used to be 2, has been reset to 1 to avoid unpleasant surprises (empty vocabularies) for novice users who try it out on tiny document collections. A value of at least 2 is still recommended for practical use.
svm.LinearSVC, linear_model.SGDClassifier and linear_model.SGDRegressor now have a sparsify method that converts their coef_ into a sparse matrix, meaning stored models trained using these estimators can be made much more compact.
linear_model.SGDClassifier now produces multiclass probability estimates when trained under log loss or modified Huber loss.
Hyperlinks to documentation in example code on the website by Martin Luessi.
Fixed bug in preprocessing.MinMaxScaler causing incorrect scaling of the features for non-default feature_range settings. By Andreas Müller.
max_features in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor and all derived ensemble estimators now supports percentage values. By Gilles Louppe.
Performance improvements in isotonic.IsotonicRegression by Nelle Varoquaux.
metrics.accuracy_score has an option normalize to return the fraction or the number of correctly classified samples by Arnaud Joly.
Added metrics.log_loss that computes log loss, aka cross-entropy loss. By Jochen Wersdörfer and Lars Buitinck.
A bug that caused ensemble.AdaBoostClassifier to output incorrect probabilities has been fixed.
Feature selectors now share a mixin providing consistent transform, inverse_transform and get_support methods. By Joel Nothman.
A fitted grid_search.GridSearchCV or grid_search.RandomizedSearchCV can now generally be pickled. By Joel Nothman.
Refactored and vectorized implementation of metrics.roc_curve and metrics.precision_recall_curve. By Joel Nothman.
The new estimator sklearn.decomposition.TruncatedSVD performs dimensionality reduction using SVD on sparse matrices, and can be used for latent semantic analysis (LSA). By Lars Buitinck.
Added a self-contained example of out-of-core learning on text data, Out-of-core classification of text documents. By Eustache Diemert.
The default number of components for sklearn.decomposition.RandomizedPCA is now correctly documented to be n_features. This was the default behavior, so programs using it will continue to work as they did.
sklearn.cluster.KMeans now fits several orders of magnitude faster on sparse data (the speedup depends on the sparsity). By Lars Buitinck.
Reduced memory footprint of FastICA by Denis Engemann and Alexandre Gramfort.
Verbose output in sklearn.ensemble.gradient_boosting now uses a column format and prints progress in decreasing frequency. It also shows the remaining time. By Peter Prettenhofer.
sklearn.ensemble.gradient_boosting provides out-of-bag improvement oob_improvement_ rather than the OOB score for model selection. An example that shows how to use OOB estimates to select the number of trees was added. By Peter Prettenhofer.
Most metrics now support string labels for multiclass classification by Arnaud Joly and Lars Buitinck.
New OrthogonalMatchingPursuitCV class by Alexandre Gramfort and Vlad Niculae.
Fixed a bug in sklearn.covariance.GraphLassoCV: the ‘alphas’ parameter now works as expected when given a list of values. By Philippe Gervais.
Fixed an important bug in sklearn.covariance.GraphLassoCV that prevented all folds provided by a CV object from being used (only the first 3 were used). When providing a CV object, execution time may thus increase significantly compared to the previous version (but results are correct now). By Philippe Gervais.
cross_validation.cross_val_score and the grid_search module are now tested with multi-output data by Arnaud Joly.
datasets.make_multilabel_classification can now return the output in label indicator multilabel format by Arnaud Joly.
K-nearest neighbors, neighbors.KNeighborsRegressor and neighbors.KNeighborsClassifier, and radius neighbors, neighbors.RadiusNeighborsRegressor and neighbors.RadiusNeighborsClassifier support multioutput data by Arnaud Joly.
Random state in LibSVM-based estimators (svm.SVC, svm.NuSVC, svm.OneClassSVM, svm.SVR, svm.NuSVR) can now be controlled. This is useful to ensure consistency in the probability estimates for the classifiers trained with probability=True. By Vlad Niculae.
Out-of-core learning support for discrete naive Bayes classifiers sklearn.naive_bayes.MultinomialNB and sklearn.naive_bayes.BernoulliNB by adding the partial_fit method by Olivier Grisel.
New website design and navigation by Gilles Louppe, Nelle Varoquaux, Vincent Michel and Andreas Müller.
Improved documentation on multi-class, multi-label and multi-output classification by Yannick Schwartz and Arnaud Joly.
Better input and error handling in the sklearn.metrics module by Arnaud Joly and Joel Nothman.
Speed optimization of the hmm module by Mikhail Korobov.
Significant speed improvements for sklearn.cluster.DBSCAN by cleverless.
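Several metric entries in the changelog above (metrics.log_loss, the normalize option of metrics.accuracy_score, and metrics.hamming_loss) describe simple computations. The following is a minimal plain-Python sketch of what those quantities mean, not the library's implementation: the sketch_* function names are hypothetical, the log loss is restricted to binary labels in {0, 1}, and the clipping constant eps is an assumption made here to keep the logarithm finite.

```python
import math


def sketch_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood (cross-entropy) for binary labels in {0, 1}.

    y_prob holds the predicted probability of class 1 for each sample.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log() never receives 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def sketch_accuracy(y_true, y_pred, normalize=True):
    """Fraction of correct predictions, or their raw count if normalize=False."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) if normalize else correct


def sketch_hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are wrong.

    Both arguments are lists of label-indicator rows, e.g. [1, 0, 1].
    """
    wrong = sum(
        1
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
        if t != p
    )
    n_labels = sum(len(row) for row in y_true)
    return wrong / n_labels


# Three of four predictions match.
fraction = sketch_accuracy([0, 1, 1, 0], [0, 1, 0, 0])                # 0.75
count = sketch_accuracy([0, 1, 1, 0], [0, 1, 0, 0], normalize=False)  # 3

# An uninformative 0.5 prediction costs log(2) per sample.
uninformed = sketch_log_loss([1, 0], [0.5, 0.5])

# Two of six individual labels differ.
label_error = sketch_hamming_loss([[1, 0, 1], [0, 1, 0]],
                                  [[1, 1, 1], [0, 1, 1]])
```

In the multi-label case the Hamming loss counts each label separately, so a prediction that gets two of three labels right is penalized less than it would be by an exact-match accuracy.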
API changes summary
auc_score was renamed to metrics.roc_auc_score.
Testing scikit-learn with sklearn.test() is deprecated. Use nosetests sklearn from the command line.
Feature importances in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor and all derived ensemble estimators are now computed on the fly when accessing the feature_importances_ attribute. Setting compute_importances=True is no longer required. By Gilles Louppe.
linear_model.lasso_path and linear_model.enet_path can return their results in the same format as that of linear_model.lars_path. This is done by setting the return_models parameter to False. By Jaques Grobler and Alexandre Gramfort.
grid_search.IterGrid was renamed to grid_search.ParameterGrid.
Fixed bug in KFold causing imperfect class balance in some cases. By Alexandre Gramfort and Tadej Janež.
sklearn.neighbors.BallTree has been refactored, and a sklearn.neighbors.KDTree has been added which shares the same interface. The Ball Tree now works with a wide variety of distance metrics. Both classes have many new methods, including single-tree and dual-tree queries, breadth-first and depth-first searching, and more advanced queries such as kernel density estimation and 2-point correlation functions. By Jake Vanderplas.
Support for scipy.spatial.cKDTree within neighbors queries has been removed, and the functionality replaced with the new sklearn.neighbors.KDTree class.
sklearn.neighbors.KernelDensity has been added, which performs efficient kernel density estimation with a variety of kernels.
sklearn.decomposition.KernelPCA now always returns output with n_components components, unless the new parameter remove_zero_eig is set to True. This new behavior is consistent with the way kernel PCA was always documented; previously, the removal of components with zero eigenvalues was tacitly performed on all data.
gcv_mode="auto" no longer tries to perform SVD on a densified sparse matrix in sklearn.linear_model.RidgeCV.
Sparse matrix support in sklearn.decomposition.RandomizedPCA is now deprecated in favor of the new TruncatedSVD.
cross_validation.KFold and cross_validation.StratifiedKFold now enforce n_folds >= 2; otherwise a ValueError is raised. By Olivier Grisel.
datasets.load_files’s charset and charset_errors parameters were renamed encoding and decode_errors.
Attribute oob_score_ in sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier is deprecated and has been replaced by oob_improvement_.
Attributes in OrthogonalMatchingPursuit have been deprecated (copy_X, Gram, …) and precompute_gram renamed precompute for consistency. See #2224.
sklearn.preprocessing.StandardScaler now converts integer input to float, and raises a warning. Previously it rounded for dense integer input.
sklearn.multiclass.OneVsRestClassifier now has a decision_function method. This will return the distance of each sample from the decision boundary for each class, as long as the underlying estimators implement the decision_function method. By Kyle Kastner.
Better input validation, warning on unexpected shapes for y.
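The n_folds >= 2 rule listed above for cross_validation.KFold can be illustrated with a small plain-Python sketch. This is not the library's code: sketch_kfold_indices is a hypothetical helper, it splits contiguous index ranges without shuffling, and the extra check that n_folds does not exceed the number of samples is an assumption added here for completeness.

```python
def sketch_kfold_indices(n_samples, n_folds):
    """Return a list of (train_indices, test_indices) pairs for k-fold CV.

    Raises ValueError when fewer than two folds are requested, since a
    single fold would leave no data to train on.
    """
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2, got %d" % n_folds)
    if n_folds > n_samples:
        raise ValueError("n_folds cannot exceed n_samples")

    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1

    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = sketch_kfold_indices(5, 2)
# folds[0] == ([3, 4], [0, 1, 2])
# folds[1] == ([0, 1, 2], [3, 4])

try:
    sketch_kfold_indices(5, 1)
    rejected = False
except ValueError:
    rejected = True  # a single fold is refused up front
```

Failing early with an explicit error is preferable to returning one fold with an empty training set, which would only surface later as a confusing failure inside an estimator's fit method.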
People
List of contributors for release 0.14 by number of commits.
277 Gilles Louppe
245 Lars Buitinck
187 Andreas Mueller
124 Arnaud Joly
112 Jaques Grobler
109 Gael Varoquaux
107 Olivier Grisel
102 Noel Dawe
99 Kemal Eren
79 Joel Nothman
75 Jake VanderPlas
73 Nelle Varoquaux
71 Vlad Niculae
65 Peter Prettenhofer
64 Alexandre Gramfort
54 Mathieu Blondel
38 Nicolas Trésegnie
35 eustache
27 Denis Engemann
25 Yann N. Dauphin
19 Justin Vincent
17 Robert Layton
15 Doug Coleman
14 Michael Eickenberg
13 Robert Marchman
11 Fabian Pedregosa
11 Philippe Gervais
10 Jim Holmström
10 Tadej Janež
10 syhw
9 Mikhail Korobov
9 Steven De Gryze
8 sergeyf
7 Ben Root
7 Hrishikesh Huilgolkar
6 Kyle Kastner
6 Martin Luessi
6 Rob Speer
5 Federico Vaggi
5 Raul Garreta
5 Rob Zinkov
4 Ken Geis
3 A. Flaxman
3 Denton Cockburn
3 Dougal Sutherland
3 Ian Ozsvald
3 Johannes Schönberger
3 Robert McGibbon
3 Roman Sinayev
3 Szabo Roland
2 Diego Molla
2 Imran Haque
2 Jochen Wersdörfer
2 Sergey Karayev
2 Yannick Schwartz
2 jamestwebber
1 Abhijeet Kolhe
1 Alexander Fabisch
1 Bastiaan van den Berg
1 Benjamin Peterson
1 Daniel Velkov
1 Fazlul Shahriar
1 Felix Brockherde
1 Félix-Antoine Fortin
1 Harikrishnan S
1 Jack Hale
1 JakeMick
1 James McDermott
1 John Benediktsson
1 John Zwinck
1 Joshua Vredevoogd
1 Justin Pati
1 Kevin Hughes
1 Kyle Kelley
1 Matthias Ekman
1 Miroslav Shubernetskiy
1 Naoki Orii
1 Norbert Crombach
1 Rafael Cunha de Almeida
1 Rolando Espinoza La fuente
1 Seamus Abshere
1 Sergey Feldman
1 Sergio Medina
1 Stefano Lattarini
1 Steve Koch
1 Sturla Molden
1 Thomas Jarosch
1 Yaroslav Halchenko