- scikit-learn Tutorials
- An introduction to machine learning with scikit-learn
- A tutorial on statistical-learning for scientific data processing
- Statistical learning: the setting and the estimator object in scikit-learn
- Supervised learning: predicting an output variable from high-dimensional observations
- Model selection: choosing estimators and their parameters
- Unsupervised learning: seeking representations of the data
- Putting it all together
- Finding help
- Working With Text Data
- Tutorial setup
- Loading the 20 newsgroups dataset
- Extracting features from text files
- Training a classifier
- Building a pipeline
- Evaluation of the performance on the test set
- Parameter tuning using grid search
- Exercise 1: Language identification
- Exercise 2: Sentiment Analysis on movie reviews
- Exercise 3: CLI text classification utility
- Where to from here
- 1. Supervised learning
- 1.1. Generalized Linear Models
- 1.1.1. Ordinary Least Squares
- 1.1.2. Ridge Regression
- 1.1.3. Lasso
- 1.1.4. Elastic Net
- 1.1.5. Multi-task Lasso
- 1.1.6. Least Angle Regression
- 1.1.7. LARS Lasso
- 1.1.8. Orthogonal Matching Pursuit (OMP)
- 1.1.9. Bayesian Regression
- 1.1.10. Logistic regression
- 1.1.11. Stochastic Gradient Descent - SGD
- 1.1.12. Perceptron
- 1.1.13. Passive Aggressive Algorithms
- 1.1.14. Robustness regression: outliers and modeling errors
- 1.1.15. Polynomial regression: extending linear models with basis functions
- 1.2. Support Vector Machines
- 1.3. Stochastic Gradient Descent
- 1.4. Nearest Neighbors
- 1.5. Gaussian Processes
- 1.6. Cross decomposition
- 1.7. Naive Bayes
- 1.8. Decision Trees
- 1.9. Ensemble methods
- 1.9.1. Bagging meta-estimator
- 1.9.2. Forests of randomized trees
- 1.9.3. AdaBoost
- 1.9.4. Gradient Tree Boosting
- 1.10. Multiclass and multilabel algorithms
- 1.11. Feature selection
- 1.12. Semi-Supervised
- 1.13. Linear and quadratic discriminant analysis
- 1.14. Isotonic regression
- 2. Unsupervised learning
- 2.1. Gaussian mixture models
- 2.2. Manifold learning
- 2.2.1. Introduction
- 2.2.2. Isomap
- 2.2.3. Locally Linear Embedding
- 2.2.4. Modified Locally Linear Embedding
- 2.2.5. Hessian Eigenmapping
- 2.2.6. Spectral Embedding
- 2.2.7. Local Tangent Space Alignment
- 2.2.8. Multi-dimensional Scaling (MDS)
- 2.2.9. t-distributed Stochastic Neighbor Embedding (t-SNE)
- 2.2.10. Tips on practical use
- 2.3. Clustering
- 2.3.1. Overview of clustering methods
- 2.3.2. K-means
- 2.3.3. Affinity Propagation
- 2.3.4. Mean Shift
- 2.3.5. Spectral clustering
- 2.3.6. Hierarchical clustering
- 2.3.7. DBSCAN
- 2.3.8. Birch
- 2.3.9. Clustering performance evaluation
- 2.4. Biclustering
- 2.5. Decomposing signals in components (matrix factorization problems)
- 2.6. Covariance estimation
- 2.7. Novelty and Outlier Detection
- 2.8. Density Estimation
- 2.9. Neural network models (unsupervised)
- 3. Model selection and evaluation
- 3.1. Cross-validation: evaluating estimator performance
- 3.2. Grid Search: Searching for estimator parameters
- 3.2.1. Exhaustive Grid Search
- 3.2.2. Randomized Parameter Optimization
- 3.2.3. Tips for parameter search
- 3.2.4. Alternatives to brute force parameter search
- 3.2.4.1. Model specific cross-validation
- 3.2.4.1.1. sklearn.linear_model.ElasticNetCV
- 3.2.4.1.2. sklearn.linear_model.LarsCV
- 3.2.4.1.3. sklearn.linear_model.LassoCV
- 3.2.4.1.4. sklearn.linear_model.LassoLarsCV
- 3.2.4.1.5. sklearn.linear_model.LogisticRegressionCV
- 3.2.4.1.6. sklearn.linear_model.MultiTaskElasticNetCV
- 3.2.4.1.7. sklearn.linear_model.MultiTaskLassoCV
- 3.2.4.1.8. sklearn.linear_model.OrthogonalMatchingPursuitCV
- 3.2.4.1.9. sklearn.linear_model.RidgeCV
- 3.2.4.1.10. sklearn.linear_model.RidgeClassifierCV
- 3.2.4.2. Information Criterion
- 3.2.4.3. Out of Bag Estimates
- 3.2.4.3.1. sklearn.ensemble.RandomForestClassifier
- 3.2.4.3.2. sklearn.ensemble.RandomForestRegressor
- 3.2.4.3.3. sklearn.ensemble.ExtraTreesClassifier
- 3.2.4.3.4. sklearn.ensemble.ExtraTreesRegressor
- 3.2.4.3.5. sklearn.ensemble.GradientBoostingClassifier
- 3.2.4.3.6. sklearn.ensemble.GradientBoostingRegressor
- 3.3. Model evaluation: quantifying the quality of predictions
- 3.3.1. The scoring parameter: defining model evaluation rules
- 3.3.2. Classification metrics
- 3.3.2.1. Accuracy score
- 3.3.2.2. Confusion matrix
- 3.3.2.3. Classification report
- 3.3.2.4. Hamming loss
- 3.3.2.5. Jaccard similarity coefficient score
- 3.3.2.6. Precision, recall and F-measures
- 3.3.2.7. Hinge loss
- 3.3.2.8. Log loss
- 3.3.2.9. Matthews correlation coefficient
- 3.3.2.10. Receiver operating characteristic (ROC)
- 3.3.2.11. Zero one loss
- 3.3.3. Multilabel ranking metrics
- 3.3.4. Regression metrics
- 3.3.5. Clustering metrics
- 3.3.6. Dummy estimators
- 3.4. Model persistence
- 3.5. Validation curves: plotting scores to evaluate models
- 4. Dataset transformations
- 4.1. Pipeline and FeatureUnion: combining estimators
- 4.2. Feature extraction
- 4.2.1. Loading features from dicts
- 4.2.2. Feature hashing
- 4.2.3. Text feature extraction
- 4.2.3.1. The Bag of Words representation
- 4.2.3.2. Sparsity
- 4.2.3.3. Common Vectorizer usage
- 4.2.3.4. Tf–idf term weighting
- 4.2.3.5. Decoding text files
- 4.2.3.6. Applications and examples
- 4.2.3.7. Limitations of the Bag of Words representation
- 4.2.3.8. Vectorizing a large text corpus with the hashing trick
- 4.2.3.9. Performing out-of-core scaling with HashingVectorizer
- 4.2.3.10. Customizing the vectorizer classes
- 4.2.4. Image feature extraction
- 4.3. Preprocessing data
- 4.4. Unsupervised dimensionality reduction
- 4.5. Random Projection
- 4.6. Kernel Approximation
- 4.7. Pairwise metrics, Affinities and Kernels
- 4.8. Transforming the prediction target (y)
- 5. Dataset loading utilities
- 5.1. General dataset API
- 5.2. Toy datasets
- 5.3. Sample images
- 5.4. Sample generators
- 5.5. Datasets in svmlight / libsvm format
- 5.6. The Olivetti faces dataset
- 5.7. The 20 newsgroups text dataset
- 5.8. Downloading datasets from the mldata.org repository
- 5.9. The Labeled Faces in the Wild face recognition dataset
- 5.10. Forest covertypes
- 6. Strategies to scale computationally: bigger data
- 7. Computational Performance
- Examples
- General examples
- Examples based on real world datasets
- Biclustering
- Classification
- Clustering
- Covariance estimation
- Cross decomposition
- Dataset examples
- Decomposition
- Ensemble methods
- Tutorial exercises
- Feature Selection
- Gaussian Process for Machine Learning
- Generalized Linear Models
- Manifold learning
- Gaussian Mixture Models
- Model Selection
- Nearest Neighbors
- Neural Networks
- Semi Supervised Classification
- Support Vector Machines
- Working with text documents
- Decision Trees
- Frequently Asked Questions
- What is the project name (a lot of people get it wrong)?
- How do you pronounce the project name?
- Why scikit?
- How can I contribute to scikit-learn?
- Can I add this new algorithm that I (or someone else) just published?
- Can I add this classical algorithm from the 80s?
- Why did you remove HMMs from scikit-learn?
- Will you add graphical models or sequence prediction to scikit-learn?
- Will you add GPU support?
- Do you support PyPy?
- How do I deal with string data (or trees, graphs...)?
- Support
- 0.16
- 0.15.2
- 0.15.1
- 0.15
- 0.14
- 0.13.1
- 0.13
- 0.12.1
- 0.12
- 0.11
- 0.10
- 0.9
- 0.8
- 0.7
- 0.6
- 0.5
- 0.4
- Earlier versions
- External Resources, Videos and Talks
- About us
- Documentation of scikit-learn 0.16-git
- Reference
- sklearn.base: Base classes and utility functions
- sklearn.cluster: Clustering
- Classes
- Functions
- sklearn.cluster.bicluster: Biclustering
- sklearn.covariance: Covariance Estimators
- sklearn.covariance.EmpiricalCovariance
- sklearn.covariance.EllipticEnvelope
- sklearn.covariance.GraphLasso
- sklearn.covariance.GraphLassoCV
- sklearn.covariance.LedoitWolf
- sklearn.covariance.MinCovDet
- sklearn.covariance.OAS
- sklearn.covariance.ShrunkCovariance
- sklearn.covariance.empirical_covariance
- sklearn.covariance.ledoit_wolf
- sklearn.covariance.shrunk_covariance
- sklearn.covariance.oas
- sklearn.covariance.graph_lasso
- sklearn.cross_validation: Cross Validation
- sklearn.cross_validation.KFold
- sklearn.cross_validation.LeaveOneLabelOut
- sklearn.cross_validation.LeaveOneOut
- sklearn.cross_validation.LeavePLabelOut
- sklearn.cross_validation.LeavePOut
- sklearn.cross_validation.StratifiedKFold
- sklearn.cross_validation.ShuffleSplit
- sklearn.cross_validation.StratifiedShuffleSplit
- sklearn.cross_validation.train_test_split
- sklearn.cross_validation.cross_val_score
- sklearn.cross_validation.cross_val_predict
- sklearn.cross_validation.permutation_test_score
- sklearn.cross_validation.check_cv
- sklearn.datasets: Datasets
- Loaders
- sklearn.datasets.clear_data_home
- sklearn.datasets.get_data_home
- sklearn.datasets.fetch_20newsgroups
- sklearn.datasets.fetch_20newsgroups_vectorized
- sklearn.datasets.load_boston
- sklearn.datasets.load_diabetes
- sklearn.datasets.load_digits
- sklearn.datasets.load_files
- sklearn.datasets.load_iris
- sklearn.datasets.load_lfw_pairs
- sklearn.datasets.fetch_lfw_pairs
- sklearn.datasets.load_lfw_people
- sklearn.datasets.fetch_lfw_people
- sklearn.datasets.load_linnerud
- sklearn.datasets.mldata_filename
- sklearn.datasets.fetch_mldata
- sklearn.datasets.fetch_olivetti_faces
- sklearn.datasets.fetch_california_housing
- sklearn.datasets.fetch_covtype
- sklearn.datasets.load_mlcomp
- sklearn.datasets.load_sample_image
- sklearn.datasets.load_sample_images
- sklearn.datasets.load_svmlight_file
- sklearn.datasets.load_svmlight_files
- sklearn.datasets.dump_svmlight_file
- Samples generator
- sklearn.datasets.make_blobs
- sklearn.datasets.make_classification
- sklearn.datasets.make_circles
- sklearn.datasets.make_friedman1
- sklearn.datasets.make_friedman2
- sklearn.datasets.make_friedman3
- sklearn.datasets.make_gaussian_quantiles
- sklearn.datasets.make_hastie_10_2
- sklearn.datasets.make_low_rank_matrix
- sklearn.datasets.make_moons
- sklearn.datasets.make_multilabel_classification
- sklearn.datasets.make_regression
- sklearn.datasets.make_s_curve
- sklearn.datasets.make_sparse_coded_signal
- sklearn.datasets.make_sparse_spd_matrix
- sklearn.datasets.make_sparse_uncorrelated
- sklearn.datasets.make_spd_matrix
- sklearn.datasets.make_swiss_roll
- sklearn.datasets.make_biclusters
- sklearn.datasets.make_checkerboard
- sklearn.decomposition: Matrix Decomposition
- sklearn.decomposition.PCA
- sklearn.decomposition.IncrementalPCA
- sklearn.decomposition.ProjectedGradientNMF
- sklearn.decomposition.RandomizedPCA
- sklearn.decomposition.KernelPCA
- sklearn.decomposition.FactorAnalysis
- sklearn.decomposition.FastICA
- sklearn.decomposition.TruncatedSVD
- sklearn.decomposition.NMF
- sklearn.decomposition.SparsePCA
- sklearn.decomposition.MiniBatchSparsePCA
- sklearn.decomposition.SparseCoder
- sklearn.decomposition.DictionaryLearning
- sklearn.decomposition.MiniBatchDictionaryLearning
- sklearn.decomposition.fastica
- sklearn.decomposition.dict_learning
- sklearn.decomposition.dict_learning_online
- sklearn.decomposition.sparse_encode
- sklearn.dummy: Dummy estimators
- sklearn.ensemble: Ensemble Methods
- sklearn.ensemble.AdaBoostClassifier
- sklearn.ensemble.AdaBoostRegressor
- sklearn.ensemble.BaggingClassifier
- sklearn.ensemble.BaggingRegressor
- sklearn.ensemble.ExtraTreesClassifier
- sklearn.ensemble.ExtraTreesRegressor
- sklearn.ensemble.GradientBoostingClassifier
- sklearn.ensemble.GradientBoostingRegressor
- sklearn.ensemble.RandomForestClassifier
- sklearn.ensemble.RandomTreesEmbedding
- sklearn.ensemble.RandomForestRegressor
- partial dependence
- sklearn.feature_extraction: Feature Extraction
- sklearn.feature_extraction.DictVectorizer
- sklearn.feature_extraction.FeatureHasher
- From images
- From text
- sklearn.feature_selection: Feature Selection
- sklearn.feature_selection.GenericUnivariateSelect
- sklearn.feature_selection.SelectPercentile
- sklearn.feature_selection.SelectKBest
- sklearn.feature_selection.SelectFpr
- sklearn.feature_selection.SelectFdr
- sklearn.feature_selection.SelectFwe
- sklearn.feature_selection.RFE
- sklearn.feature_selection.RFECV
- sklearn.feature_selection.VarianceThreshold
- sklearn.feature_selection.chi2
- sklearn.feature_selection.f_classif
- sklearn.feature_selection.f_regression
- sklearn.gaussian_process: Gaussian Processes
- sklearn.gaussian_process.GaussianProcess
- sklearn.gaussian_process.correlation_models.absolute_exponential
- sklearn.gaussian_process.correlation_models.squared_exponential
- sklearn.gaussian_process.correlation_models.generalized_exponential
- sklearn.gaussian_process.correlation_models.pure_nugget
- sklearn.gaussian_process.correlation_models.cubic
- sklearn.gaussian_process.correlation_models.linear
- sklearn.gaussian_process.regression_models.constant
- sklearn.gaussian_process.regression_models.linear
- sklearn.gaussian_process.regression_models.quadratic
- sklearn.grid_search: Grid Search
- sklearn.isotonic: Isotonic regression
- sklearn.kernel_approximation: Kernel Approximation
- sklearn.lda: Linear Discriminant Analysis
- sklearn.learning_curve: Learning curve evaluation
- sklearn.linear_model: Generalized Linear Models
- sklearn.linear_model.ARDRegression
- sklearn.linear_model.BayesianRidge
- sklearn.linear_model.ElasticNet
- sklearn.linear_model.ElasticNetCV
- sklearn.linear_model.Lars
- sklearn.linear_model.LarsCV
- sklearn.linear_model.Lasso
- sklearn.linear_model.LassoCV
- sklearn.linear_model.LassoLars
- sklearn.linear_model.LassoLarsCV
- sklearn.linear_model.LassoLarsIC
- sklearn.linear_model.LinearRegression
- sklearn.linear_model.LogisticRegression
- sklearn.linear_model.LogisticRegressionCV
- sklearn.linear_model.MultiTaskLasso
- sklearn.linear_model.MultiTaskElasticNet
- sklearn.linear_model.MultiTaskLassoCV
- sklearn.linear_model.MultiTaskElasticNetCV
- sklearn.linear_model.OrthogonalMatchingPursuit
- sklearn.linear_model.OrthogonalMatchingPursuitCV
- sklearn.linear_model.PassiveAggressiveClassifier
- sklearn.linear_model.PassiveAggressiveRegressor
- sklearn.linear_model.Perceptron
- sklearn.linear_model.RandomizedLasso
- sklearn.linear_model.RandomizedLogisticRegression
- sklearn.linear_model.RANSACRegressor
- sklearn.linear_model.Ridge
- sklearn.linear_model.RidgeClassifier
- sklearn.linear_model.RidgeClassifierCV
- sklearn.linear_model.RidgeCV
- sklearn.linear_model.SGDClassifier
- sklearn.linear_model.SGDRegressor
- sklearn.linear_model.TheilSenRegressor
- sklearn.linear_model.lars_path
- sklearn.linear_model.lasso_path
- sklearn.linear_model.lasso_stability_path
- sklearn.linear_model.orthogonal_mp
- sklearn.linear_model.orthogonal_mp_gram
- sklearn.manifold: Manifold Learning
- sklearn.metrics: Metrics
- Model Selection Interface
- Classification metrics
- sklearn.metrics.accuracy_score
- sklearn.metrics.auc
- sklearn.metrics.average_precision_score
- sklearn.metrics.classification_report
- sklearn.metrics.confusion_matrix
- sklearn.metrics.f1_score
- sklearn.metrics.fbeta_score
- sklearn.metrics.hamming_loss
- sklearn.metrics.hinge_loss
- sklearn.metrics.jaccard_similarity_score
- sklearn.metrics.log_loss
- sklearn.metrics.matthews_corrcoef
- sklearn.metrics.precision_recall_curve
- sklearn.metrics.precision_recall_fscore_support
- sklearn.metrics.precision_score
- sklearn.metrics.recall_score
- sklearn.metrics.roc_auc_score
- sklearn.metrics.roc_curve
- sklearn.metrics.zero_one_loss
- Regression metrics
- Multilabel ranking metrics
- Clustering metrics
- sklearn.metrics.adjusted_mutual_info_score
- sklearn.metrics.adjusted_rand_score
- sklearn.metrics.completeness_score
- sklearn.metrics.homogeneity_completeness_v_measure
- sklearn.metrics.homogeneity_score
- sklearn.metrics.mutual_info_score
- sklearn.metrics.normalized_mutual_info_score
- sklearn.metrics.silhouette_score
- sklearn.metrics.silhouette_samples
- sklearn.metrics.v_measure_score
- Biclustering metrics
- Pairwise metrics
- sklearn.metrics.pairwise.additive_chi2_kernel
- sklearn.metrics.pairwise.chi2_kernel
- sklearn.metrics.pairwise.distance_metrics
- sklearn.metrics.pairwise.euclidean_distances
- sklearn.metrics.pairwise.kernel_metrics
- sklearn.metrics.pairwise.linear_kernel
- sklearn.metrics.pairwise.manhattan_distances
- sklearn.metrics.pairwise.pairwise_distances
- sklearn.metrics.pairwise.pairwise_kernels
- sklearn.metrics.pairwise.polynomial_kernel
- sklearn.metrics.pairwise.rbf_kernel
- sklearn.metrics.pairwise_distances
- sklearn.metrics.pairwise_distances_argmin
- sklearn.metrics.pairwise_distances_argmin_min
- sklearn.mixture: Gaussian Mixture Models
- sklearn.multiclass: Multiclass and multilabel classification
- sklearn.naive_bayes: Naive Bayes
- sklearn.neighbors: Nearest Neighbors
- sklearn.neighbors.NearestNeighbors
- sklearn.neighbors.KNeighborsClassifier
- sklearn.neighbors.RadiusNeighborsClassifier
- sklearn.neighbors.KNeighborsRegressor
- sklearn.neighbors.RadiusNeighborsRegressor
- sklearn.neighbors.NearestCentroid
- sklearn.neighbors.BallTree
- sklearn.neighbors.KDTree
- sklearn.neighbors.LSHForest
- sklearn.neighbors.DistanceMetric
- sklearn.neighbors.KernelDensity
- sklearn.neighbors.kneighbors_graph
- sklearn.neighbors.radius_neighbors_graph
- sklearn.neural_network: Neural network models
- sklearn.cross_decomposition: Cross decomposition
- sklearn.pipeline: Pipeline
- sklearn.preprocessing: Preprocessing and Normalization
- sklearn.preprocessing.Binarizer
- sklearn.preprocessing.Imputer
- sklearn.preprocessing.KernelCenterer
- sklearn.preprocessing.LabelBinarizer
- sklearn.preprocessing.LabelEncoder
- sklearn.preprocessing.MultiLabelBinarizer
- sklearn.preprocessing.MinMaxScaler
- sklearn.preprocessing.Normalizer
- sklearn.preprocessing.OneHotEncoder
- sklearn.preprocessing.StandardScaler
- sklearn.preprocessing.PolynomialFeatures
- sklearn.preprocessing.add_dummy_feature
- sklearn.preprocessing.binarize
- sklearn.preprocessing.label_binarize
- sklearn.preprocessing.normalize
- sklearn.preprocessing.scale
- sklearn.qda: Quadratic Discriminant Analysis
- sklearn.random_projection: Random projection
- sklearn.semi_supervised: Semi-Supervised Learning
- sklearn.svm: Support Vector Machines
- sklearn.tree: Decision Trees
- sklearn.utils: Utilities
- Who is using scikit-learn?
- Contributing
- Developers’ Tips for Debugging
- Maintainer / core-developer information
- How to optimize for speed
- Utilities for Developers
- Installing scikit-learn
- An introduction to machine learning with scikit-learn
- Choosing the right estimator
Classification
Identifying which category an object belongs to.
Applications: Spam detection, Image recognition. Algorithms: SVM, nearest neighbors, random forest, ...
Regression
Predicting a continuous-valued attribute associated with an object.
Applications: Drug response, Stock prices. Algorithms: SVR, ridge regression, Lasso, ...
Clustering
Automatic grouping of similar objects into sets.
Applications: Customer segmentation, Grouping experiment outcomes. Algorithms: k-Means, spectral clustering, mean-shift, ...
Dimensionality reduction
Reducing the number of random variables to consider.
Applications: Visualization, Increased efficiency. Algorithms:
Model selection
Comparing, validating and choosing parameters and models.
Goal: Improved accuracy via parameter tuning. Modules:
Preprocessing
Feature extraction and normalization.
Application: Transforming input data such as text for use with machine learning algorithms. Modules: preprocessing, feature extraction.
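As a rough illustration of how several of the task families above look in code, here is a minimal sketch. The choice of the iris dataset, the specific estimators (`StandardScaler`, `SVC`, `Ridge`, `KMeans`) and all parameter values are assumptions made for this example, not recommendations from this page. Scores are computed on the training data only to keep the sketch short, so they say nothing about generalization.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative dataset: 150 flowers, 4 numeric features, 3 species.
iris = load_iris()
X, y = iris.data, iris.target

# Preprocessing: rescale each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Classification: predict the species (a category) with a linear SVM.
clf = SVC(kernel="linear").fit(X_scaled, y)
train_accuracy = clf.score(X_scaled, y)

# Regression: predict a continuous value (petal width, column 3)
# from the three other scaled features.
reg = Ridge(alpha=1.0).fit(X_scaled[:, :3], X[:, 3])
petal_width_pred = reg.predict(X_scaled[:, :3])

# Clustering: group the samples into 3 sets without using the labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
cluster_labels = km.labels_

print("training accuracy:", train_accuracy)
print("first predicted petal widths:", petal_width_pred[:3])
print("first cluster labels:", cluster_labels[:10])
```

Each estimator follows the same pattern: construct it with its parameters, call `fit` on the data, then use `predict`, `transform`, `score` or a fitted attribute such as `labels_`.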
News
- On-going development: What's new (changelog)
- July 2014. scikit-learn 0.15.0 is available for download (Changelog).
- July 14-20th, 2014: international sprint. During this week-long sprint, we gathered 18 of the core contributors in Paris. We want to thank our sponsors: Paris-Saclay Center for Data Science & Digicosme and our hosts La Paillasse, Criteo, Inria, and tinyclues.
- August 2013. scikit-learn 0.14 is available for download (Changelog).
Community
- About us: See authors # scikit-learn
- More Machine Learning: Find related projects
- Questions? See stackoverflow # scikit-learn
- Mailing list: scikit-learn-general@lists.sourceforge.net
- IRC: #scikit-learn @ freenode