.. _example_applications_plot_out_of_core_classification.py:


======================================================
Out-of-core classification of text documents
======================================================

This is an example showing how scikit-learn can be used for classification
using an out-of-core approach: learning from data that doesn't fit into main
memory. We make use of an online classifier, i.e., one that supports the
``partial_fit`` method, which is fed with batches of examples.

To guarantee that the feature space remains the same over time, we leverage a
HashingVectorizer that projects each example into the same feature space.
This is especially useful in the case of text classification, where new
features (words) may appear in each batch.

The dataset used in this example is Reuters-21578 as provided by the UCI ML
repository. It will be automatically downloaded and uncompressed on first run.

The plot represents the learning curve of the classifier: the evolution
of classification accuracy over the course of the mini-batches. Accuracy is
measured on the first 1000 samples, held out as a validation set.

To limit the memory consumption, we queue examples up to a fixed amount before
feeding them to the learner.


.. rst-class:: horizontal


    *

      .. image:: images/plot_out_of_core_classification_001.png
            :scale: 47

    *

      .. image:: images/plot_out_of_core_classification_002.png
            :scale: 47

    *

      .. image:: images/plot_out_of_core_classification_003.png
            :scale: 47

    *

      .. image:: images/plot_out_of_core_classification_004.png
            :scale: 47


**Script output**::

    downloading dataset (once and for all) into /root/scikit_learn_data/reuters
    untarring Reuters dataset...
    done.
    Test set is 955 documents (93 positive)
    Passive-Aggressive classifier : 969 train docs ( 155 positive) 955 test docs ( 93 positive) accuracy: 0.908 in 2.46s ( 393 docs/s)
    NB Multinomial classifier : 969 train docs ( 155 positive) 955 test docs ( 93 positive) accuracy: 0.904 in 2.50s ( 387 docs/s)
    SGD classifier : 969 train docs ( 155 positive) 955 test docs ( 93 positive) accuracy: 0.917 in 2.50s ( 387 docs/s)
    Perceptron classifier : 969 train docs ( 155 positive) 955 test docs ( 93 positive) accuracy: 0.933 in 2.51s ( 386 docs/s)
    Passive-Aggressive classifier : 3911 train docs ( 510 positive) 955 test docs ( 93 positive) accuracy: 0.946 in 6.36s ( 615 docs/s)
    NB Multinomial classifier : 3911 train docs ( 510 positive) 955 test docs ( 93 positive) accuracy: 0.914 in 6.39s ( 611 docs/s)
    SGD classifier : 3911 train docs ( 510 positive) 955 test docs ( 93 positive) accuracy: 0.952 in 6.40s ( 611 docs/s)
    Perceptron classifier : 3911 train docs ( 510 positive) 955 test docs ( 93 positive) accuracy: 0.938 in 6.40s ( 611 docs/s)
    Passive-Aggressive classifier : 5832 train docs ( 733 positive) 955 test docs ( 93 positive) accuracy: 0.959 in 9.84s ( 592 docs/s)
    NB Multinomial classifier : 5832 train docs ( 733 positive) 955 test docs ( 93 positive) accuracy: 0.920 in 9.88s ( 590 docs/s)
    SGD classifier : 5832 train docs ( 733 positive) 955 test docs ( 93 positive) accuracy: 0.937 in 9.88s ( 590 docs/s)
    Perceptron classifier : 5832 train docs ( 733 positive) 955 test docs ( 93 positive) accuracy: 0.931 in 9.89s ( 589 docs/s)
    Passive-Aggressive classifier : 8647 train docs ( 1090 positive) 955 test docs ( 93 positive) accuracy: 0.955 in 13.75s ( 628 docs/s)
    NB Multinomial classifier : 8647 train docs ( 1090 positive) 955 test docs ( 93 positive) accuracy: 0.934 in 13.79s ( 627 docs/s)
    SGD classifier : 8647 train docs ( 1090 positive) 955 test docs ( 93 positive) accuracy: 0.953 in 13.79s ( 627 docs/s)
    Perceptron classifier : 8647 train docs ( 1090 positive) 955 test docs ( 93 positive) accuracy: 0.947 in 13.79s ( 626 docs/s)
    Passive-Aggressive classifier : 11468 train docs ( 1548 positive) 955 test docs ( 93 positive) accuracy: 0.957 in 17.57s ( 652 docs/s)
    NB Multinomial classifier : 11468 train docs ( 1548 positive) 955 test docs ( 93 positive) accuracy: 0.947 in 17.61s ( 651 docs/s)
    SGD classifier : 11468 train docs ( 1548 positive) 955 test docs ( 93 positive) accuracy: 0.956 in 17.61s ( 651 docs/s)
    Perceptron classifier : 11468 train docs ( 1548 positive) 955 test docs ( 93 positive) accuracy: 0.936 in 17.61s ( 651 docs/s)
    Passive-Aggressive classifier : 14335 train docs ( 1823 positive) 955 test docs ( 93 positive) accuracy: 0.959 in 21.41s ( 669 docs/s)
    NB Multinomial classifier : 14335 train docs ( 1823 positive) 955 test docs ( 93 positive) accuracy: 0.947 in 21.44s ( 668 docs/s)
    SGD classifier : 14335 train docs ( 1823 positive) 955 test docs ( 93 positive) accuracy: 0.955 in 21.45s ( 668 docs/s)
    Perceptron classifier : 14335 train docs ( 1823 positive) 955 test docs ( 93 positive) accuracy: 0.941 in 21.45s ( 668 docs/s)
    Passive-Aggressive classifier : 17206 train docs ( 2148 positive) 955 test docs ( 93 positive) accuracy: 0.961 in 25.22s ( 682 docs/s)
    NB Multinomial classifier : 17206 train docs ( 2148 positive) 955 test docs ( 93 positive) accuracy: 0.946 in 25.26s ( 681 docs/s)
    SGD classifier : 17206 train docs ( 2148 positive) 955 test docs ( 93 positive) accuracy: 0.957 in 25.26s ( 681 docs/s)
    Perceptron classifier : 17206 train docs ( 2148 positive) 955 test docs ( 93 positive) accuracy: 0.946 in 25.26s ( 681 docs/s)


**Python source code:** :download:`plot_out_of_core_classification.py`

.. literalinclude:: plot_out_of_core_classification.py
    :lines: 25-

**Total running time of the example:**  29.47 seconds
( 0 minutes 29.47 seconds)
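
The out-of-core loop described in the introduction can be condensed into a
minimal sketch. It is not the example script itself: the toy corpus
``TOY_DOCS`` and the helpers ``document_stream`` and ``iter_minibatches`` are
hypothetical stand-ins for the Reuters reader, and the batch size and
``n_features`` value are arbitrary assumptions. It only illustrates the
pattern of hashing each mini-batch into a fixed feature space, holding out the
first batch for validation, and calling ``partial_fit`` on the rest.

.. code-block:: python

    import itertools

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # Hypothetical toy corpus standing in for the Reuters stream: (text, label).
    TOY_DOCS = [
        ("company acquires rival in merger deal", 1),
        ("shareholders approve takeover bid", 1),
        ("wheat harvest hit by dry weather", 0),
        ("central bank holds interest rates", 0),
    ]


    def document_stream(n_docs):
        """Yield n_docs (text, label) pairs lazily, as a file reader would."""
        return itertools.islice(itertools.cycle(TOY_DOCS), n_docs)


    def iter_minibatches(stream, batch_size):
        """Queue up to batch_size documents, then yield them as one batch."""
        stream = iter(stream)
        while True:
            batch = list(itertools.islice(stream, batch_size))
            if not batch:
                return
            texts, labels = zip(*batch)
            yield list(texts), np.asarray(labels)


    # Stateless vectorizer: there is no fit step, so every batch is mapped
    # to the same columns even when it contains previously unseen words.
    vectorizer = HashingVectorizer(n_features=2 ** 18)
    classifier = SGDClassifier(random_state=0)
    all_classes = np.array([0, 1])  # partial_fit needs the full label set

    batches = iter_minibatches(document_stream(40), batch_size=8)

    # Hold out the first mini-batch as the validation set.
    test_texts, y_test = next(batches)
    X_test = vectorizer.transform(test_texts)

    accuracy_history = []
    for train_texts, y_train in batches:
        X_train = vectorizer.transform(train_texts)
        classifier.partial_fit(X_train, y_train, classes=all_classes)
        accuracy_history.append(classifier.score(X_test, y_test))

    print(accuracy_history)

Only one mini-batch of raw text is held in memory at a time, which is what
bounds memory consumption; ``accuracy_history`` holds one validation score per
training batch and is the kind of series a learning curve would plot.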