sklearn.datasets.dump_svmlight_file(X, y, f, *, zero_based=True, comment=None, query_id=None, multilabel=False)[source]#

Dump the dataset in svmlight / libsvm file format.

This format is a text-based format, with one sample per line. It does not store zero valued features hence is suitable for sparse dataset.

The first element of each line can be used to store a target variable to predict.

X{array-like, sparse matrix} of shape (n_samples, n_features)

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y{array-like, sparse matrix}, shape = (n_samples,) or (n_samples, n_labels)

Target values. Class labels must be an integer or float, or array-like objects of integer or float for multilabel classifications.

fstr or file-like in binary mode

If string, specifies the path that will contain the data. If file-like, data will be written to f. f should be opened in binary mode.

zero_basedbool, default=True

Whether column indices should be written zero-based (True) or one-based (False).

commentstr or bytes, default=None

Comment to insert at the top of the file. This should be either a Unicode string, which will be encoded as UTF-8, or an ASCII byte string. If a comment is given, then it will be preceded by one that identifies the file as having been dumped by scikit-learn. Note that not all tools grok comments in SVMlight files.

query_idarray-like of shape (n_samples,), default=None

Array containing pairwise preference constraints (qid in svmlight format).

multilabelbool, default=False

Samples may have several labels each (see https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html).

Added in version 0.17: parameter multilabel to support multilabel datasets.


>>> from sklearn.datasets import dump_svmlight_file, make_classification
>>> X, y = make_classification(random_state=0)
>>> output_file = "my_dataset.svmlight"
>>> dump_svmlight_file(X, y, output_file)