load_svmlight_file(f, *, n_features=None, dtype=<class 'numpy.float64'>, multilabel=False, zero_based='auto', query_id=False, offset=0, length=-1)
Load datasets in the svmlight / libsvm format into a sparse CSR matrix.
This is a text-based format with one sample per line. It does not store zero-valued features and is therefore suitable for sparse datasets.
The first element of each line can be used to store a target variable to predict.
This format is used as the default format for both svmlight and the libsvm command line programs.
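As a minimal sketch of what such a file looks like, the snippet below parses three hand-written samples from an in-memory buffer. The sample values are made up for illustration, and the indices are assumed to be one-based, so zero_based=False is passed explicitly.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Each line is "<target> <index>:<value> ...". Features that are zero are
# simply left out. The values below are illustrative only.
raw = b"1 1:0.5 3:2.0\n-1 2:1.5\n1 1:1.0 2:1.0 3:1.0\n"

# A file-like object must be opened in binary mode, hence BytesIO.
X, y = load_svmlight_file(BytesIO(raw), zero_based=False)

dense = X.toarray()  # densify only to inspect this tiny example
print(X.shape)
print(y)
print(dense)
```

The omitted features come back as zeros in the densified array, and the first token of each line ends up in y.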
Parsing a text-based source can be expensive. When working repeatedly on the same dataset, it is recommended to wrap this loader with joblib.Memory.cache to store a memmapped backup of the CSR results of the first call and benefit from the near instantaneous loading of memmapped structures for the subsequent calls.
If the file contains pairwise preference constraints (known as “qid” in the svmlight format), they are ignored unless the query_id parameter is set to True. These constraints can be used to constrain the combination of samples when using pairwise loss functions (as is the case in some learning to rank problems) so that only pairs with the same query_id value are considered.
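The sketch below illustrates the query_id behaviour on a small in-memory sample. The data and the pair-building step are assumptions made for illustration; the pair construction is not part of the loader, it only shows how the returned array could restrict comparisons to samples within the same query.

```python
from io import BytesIO
from itertools import combinations

from sklearn.datasets import load_svmlight_file

# Illustrative ranking data: "<relevance> qid:<query> <index>:<value> ...".
raw = b"3 qid:1 1:1.0 2:0.5\n1 qid:1 1:0.2\n2 qid:2 2:0.9\n"

# With query_id=True a third array is returned alongside X and y.
X, y, qid = load_svmlight_file(BytesIO(raw), query_id=True, zero_based=False)

# Hypothetical downstream step: keep only pairs of samples that share a
# query and have different targets, as a pairwise loss would require.
pairs = [
    (i, j)
    for i, j in combinations(range(len(y)), 2)
    if qid[i] == qid[j] and y[i] != y[j]
]
print(qid)
print(pairs)
```

Samples 0 and 1 share a query and form a pair, while sample 2 belongs to a different query and is never compared with them.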
This implementation is written in Cython and is reasonably fast. However, a faster API-compatible loader is also available at:
- f : str, file-like or int
(Path to) a file to load. If a path ends in “.gz” or “.bz2”, it will be uncompressed on the fly. If an integer is passed, it is assumed to be a file descriptor. A file-like or file descriptor will not be closed by this function. A file-like object must be opened in binary mode.
- n_features : int, default=None
The number of features to use. If None, it will be inferred. This argument is useful to load several files that are subsets of a bigger sliced dataset: each subset might not have examples of every feature, hence the inferred shape might vary from one slice to another. n_features is only required if offset or length is passed a non-default value.
- dtype : numpy data type, default=np.float64
Data type of dataset to be loaded. This will be the data type of the output numpy arrays.
- multilabel : bool, default=False
Samples may have several labels each (see https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html).
- zero_based : bool or “auto”, default=“auto”
Whether column indices in f are zero-based (True) or one-based (False). If column indices are one-based, they are transformed to zero-based to match Python/NumPy conventions. If set to “auto”, a heuristic check is applied to determine this from the file contents. Both kinds of files occur “in the wild”, but they are unfortunately not self-identifying. Using “auto” or True should always be safe when no offset or length is passed. If offset or length is passed, the “auto” mode falls back to zero_based=True to avoid having the heuristic check yield inconsistent results on different segments of the file.
- query_id : bool, default=False
If True, will return the query_id array for each file.
- offset : int, default=0
Ignore the offset first bytes by seeking forward, then discarding the following bytes up until the next new line character.
- length : int, default=-1
If strictly positive, stop reading any new line of data once the position in the file has reached the (offset + length) bytes threshold.
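To illustrate how offset, length, n_features and zero_based interact, here is a minimal sketch that reads one buffer in two byte ranges and stacks the results. The data, the split point and the load_chunk helper are hypothetical. n_features and zero_based are passed explicitly so that both chunks get the same shape and index convention.

```python
from io import BytesIO

import numpy as np
import scipy.sparse as sp

from sklearn.datasets import load_svmlight_file

# Illustrative data with one-based indices and four features in total.
raw = b"1 1:1.0 4:2.0\n0 2:3.0\n1 3:4.0\n0 1:5.0 2:6.0\n"
n_features = 4
mid = len(raw) // 2  # arbitrary split point, need not fall on a line end


def load_chunk(offset, length):
    """Hypothetical helper: load one byte range of the buffer."""
    return load_svmlight_file(
        BytesIO(raw),
        n_features=n_features,  # required once offset or length is set
        zero_based=False,       # explicit, so no chunk relies on "auto"
        offset=offset,
        length=length,
    )


X1, y1 = load_chunk(0, mid)    # from the start up to the byte threshold
X2, y2 = load_chunk(mid, -1)   # from the threshold to the end of the data

X = sp.vstack([X1, X2])
y = np.concatenate([y1, y2])

# Reference: the same data loaded in a single call.
X_full, y_full = load_svmlight_file(
    BytesIO(raw), n_features=n_features, zero_based=False
)
print(np.array_equal(X.toarray(), X_full.toarray()))
```

Because a chunk with a positive offset skips ahead to the next newline, and the previous chunk stops only once its position has passed the threshold, each line should be picked up by exactly one of the two calls.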
- X : scipy.sparse matrix of shape (n_samples, n_features)
- y : ndarray of shape (n_samples,), or, in the multilabel case, a list of tuples of length n_samples
- query_id : array of shape (n_samples,)
query_id for each sample. Only returned when query_id is set to True.
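The difference in the return type of y under multilabel=True can be seen in the short sketch below. The sample data is made up, and the conversion with MultiLabelBinarizer is one possible follow-up step, assumed here for illustration and not performed by the loader itself.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative multilabel data: labels are comma-separated before the features.
raw = b"1,3 1:1.0\n2 2:1.0\n"

X, y = load_svmlight_file(BytesIO(raw), multilabel=True, zero_based=False)
print(y)  # a list with one tuple of labels per sample

# Assumed follow-up: turn the tuples into a binary indicator matrix,
# one column per distinct label.
Y = MultiLabelBinarizer().fit_transform(y)
print(Y)
```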
similar function for loading multiple files in this format, enforcing the same number of features/columns on all of them.
To use joblib.Memory to cache the svmlight file:
from joblib import Memory
from sklearn.datasets import load_svmlight_file

mem = Memory("./mycache")

@mem.cache
def get_data():
    # Parsed once; later calls are served from the on-disk cache.
    data = load_svmlight_file("mysvmlightfile")
    return data[0], data[1]

X, y = get_data()