sklearn.datasets.make_multilabel_classification(n_samples=100, n_features=20, n_classes=5, n_labels=2, length=50, allow_unlabeled=True, return_indicator=False, random_state=None)

Generate a random multilabel classification problem.

For each sample, the generative process is:
  • pick the number of labels: n ~ Poisson(n_labels)
  • n times, choose a class c: c ~ Multinomial(theta)
  • pick the document length: k ~ Poisson(length)
  • k times, choose a word: w ~ Multinomial(theta_c)

In the above process, rejection sampling is used to make sure that n is never more than n_classes and that the document length is never zero; n is also never zero unless allow_unlabeled is True. Likewise, classes that have already been chosen are rejected.
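The process above can be sketched for a single sample with NumPy. This is an illustrative sketch, not the library's implementation: the helper name sample_document, the Dirichlet-drawn theta and theta_c, the rule that each word first picks one of the document's labels uniformly at random, and the uniform word choice for unlabeled documents are all assumptions made for the example.

```python
import numpy as np


def sample_document(rng, class_probs, word_probs, n_labels=2, length=50,
                    allow_unlabeled=True):
    """Draw one (features, labels) pair; hypothetical helper, not sklearn code."""
    n_classes, n_features = word_probs.shape

    # Number of labels: reject draws above n_classes, and reject zero
    # unless unlabeled samples are allowed.
    n = n_classes + 1
    while n > n_classes or (n == 0 and not allow_unlabeled):
        n = rng.poisson(n_labels)

    # Choose n distinct classes, rejecting classes already chosen.
    labels = []
    while len(labels) < n:
        c = int(rng.choice(n_classes, p=class_probs))
        if c not in labels:
            labels.append(c)

    # Document length: reject zero.
    k = 0
    while k == 0:
        k = rng.poisson(length)

    # Draw k words and count them per feature.
    x = np.zeros(n_features, dtype=int)
    for _ in range(k):
        if labels:
            # Assumption: each word comes from one of the sample's classes.
            c = labels[rng.integers(len(labels))]
            w = rng.choice(n_features, p=word_probs[c])
        else:
            # Assumption: unlabeled samples draw words uniformly.
            w = rng.integers(n_features)
        x[w] += 1
    return x, sorted(labels)


rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(5))              # class probabilities (assumed)
theta_c = rng.dirichlet(np.ones(20), size=5)   # per-class word probabilities (assumed)
x, labels = sample_document(rng, theta, theta_c)
print(x.sum(), labels)
```

The feature vector sums to the sampled document length k, which is why the length parameter is described as the sum of the features.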

Parameters :

n_samples : int, optional (default=100)

The number of samples.

n_features : int, optional (default=20)

The total number of features.

n_classes : int, optional (default=5)

The number of classes of the classification problem.

n_labels : int, optional (default=2)

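The average number of labels per instance. More precisely, the number of labels per sample is drawn from a Poisson distribution with n_labels as its expected value; draws larger than n_classes are rejected, and a draw of zero is rejected if allow_unlabeled is False.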

length : int, optional (default=50)

The sum of the features (the number of words, if the samples are documents), drawn from a Poisson distribution with this expected value.

allow_unlabeled : bool, optional (default=True)

If True, some instances might not belong to any class.

return_indicator : bool, optional (default=False)

If True, return Y in the binary indicator format, else return a tuple of lists of labels.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

Returns :

X : array of shape [n_samples, n_features]

The generated samples.

Y : tuple of lists or array of shape [n_samples, n_classes]

The label sets.
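
A short usage sketch, assuming scikit-learn and NumPy are installed and that the installed release accepts the parameters documented here. return_indicator=True is passed explicitly so that Y comes back as a binary indicator array rather than relying on the default, and a fixed random_state makes the draw repeatable.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

# Ten samples, twenty features, five classes, about two labels per sample.
X, Y = make_multilabel_classification(
    n_samples=10, n_features=20, n_classes=5, n_labels=2,
    return_indicator=True, random_state=0,
)

# The same seed should reproduce the same data.
X_again, Y_again = make_multilabel_classification(
    n_samples=10, n_features=20, n_classes=5, n_labels=2,
    return_indicator=True, random_state=0,
)

print(X.shape, Y.shape)
print(np.array_equal(X, X_again), np.array_equal(Y, Y_again))
```

Each row of Y has one column per class, with a 1 wherever the sample carries that label.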