
# MNIST dataset for Outliers Detection - [ MNIST4OD ]

Version 2 2019-10-08, 16:55

Version 1 2019-10-08, 16:38

dataset

Posted on 2019-10-08, 16:55, authored by Giovanni Stilo, Bardh Prenkaj

Here we present MNIST4OD, a dataset that is large in both number of dimensions and number of instances, suitable for the outlier detection task.

The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).

We build MNIST4OD in the following way:

To distinguish between outliers and inliers, we choose the images belonging to one digit (e.g. digit 1) as inliers, and we sample uniformly at random from the remaining images to obtain the outliers, such that the number of outliers equals 10% of the number of inliers. We repeat this dataset generation process for every digit.
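The generation step described above can be sketched as follows. This is only an illustration, not the authors' code: the function name `sample_outlier_dataset`, the fixed seed, sampling without replacement, and rounding the outlier count up are all assumptions that the description does not state.

```python
import numpy as np

def sample_outlier_dataset(labels, inlier_digit, outlier_pct=10, seed=0):
    """Pick every image of one digit as an inlier and sample outliers from the rest.

    Returns (indices, is_outlier), where `indices` point into `labels`.
    Hypothetical helper: the original generation script is not published here.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    inlier_idx = np.flatnonzero(labels == inlier_digit)
    other_idx = np.flatnonzero(labels != inlier_digit)
    # Assumption: the outlier count is 10% of the inlier count, rounded up.
    # Integer arithmetic avoids floating-point surprises in the ceiling.
    n_out = (len(inlier_idx) * outlier_pct + 99) // 100
    # Assumption: uniform sampling without replacement.
    outlier_idx = rng.choice(other_idx, size=n_out, replace=False)
    indices = np.concatenate([inlier_idx, outlier_idx])
    is_outlier = np.concatenate(
        [np.zeros(len(inlier_idx), dtype=bool), np.ones(n_out, dtype=bool)]
    )
    return indices, is_outlier

# Toy labels standing in for the MNIST labels: 50 images per digit.
labels = np.repeat(np.arange(10), 50)
idx, out = sample_outlier_dataset(labels, inlier_digit=1)
print(len(idx), int(out.sum()))
```

Calling the function once per digit (0 to 9) would mirror the repetition over all digits.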

For implementation simplicity, we then flatten each image (28 × 28) into a vector of 784 values.
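As a small illustration of the flattening step, the sketch below reshapes a stack of 28 × 28 arrays into 784-dimensional vectors. The placeholder images are made up, and row-major ordering is an assumption; the description does not say in which order the pixels were laid out.

```python
import numpy as np

# Placeholder stack of three 28 x 28 images (not real MNIST data).
images = np.zeros((3, 28, 28), dtype=np.uint8)
images[0, 2, 5] = 255  # mark one pixel so its new position can be traced

# Flatten each image into one vector of 28 * 28 = 784 values.
flat = images.reshape(len(images), -1)

# Assumption: row-major order, so pixel (row, col) lands at index row * 28 + col.
print(flat.shape, flat[0, 2 * 28 + 5])
```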

Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x.

Each line of a file contains one instance (vector), and the last column holds the outlier label (yes/no) of the data point. The data also contains a column that indicates the original image class (0-9).
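A minimal reading sketch for this layout is given below, using only the Python standard library. Several details are assumptions rather than documented facts: that the files are comma-separated with no header row, that the 784 pixel values come first, and that the class column sits directly after them. The helper name `read_mnist4od` is hypothetical, and the demo file is a tiny stand-in with 4 "pixels" per row, not a real MNIST4OD file.

```python
import csv
import gzip
import os
import tempfile

def read_mnist4od(path, n_pixels=784):
    """Read a gzipped CSV into (pixels, classes, outlier_flags).

    Assumed layout per row: n_pixels values, then the original class,
    with the outlier label in the last column. Labels are kept as raw
    strings because their exact encoding should be checked on the real files.
    """
    pixels, classes, outlier_flags = [], [], []
    with gzip.open(path, "rt", newline="") as fh:
        for row in csv.reader(fh):
            if not row:
                continue  # skip blank lines
            pixels.append([float(v) for v in row[:n_pixels]])
            classes.append(row[n_pixels])   # assumed position of the class column
            outlier_flags.append(row[-1])   # last column, as described above
    return pixels, classes, outlier_flags

# Build a tiny stand-in file so the sketch runs without the real data.
demo_path = os.path.join(tempfile.mkdtemp(), "MNIST_demo.csv.gz")
with gzip.open(demo_path, "wt", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow([0, 255, 0, 12, 1, "no"])
    writer.writerow([9, 0, 31, 0, 7, "yes"])

pixels, classes, flags = read_mnist4od(demo_path, n_pixels=4)
print(classes, flags)
```

If the real files turn out to carry a header row or a different column order, the slicing inside the loop would need to be adjusted accordingly.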

The following table lists the statistics of each dataset:

| Name | Instances | Dimensions | Number of Outliers in % |
|---|---|---|---|
| MNIST_0 | 7594 | 784 | 10 |
| MNIST_1 | 8665 | 784 | 10 |
| MNIST_2 | 7689 | 784 | 10 |
| MNIST_3 | 7856 | 784 | 10 |
| MNIST_4 | 7507 | 784 | 10 |
| MNIST_5 | 6945 | 784 | 10 |
| MNIST_6 | 7564 | 784 | 10 |
| MNIST_7 | 8023 | 784 | 10 |
| MNIST_8 | 7508 | 784 | 10 |
| MNIST_9 | 7654 | 784 | 10 |
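Because the 10% figure is defined relative to the inliers, the instance counts listed above can be split into inliers and outliers with a little arithmetic. The sketch below does this under the assumption that the outlier count is 10% of the inlier count rounded up; the rounding rule is not stated in the description, and `split_counts` is a hypothetical helper.

```python
def split_counts(total, outlier_pct=10):
    """Find (n_inliers, n_outliers) with n_outliers = ceil(outlier_pct% of n_inliers).

    Assumption: the outlier count is rounded up. The sum
    n_inliers + n_outliers grows strictly with n_inliers, so at most
    one split can match a given total.
    """
    for n_in in range(total + 1):
        n_out = (n_in * outlier_pct + 99) // 100  # integer ceiling
        if n_in + n_out == total:
            return n_in, n_out
    raise ValueError(f"no split matches total={total}")

totals = {"MNIST_0": 7594, "MNIST_1": 8665, "MNIST_2": 7689}
for name, total in totals.items():
    n_in, n_out = split_counts(total)
    print(name, n_in, n_out, round(100 * n_out / total, 1))
```

Under this assumption, outliers make up roughly 9.1% of all instances in a file, since 10% of the inliers corresponds to about one eleventh of the total.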