MNIST dataset for Outliers Detection - [ MNIST4OD ]
Here we present MNIST4OD, a dataset that is large in both number of dimensions and number of instances and is suitable for the outlier detection task.
The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).
We build MNIST4OD in the following way:
To distinguish between outliers and inliers, we take all images of one digit (e.g. digit 1) as inliers and sample outliers uniformly at random from the remaining images, such that the number of outliers equals 10% of the number of inliers. We repeat this generation process for every digit, which yields ten datasets.
For implementation simplicity, we then flatten each image (28 x 28 pixels) into a vector of 784 values.
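The construction described above can be sketched in Python as follows. This is a minimal illustration, not the authors' original script: the function name build_od_dataset, its parameters, the fixed random seed, and the choice to round the outlier count up are all assumptions, and the images and labels arrays are expected to be supplied by the reader (for example, the combined MNIST images and their digit labels).

```python
import numpy as np


def build_od_dataset(images, labels, inlier_digit, outlier_percent=10, seed=0):
    """Return (X, original_class, is_outlier) for one inlier digit.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    images = np.asarray(images)
    labels = np.asarray(labels)

    # All images of the chosen digit are inliers.
    inlier_idx = np.flatnonzero(labels == inlier_digit)
    other_idx = np.flatnonzero(labels != inlier_digit)

    # Outlier count is 10% of the inlier count; rounding up is an assumption.
    n_outliers = (inlier_idx.size * outlier_percent + 99) // 100

    # Uniform sampling without replacement from all other digits.
    outlier_idx = rng.choice(other_idx, size=n_outliers, replace=False)

    idx = np.concatenate([inlier_idx, outlier_idx])
    # Flatten each 28 x 28 image into a vector of 784 values.
    X = images[idx].reshape(idx.size, -1)
    original_class = labels[idx]
    is_outlier = original_class != inlier_digit
    return X, original_class, is_outlier


# Small synthetic stand-in for MNIST: 200 blank images, 20 per digit.
demo_images = np.zeros((200, 28, 28))
demo_labels = np.arange(200) % 10
X, original_class, is_outlier = build_od_dataset(demo_images, demo_labels, inlier_digit=1)
```

With the synthetic input above there are 20 inliers, so the sketch draws 2 outliers and X has 22 rows of 784 values each.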
Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x.
Each line of a file contains one instance (vector). The last column holds the outlier label (yes/no) of the data point, and a further column gives the original image class (0-9).
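A file in this format could be read with a sketch like the one below, which uses only the Python standard library and NumPy. Several details are assumptions, since they are not stated here: that the original-class column is the second-to-last column, that the files have no header row (the has_header flag is there in case they do), and that the fields are comma-separated. The function name load_mnist4od is hypothetical.

```python
import csv
import gzip

import numpy as np


def load_mnist4od(path, has_header=False):
    """Read one MNIST_x.csv.gz file into (X, original_class, outlier_label)."""
    # gzip.open in text mode decompresses on the fly.
    with gzip.open(path, mode="rt", newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]
    if has_header:
        rows = rows[1:]

    # Assumed layout: pixel values, then original class, then outlier label.
    X = np.array([row[:-2] for row in rows], dtype=float)
    original_class = np.array([int(float(row[-2])) for row in rows])
    # Kept as raw strings because the exact label encoding is not specified.
    outlier_label = np.array([row[-1] for row in rows])
    return X, original_class, outlier_label


# Example call (requires the file to be present locally):
# X, original_class, outlier_label = load_mnist4od("MNIST_1.csv.gz")
```

Checking that X has 784 columns after loading is a quick way to confirm that the assumed column layout matches the actual files.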
The statistics of each dataset are listed below ( Name | Instances | Dimensions | Number of Outliers in % of the inliers ):
MNIST_0 | 7594 | 784 | 10
MNIST_1 | 8665 | 784 | 10
MNIST_2 | 7689 | 784 | 10
MNIST_3 | 7856 | 784 | 10
MNIST_4 | 7507 | 784 | 10
MNIST_5 | 6945 | 784 | 10
MNIST_6 | 7564 | 784 | 10
MNIST_7 | 8023 | 784 | 10
MNIST_8 | 7508 | 784 | 10
MNIST_9 | 7654 | 784 | 10
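Because the 10% figure is defined relative to the number of inliers, the share of outliers among all rows of a file is somewhat lower. The following sketch derives approximate inlier and outlier counts from a row count in the table; the function name implied_counts is hypothetical, and the result is approximate because the exact rounding used during generation is not stated.

```python
def implied_counts(total_instances, outlier_percent=10):
    """Split a file's row count into approximate (inliers, outliers)."""
    # total = inliers + outliers and outliers ~ inliers * p / 100,
    # so inliers ~ total * 100 / (100 + p).
    inliers = round(total_instances * 100 / (100 + outlier_percent))
    outliers = total_instances - inliers
    return inliers, outliers


total = 8665  # MNIST_1 row count from the table
inliers, outliers = implied_counts(total)
share_of_all_rows = 100 * outliers / total
```

For MNIST_1 this gives roughly 7877 inliers and 788 outliers, so outliers make up about 9.1% of all instances in the file.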