I’ve been getting back into machine learning so I can hopefully still have a job in 5 years. When I first played around with ML a couple of years ago, all of the introductory tutorials used the Fashion-MNIST dataset. For example, this is a snippet from TensorFlow’s image classification tutorial.

fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

While this tutorial is supposed to be a quick start, it glosses over a huge part of the learning process if you’re trying to learn how these models are built from scratch. What does the raw data actually look like? How are the labels structured? How is this labeled data parsed and put into a usable format?

When I set out to answer these questions by just downloading the dataset itself, I was left with more questions. The README.md in the official Fashion-MNIST git repo tells us to load the dataset with the following lines:

import mnist_reader
X_train, y_train = mnist_reader.load_mnist('data/fashion', kind='train')
X_test, y_test = mnist_reader.load_mnist('data/fashion', kind='t10k')

These are the files in data/fashion

fashion
├── t10k-images-idx3-ubyte.gz
├── t10k-labels-idx1-ubyte.gz
├── train-images-idx3-ubyte.gz
└── train-labels-idx1-ubyte.gz

When you unzip these files and run the file command on them, it only describes them as data:

gunzip t10k-images-idx3-ubyte.gz
file t10k-images-idx3-ubyte   
# t10k-images-idx3-ubyte: data

The dataset is supposed to consist of 70,000 28x28 greyscale images (60,000 for training and 10,000 for testing) and their respective labels. But everywhere you download the dataset, the actual images seem to be missing.

The official Fashion-MNIST paper on the dataset references these ubyte.gz files as well but doesn’t explain them. Perhaps they’re using the format of the original MNIST dataset?

I couldn’t find any papers for the original handwritten MNIST dataset, but I did find a format specification on the website where the original dataset was posted. (I used an archive.org snapshot because https://yann.lecun.com/exdb/mnist is sometimes blocked.)

An interesting quote from this page:

These files are not in any standard image format. You have to write your own (very simple) program to read them.

At the bottom, the page describes what the “IDX” format consists of:

THE IDX FILE FORMAT

the IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.

The basic format is

magic number
size in dimension 0
size in dimension 1
size in dimension 2
.....
size in dimension N
data

The magic number is an integer (MSB first). The first 2 bytes are always 0.

The third byte codes the type of the data:
0x08: unsigned byte
0x09: signed byte
0x0B: short (2 bytes)
0x0C: int (4 bytes)
0x0D: float (4 bytes)
0x0E: double (8 bytes)

The 4-th byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices…. The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors). The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.
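To make the spec above a bit more concrete, here is a minimal sketch of a reader for it using only Python's struct module. The parse_idx name and the sample bytes are made up for illustration, and it assumes multi-byte data values are stored big-endian like the header, which the quoted text does not spell out.

```python
import struct

# Type codes from the third byte of the magic number,
# mapped to the matching struct format characters.
IDX_TYPES = {
    0x08: "B",  # unsigned byte
    0x09: "b",  # signed byte
    0x0B: "h",  # short (2 bytes)
    0x0C: "i",  # int (4 bytes)
    0x0D: "f",  # float (4 bytes)
    0x0E: "d",  # double (8 bytes)
}


def parse_idx(raw: bytes):
    """Return (dims, values) for the raw bytes of an uncompressed IDX file."""
    zero1, zero2, type_code, ndim = struct.unpack_from(">BBBB", raw, 0)
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX file: first two bytes must be 0")
    fmt = IDX_TYPES[type_code]
    # One 4-byte big-endian integer per dimension follows the magic number.
    dims = struct.unpack_from(f">{ndim}I", raw, 4)
    count = 1
    for size in dims:
        count *= size
    # Assumption: data values use the same byte order as the header.
    values = struct.unpack_from(f">{count}{fmt}", raw, 4 + 4 * ndim)
    return dims, values


# A made-up 2x3 unsigned-byte matrix, not real dataset content.
sample = bytes([0, 0, 0x08, 2]) + struct.pack(">II", 2, 3) + bytes([1, 2, 3, 4, 5, 6])
dims, values = parse_idx(sample)
print(dims, values)  # (2, 3) (1, 2, 3, 4, 5, 6)
```

The values come back as one flat tuple; because the last dimension changes fastest, the first 3 values are row 0 and the next 3 are row 1.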

The page also provides a breakdown of the dataset files in terms of this format:

TRAINING SET IMAGE FILE (train-images-idx3-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black). 

My understanding of this

Assuming Fashion-MNIST and the original MNIST dataset are encoded in the same format, things start making sense when we look at the first line of a hexdump of the unzipped train-images-idx3-ubyte:

00000000: 0000 0803 0000 ea60 0000 001c 0000 001c  .......`........
  • Magic Number: 0000 0803
    • 08 signifies the data type as unsigned byte, meaning our values should be between 0 and 255, signifying the greyscale value of a pixel
    • 03 indicates that there are three dimensions to the data. These dimensions are the number of images, the rows and the columns
  • Number of images: 0000 ea60
    • 0xea60 translates to 60000, which lines up with the fact that this file should contain the training images
  • Number of Rows: 0000 001c
    • 0x1c translates to 28, which is the number of pixel rows in each image
  • Number of Columns: 0000 001c
    • 0x1c translates to 28, which is the number of pixel columns in each image

After these first 16 bytes, I’m assuming every subsequent byte is an integer between 0 and 255 and every 784 bytes (28 x 28) make up one image.
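As a quick sanity check of that reading, the hexdump line can be decoded with Python's struct module. This is only a sketch: the image_offset helper is a made-up name, and the layout it assumes is the one described above.

```python
import struct

# First 16 bytes of train-images-idx3-ubyte, copied from the hexdump line above.
header = bytes.fromhex("0000 0803 0000 ea60 0000 001c 0000 001c")

# ">" means big-endian (MSB first); "I" is a 4-byte unsigned integer.
magic, num_images, rows, cols = struct.unpack(">4I", header)

print(hex(magic))   # 0x803
print(num_images)   # 60000
print(rows, cols)   # 28 28

# Each image occupies rows * cols bytes right after the 16-byte header.
image_size = rows * cols


def image_offset(i):
    """Byte offset at which image number i starts in the file."""
    return 16 + i * image_size


print(image_size, image_offset(0), image_offset(1))  # 784 16 800
# Expected total size of the uncompressed file in bytes.
print(16 + num_images * image_size)  # 47040016
```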

A similar format is used for the labels file, except the data has only one dimension, and each byte after the 8-byte header is the label for the corresponding 784-byte image in the images file.

labels hex dump

00000000: 0000 0801 0000 ea60 0900 0003 0002 0702  .......`........
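The same kind of check works on this line. A small sketch, assuming the first 8 bytes are the header (magic number plus label count) and everything after that is one label per byte:

```python
import struct

# The 16 bytes shown in the labels hexdump line above.
first_line = bytes.fromhex("0000 0801 0000 ea60 0900 0003 0002 0702")

# Header: magic number and number of labels, both 4-byte big-endian integers.
magic, num_labels = struct.unpack(">2I", first_line[:8])
print(hex(magic), num_labels)  # 0x801 60000

# Iterating over a bytes object yields integers, one per label.
labels = list(first_line[8:])
print(labels)  # [9, 0, 0, 3, 0, 2, 7, 2]
```

The 01 at the end of the magic number says the data has one dimension, and the remaining 8 bytes on this line are the labels of the first 8 training images.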

To understand how these are parsed into usable data let’s look at the load_mnist function seen previously.

def load_mnist(path, kind='train'):
    import os
    import gzip
    import numpy as np

    """Load MNIST data from `path`"""
    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte.gz'
                               % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte.gz'
                               % kind)

    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)

    return images, labels

After decompressing the file with gzip, the labels are read into the labels variable from the bytes object returned by read(). The numpy.frombuffer function builds a numpy.ndarray from those bytes, and the offset is set to 8 to skip the header (the 4-byte magic number and the 4-byte label count).

labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                        offset=8)
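To see what offset actually does, here is a small sketch with a hand-built stand-in for the decompressed labels file. The fake_file bytes are invented for the example (a real header followed by only 5 labels), not read from the dataset.

```python
import struct

import numpy as np

# Hypothetical stand-in for the labels file: 8-byte header + 5 label bytes.
fake_file = struct.pack(">II", 0x00000801, 5) + bytes([9, 0, 0, 3, 0])

# Without an offset, the header bytes end up in the array too.
with_header = np.frombuffer(fake_file, dtype=np.uint8)
print(with_header)  # [0 0 8 1 0 0 0 5 9 0 0 3 0]

# With offset=8, reading starts right after the header.
labels = np.frombuffer(fake_file, dtype=np.uint8, offset=8)
print(labels)  # [9 0 0 3 0]

# frombuffer does not copy: the array is a read-only view of the bytes object.
print(labels.flags.writeable)  # False
```

If you need to modify the array later, calling .copy() on it gives you a writable one.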

The same is done for the images variable, but the offset starts at 16 (the magic number plus three 4-byte dimension sizes). It’s “reshaped” using the length of labels, which is 60000 and represents the number of images in the dataset, and 784, which is the number of pixels each image contains.

before reshape

images.shape: (47040000,)

after reshape

images.shape: (60000, 784)

This was confusing for me to understand, but you can think about it like taking a 1D array with 47040000 elements and splitting it into a 2D array with 60000 rows, each with 784 elements.

foo = np.array([1, 2, 3, 4, 5, 6])
bar = foo.reshape(3,2)
print(bar)
""" output
[[1 2]
 [3 4]
 [5 6]]
"""

The examples from NumPy’s docs might help too.
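One more way to look at it, as a rough sketch with made-up numbers: each row of the reshaped array is just the next block of consecutive values from the flat array, and a single row can be reshaped again into a 2D picture.

```python
import numpy as np

# Made-up data: 3 tiny "images" of 4 pixels each, stored back to back.
flat = np.arange(12, dtype=np.uint8)
images = flat.reshape(3, 4)

# Row i of the reshaped array is the i-th block of 4 consecutive values.
print(images[1])  # [4 5 6 7]
print(flat[4:8])  # [4 5 6 7]

# -1 asks NumPy to work out that dimension from the total size.
print(flat.reshape(-1, 4).shape)  # (3, 4)

# A single row can be reshaped again (2x2 here, 28x28 for the real data).
print(images[1].reshape(2, 2))
# [[4 5]
#  [6 7]]
```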

Looking at the data

If we go back to our two data-reading examples from earlier:

from the Fashion-MNIST git repo

import mnist_reader
X_train, y_train = mnist_reader.load_mnist('data/fashion', kind='train')
X_test, y_test = mnist_reader.load_mnist('data/fashion', kind='t10k')

from TensorFlow

fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

TensorFlow does a better job of explaining what we are getting from the dataset, so let’s rename the variables in the Fashion-MNIST snippet and print out their shapes.

train_images, train_labels = mnist_reader.load_mnist('data/fashion', kind='train')
test_images, test_labels = mnist_reader.load_mnist('data/fashion', kind='t10k')

print(f"train_images.shape: {train_images.shape}")
print(f"train_labels.shape: {train_labels.shape}")
print(f"test_images.shape: {test_images.shape}")
print(f"test_labels.shape: {test_labels.shape}")

As you can see, the output shows our four arrays, ready for training and testing. Note that this is a 60,000/10,000 split (roughly 86/14) rather than the classic 80/20.

train_images.shape: (60000, 784)
train_labels.shape: (60000,)
test_images.shape: (10000, 784)
test_labels.shape: (10000,)

If we run print(train_images[0]), we can see an array that looks like it could be an image of something.

In order to view this image we need matplotlib’s imshow function, which can take a numpy.ndarray and render it as an image. However, if we just run the following, it will fail:

import matplotlib.pyplot as plt
plt.imshow(train_images[0])
plt.show()
# TypeError: Invalid shape (784,) for image data

This is because train_images[0] is still a 1D array with 784 pixel values which would make for a goofy image. To view the image we need to reshape the array into a 28x28 2D array which will represent our image at the correct aspect ratio.

plt.imshow(train_images[0].reshape(28,28))
plt.show()

This is good, but let’s convert it to greyscale with cmap='gray' and use the same index in train_labels to show the label:

plt.imshow(train_images[0].reshape(28,28), cmap='gray')
plt.title(f"Label: {train_labels[0]}")
plt.show()

One more thing: the labels are stored as integers, and we need to convert them to strings based on this key:

Label  Description
0      T-shirt/top
1      Trouser
2      Pullover
3      Dress
4      Coat
5      Sandal
6      Shirt
7      Sneaker
8      Bag
9      Ankle boot

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
plt.imshow(train_images[0].reshape(28,28), cmap='gray')
plt.title(f"Label: {class_names[train_labels[0]]}")
plt.show()
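The same lookup works for a whole batch of labels. As a small sketch, here are the eight label bytes that were visible in the labels hexdump line earlier, mapped through the key (first_labels is typed in by hand here rather than loaded from the file):

```python
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# The eight label bytes that follow the 8-byte header in the hexdump line.
first_labels = [9, 0, 0, 3, 0, 2, 7, 2]

# Each integer label is simply an index into class_names.
names = [class_names[label] for label in first_labels]
print(names)
# ['Ankle boot', 'T-shirt/top', 'T-shirt/top', 'Dress', 'T-shirt/top', 'Pullover', 'Sneaker', 'Pullover']
```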

I’m not sure why the MNIST datasets are stored in such a hacky way, but if you’re a beginner this format can be a huge obstacle, especially since it doesn’t seem like many other datasets use this hyper-specific format. The Kaggle dataset includes CSV files, which greatly simplify understanding what you’re looking at, but many tutorials opt to use the IDX format. Anyway, I hope this helps someone starting out with ML.