MNIST dataset to numpy arrays


I am following this tutorial and I understand 99% of it.

https://github.com/Azure/MachineLearningNotebooks/blob/master/tutorials/compute-instance-quickstarts/quickstart-azureml-in-10mins/quickstart-azureml-in-10mins.ipynb

I know how to do classification tasks with tabular data, but in this case the data are images, so I am confused by the following piece of code, which is not explained in the tutorial:

import glob
import os

from utils import load_data  # utils.py ships alongside the notebook

X_train = (
    load_data(
        glob.glob(
            os.path.join(data_folder, "**/train-images-idx3-ubyte.gz"), recursive=True
        )[0],
        False,
    )
    / 255.0
)
X_test = (
    load_data(
        glob.glob(
            os.path.join(data_folder, "**/t10k-images-idx3-ubyte.gz"), recursive=True
        )[0],
        False,
    )
    / 255.0
)
y_train = load_data(
    glob.glob(
        os.path.join(data_folder, "**/train-labels-idx1-ubyte.gz"), recursive=True
    )[0],
    True,
).reshape(-1)
y_test = load_data(
    glob.glob(
        os.path.join(data_folder, "**/t10k-labels-idx1-ubyte.gz"), recursive=True
    )[0],
    True,
).reshape(-1)

Can you explain what's happening here? My guess is that the images are converted into arrays of 1s and 0s which represent the image?

Or am I wrong? Any documentation links that you find relevant would be useful too.

CodePudding user response:

The MNIST dataset is a collection of images of handwritten digits. Each image is an 8-bit grayscale image of 28x28 pixels; 8-bit means that each pixel is an integer between 0 and 255.

In fact, an image isn't much different from tabular data: it is just a 2D (3D for RGB images) grid of numbers.
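To see this, here is a small sketch that treats one image as a plain numpy array (a randomly generated stand-in is used here instead of a real digit):

import numpy as np

# Stand-in for a real MNIST digit: a 28x28 grid of 8-bit integers.
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

print(image.shape)               # (28, 28) -- a 2D grid of numbers
print(image.min(), image.max())  # values somewhere in 0..255
print(image.reshape(-1).shape)   # (784,)  -- the same data as one "tabular" row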

The tutorial you are referring to says:

Note this step requires a load_data function that's included in an utils.py file. This file is placed in the same folder as this notebook. The load_data function simply parses the compressed files into numpy arrays.

So the load_data function reads your dataset files and converts them into numpy arrays. It seems that instead of reading the images as a (NumberOfImages, 28, 28) array, the method flattens each image into a single vector of length 784, making the final shape (NumberOfImages, 784). This makes sense if you don't plan to use any sort of convolutional layers in your model.
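For reference, the MNIST .gz files use the IDX format (a small big-endian header followed by raw bytes), documented at http://yann.lecun.com/exdb/mnist/. Below is a minimal sketch of what a parser like load_data might do; it is not necessarily the exact code in utils.py, and the parameter name is_labels is my own.

import gzip
import struct

import numpy as np


def load_data(path, is_labels):
    """Parse a gzipped MNIST IDX file into a numpy array (sketch only)."""
    with gzip.open(path, "rb") as gz:
        if is_labels:
            # Label file: 8-byte header (magic number, item count), then 1 byte per label.
            _, n_items = struct.unpack(">II", gz.read(8))
            return np.frombuffer(gz.read(n_items), dtype=np.uint8).reshape(n_items, 1)
        # Image file: 16-byte header (magic, count, rows, cols), then 1 byte per pixel.
        _, n_items, n_rows, n_cols = struct.unpack(">IIII", gz.read(16))
        pixels = np.frombuffer(gz.read(n_items * n_rows * n_cols), dtype=np.uint8)
        # Flatten each 28x28 image into one row of 784 values -> shape (n_items, 784).
        return pixels.reshape(n_items, n_rows * n_cols)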

Most machine learning algorithms work better if the data is normalized before use, which is why you divide the images by 255: afterwards all values lie between 0 and 1. This is a quick-and-dirty way of normalizing data, but it works fine for a small dataset such as MNIST. There are other normalization schemes, such as standardizing the data to zero mean and unit variance (weight-initialization schemes like Xavier initialization are a separate topic: they concern the model's weights, not the data).
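Concretely, dividing by 255.0 just rescales the pixel values from integers in [0, 255] to floats in [0, 1]:

import numpy as np

raw = np.array([0, 51, 255], dtype=np.uint8)  # a few example pixel values
print(raw / 255.0)  # [0.  0.2 1. ]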

When you plot the images as they do in the tutorial, you reshape this 784-element vector into a (28, 28) array, and matplotlib then displays it based on the intensity values.
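For example (a sketch, assuming X_train and y_train are the arrays built in the code you posted):

import matplotlib.pyplot as plt

# Take the first flattened image, restore its 28x28 shape, and display it in grayscale.
first_image = X_train[0].reshape(28, 28)
plt.imshow(first_image, cmap="gray")
plt.title(f"label: {y_train[0]}")
plt.show()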
