How to train on a tensorflow_datasets dataset

I'm playing around with TensorFlow to become a bit more familiar with the overall workflow. To do this, I thought I would start by creating a simple classifier for the well-known Iris dataset.

I load the dataset using:

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras

ds = tfds.load('iris', split='train', shuffle_files=True, as_supervised=True)
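
For reference, printing the dataset's element_spec shows what each element looks like; as far as I can tell, each element is a single (features, label) pair rather than a batch (the expected output is shown as a comment):

# Each element is one unbatched example: features of shape (4,), scalar label
print(ds.element_spec)
# (TensorSpec(shape=(4,), dtype=tf.float32, name=None),
#  TensorSpec(shape=(), dtype=tf.int64, name=None))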

I use the following classifier:

model = keras.Sequential([
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(3, activation="softmax")
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

I then try to fit the model using:

model.fit(ds, batch_size=50, epochs=100)

This gives the following error:

Input 0 of layer "dense" is incompatible with the layer: expected min_ndim=2, found ndim=1. Full shape received: (4,)

    Call arguments received by layer "sequential" (type Sequential):
      • inputs=tf.Tensor(shape=(4,), dtype=float32)
      • training=True
      • mask=None

I also tried defining the model using the functional API (as learning it was my original goal):

inputs = keras.Input(shape=(4,), name='features')

first_hidden = keras.layers.Dense(10, activation='relu')(inputs)
second_hidden = keras.layers.Dense(10, activation="relu")(first_hidden)

outputs = keras.layers.Dense(3, activation='softmax')(second_hidden)

model = keras.Model(inputs=inputs, outputs=outputs, name="test_iris_classification")

I now get the same error as before but this time with a warning:

WARNING:tensorflow:Model was constructed with shape (None, 4) for input KerasTensor(type_spec=TensorSpec(shape=(None, 4), dtype=tf.float32, name='features'), name='features', description="created by layer 'features'"), but it was called on an input with incompatible shape (4,).

I suspect this is something quite fundamental that I haven't understood, but I have not been able to figure it out despite several hours of googling.

PS: I also tried to download the whole dataset from the UCI Machine Learning Repository as a CSV file.

I read it in like this:

import numpy as np
import pandas as pd

ds = pd.read_csv("iris.data", header=None)
labels = []
for name in ds[4]:
    if name == "Iris-setosa":
        labels.append(0)
    elif name == "Iris-versicolor":
        labels.append(1)
    elif name == "Iris-virginica":
        labels.append(2)
    else:
        raise ValueError(f"Wrong name: {name}")
labels = np.array(labels)
features = np.array(ds[[0, 1, 2, 3]])

And fit it like this:

model.fit(features, labels, batch_size=50, epochs=100)

I'm able to fit the model to this dataset without any problems, for both the sequential and the functional API, which makes me suspect my misunderstanding has something to do with how tensorflow_datasets works.
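
To double-check that suspicion, the same error can be reproduced from these NumPy arrays by wrapping them in an unbatched tf.data pipeline, while batching makes the fit work again. A minimal sketch, reusing the features and labels arrays from above:

# Each element here is a single (features, label) example, just like the
# tfds.load version, so fitting on it fails with the same ndim=1 error:
ds_unbatched = tf.data.Dataset.from_tensor_slices((features, labels))

# Batching restores the leading batch dimension that Dense expects:
model.fit(ds_unbatched.batch(50), epochs=100)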

CodePudding user response:

Set the batch size when loading your data:

import tensorflow_datasets as tfds
import tensorflow as tf

ds = tfds.load('iris', split='train', shuffle_files=True, as_supervised=True, batch_size=10)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax")
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
model.fit(ds, epochs=100)

Also, regarding the batch_size argument of model.fit, the docs state:

Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32. Do not specify the batch_size if your data is in the form of datasets, generators, or keras.utils.Sequence instances (since they generate batches).
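
Alternatively, you can keep the tfds.load call unchanged and batch the dataset yourself with the tf.data API. A minimal sketch (the shuffle buffer of 150 covers the whole Iris training split, and prefetch is optional):

ds = tfds.load('iris', split='train', shuffle_files=True, as_supervised=True)
# Group single examples into batches of 50 so the model sees shape (None, 4):
ds = ds.shuffle(150).batch(50).prefetch(tf.data.AUTOTUNE)
model.fit(ds, epochs=100)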
