I am training a TensorFlow Keras CNN over images, with too much training data to fit into memory. I've got a tf.data.Dataset preprocessing pipeline that reads the images from HDF5 files using a dataset.map() pipeline step. Now I'm trying to normalize the numeric image data to zero mean and unit variance.
I'm following this example from this guide, except that I have that .map() in there:
import tensorflow as tf
import tensorflow_io as tfio

def load_features_from_hdf5(filename):
    spec = tf.TensorSpec(feature_shape, dtype=tf.dtypes.float32, name=None)
    dataset = tfio.IODataset.from_hdf5(filename, "/features", spec=spec)  # returns a Dataset
    feature = dataset.get_single_element()
    feature.set_shape(feature_shape)
    return feature

train_x = tf.data.Dataset.from_tensor_slices(filenames).map(load_features_from_hdf5, num_parallel_calls=tf.data.AUTOTUNE)
normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(train_x.take(1000))
train_x_normalized = normalizer(train_x)  # <-- ValueError
adapt() successfully computes the mean and variance from the dataset. But when I try to actually apply the normalization to the exact same dataset, it fails while trying to convert my ParallelMapDataset to an EagerTensor:
ValueError: Attempt to convert a value (<ParallelMapDataset shapes: (41, 682, 1), types: tf.float32>) with an unsupported type (<class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>) to a Tensor.
How can I get this working? Since the data is so large, I wouldn't think I want to make anything eager until training starts. Should I make the normalization an explicit pipeline step on the Dataset? Or an explicit layer on the model itself? (In the latter case, how can I bring the mean and variance values from training time to inference time in another process?)
CodePudding user response:
You could try something like this:
import tensorflow as tf
# Create dummy data
train_x = tf.data.Dataset.from_tensor_slices((tf.random.normal((100, 28, 28, 3)), tf.random.normal((100, 1)))).batch(10)
normalizer = tf.keras.layers.Normalization(axis=None)
# Adapt
normalizer.adapt(train_x.map(lambda x, y: x))
# Apply to images
train_x_normalized = train_x.map(lambda x, y: (normalizer(x), y))
Example:
for x, y in train_x_normalized.take(1):
print(tf.reduce_mean(x), tf.math.reduce_variance(x))
tf.Tensor(0.00930768, shape=(), dtype=float32) tf.Tensor(1.0023469, shape=(), dtype=float32)
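For an unlabeled dataset like the one in the question, the same idea should carry over: the layer expects tensors, not a Dataset object, so it has to be called inside map() rather than on the dataset itself. The following is only a minimal sketch under assumptions: random tensors stand in for the HDF5-backed features, the batch size of 16 is arbitrary, and the dataset is batched before both adapt() and map() so the layer sees inputs of the same rank in both places.

```python
import tensorflow as tf

# Stand-in for the HDF5-backed dataset in the question (assumption):
# 32 unlabeled feature tensors of shape (41, 682, 1).
features = tf.random.normal((32, 41, 682, 1), mean=5.0, stddev=3.0)
train_x = tf.data.Dataset.from_tensor_slices(features).batch(16)

# Compute a single scalar mean and variance over the whole dataset.
normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(train_x)

# Apply the layer lazily, batch by batch, as a pipeline step.
train_x_normalized = train_x.map(
    lambda x: normalizer(x), num_parallel_calls=tf.data.AUTOTUNE
)

# Check the statistics of the normalized data.
all_values = tf.concat(list(train_x_normalized), axis=0)
mean_after = float(tf.reduce_mean(all_values))
var_after = float(tf.math.reduce_variance(all_values))
print(mean_after, var_after)
```

Because the statistics here are adapted on the full dummy dataset, the printed mean and variance should come out close to 0 and 1. Nothing is materialized until the dataset is iterated.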
Or, as you mentioned in your question, you can use the normalization layer as part of your model.
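As for getting the statistics from training time to inference time, here is a rough sketch of two options, not a definitive recipe. If the adapted layer is the first layer of the model, its mean and variance are layer state and should be saved along with the model's other weights. Alternatively, the two scalars can be written to a file and passed to a new layer through the mean and variance constructor arguments. The dummy data, the tiny model, and the norm_stats.npz file name are all placeholders.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Training side: adapt on dummy batched images (stand-in for the real pipeline).
images = tf.random.normal((100, 28, 28, 3), mean=5.0, stddev=3.0)
normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(tf.data.Dataset.from_tensor_slices(images).batch(10))

# Option 1: put the adapted layer at the front of the model, so the
# statistics are part of the model's state.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 3)),
    normalizer,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),
])

# Option 2: export the two scalars to a file (hypothetical location).
stats_path = os.path.join(tempfile.mkdtemp(), "norm_stats.npz")
np.savez(
    stats_path,
    mean=np.asarray(normalizer.mean).squeeze(),
    variance=np.asarray(normalizer.variance).squeeze(),
)

# Inference side (possibly another process): rebuild the layer from the file.
stats = np.load(stats_path)
restored = tf.keras.layers.Normalization(
    axis=None, mean=float(stats["mean"]), variance=float(stats["variance"])
)

# Both layers should normalize the same input identically.
sample = images[:4]
same = np.allclose(np.asarray(normalizer(sample)), np.asarray(restored(sample)), atol=1e-5)
print("restored layer matches:", same)
```

With option 1 the raw, unnormalized data is fed to the model at both training and inference time, so there is no separate preprocessing step to keep in sync.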