Strange behaviour when passing a function into a Tensorflow dataset map method


This was working perfectly fine for me earlier today, but it suddenly started behaving very strangely when I restarted my notebook. I have a tf dataset that takes in numpy file names and their corresponding labels as input, like so: tf.data.Dataset.from_tensor_slices((specgram_files, labels)). When I take one item using for item in ds.take(1): print(item), I get the expected output: a tuple of tensors, where the first tensor contains the name of the numpy file as a bytes string and the second tensor contains the encoded label. I then have a function that reads the file with np.load() and returns a numpy array. This function is passed into the map() method, and it looks like this:

ds = ds.map(
    lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32]), label]),
    num_parallel_calls=tf.data.AUTOTUNE)

where read_npy_file looks like this:

def read_npy_file(data):
    # 'data' holds the file name of the numpy binary file storing the features
    # of a particular sound file, as a bytes string.
    # decode() is called on the bytes string to turn it into a regular string
    # so that it can be passed as a parameter into np.load().
    data = np.load(data.decode())
    return data.astype(np.float32)

As you can see, the mapping should create another tuple of tensors, where the first tensor is the numpy array and the second tensor is the label, untouched. This worked perfectly earlier, but now it gives the most bizarre behaviour. I placed print statements in the read_npy_file() function to see if the correct data was being passed in. I expected a single bytes string per call, but instead, when I call print(data) inside read_npy_file() and take one item from the dataset to trigger one mapping using ds.take(1), I get this output:

b'./challengeA_data/log_spectrogram/2603ebb3-3cd3-43cc-98ef-0c128c515863.npy'b'./challengeA_data/log_spectrogram/fab6a266-e97a-4935-a0c3-444fc4426fc5.npy'b'./challengeA_data/log_spectrogram/93014682-60a2-45bd-9c9e-7f3c97b83be9.npy'b'./challengeA_data/log_spectrogram/710f2430-5da3-4822-a252-6ad3601b92d9.npy'b'./challengeA_data/log_spectrogram/e757058c-91de-4381-8184-65f001c95647.npy'


b'./challengeA_data/log_spectrogram/38b12689-04ba-422b-a972-5856b05ca868.npy'
b'./challengeA_data/log_spectrogram/7c9ccc04-a2d2-4eec-bafd-0c97b3658c26.npy'b'./challengeA_data/log_spectrogram/c7cc3520-7218-4d07-9f0a-6bd7bb90a551.npy'



b'./challengeA_data/log_spectrogram/21f6060a-9766-4810-bd7c-0437f47ccb98.npy'

I didn't modify any formatting of the output.

I'd greatly appreciate any help. TFDS has been an absolute nightmare to work with haha.

Here's the full code

def read_npy_file(data):
    # 'data' holds the file name of the numpy binary file storing the features
    # of a particular sound file, as a bytes string.
    # decode() is called on the bytes string to turn it into a regular string
    # so that it can be passed as a parameter into np.load().
    print(data)
    data = np.load(data.decode())
    return data.astype(np.float32)

specgram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, labels))

specgram_ds = specgram_ds.map(
                    lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32]), label]),
                    num_parallel_calls=tf.data.AUTOTUNE)

num_files = len(train_df)
num_train = int(0.8 * num_files)
num_val = int(0.1 * num_files)
num_test = int(0.1 * num_files)

specgram_ds = specgram_ds.shuffle(buffer_size=1000)
specgram_train_ds = specgram_ds.take(num_train)
specgram_test_ds = specgram_ds.skip(num_train)
specgram_val_ds = specgram_test_ds.take(num_val)
specgram_test_ds = specgram_test_ds.skip(num_val)

# iterating over one item to trigger the mapping function
for item in specgram_ds.take(1):
    pass

Thanks!

CodePudding user response:

Your logic seems to be fine. You are actually just observing the behavior of tf.data.AUTOTUNE in combination with print(). According to the docs:

If the value tf.data.AUTOTUNE is used, then the number of parallel calls is set dynamically based on available CPU.

You can run the following code a few times to observe the changes:

import tensorflow as tf
import numpy as np

def read_npy_file(data):
    # 'data' holds the file name of the numpy binary file storing the features
    # of a particular sound file, as a bytes string.
    # decode() is called on the bytes string to turn it into a regular string
    # so that it can be passed as a parameter into np.load().
    print(data)
    data = np.load(data.decode())
    return data.astype(np.float32)

# Create dummy data
for i in range(4):
    np.save('{}-array'.format(i), np.random.random((5, 5)))


specgram_files = ['/content/0-array.npy', '/content/1-array.npy', '/content/2-array.npy', '/content/3-array.npy']
labels = [1, 0, 0, 1]
specgram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, labels))

specgram_ds = specgram_ds.map(
                    lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32]), label]),
                    num_parallel_calls=tf.data.AUTOTUNE)


num_files = len(specgram_files)
num_train = int(0.8 * num_files)
num_val = int(0.1 * num_files)
num_test = int(0.1 * num_files)

specgram_ds = specgram_ds.shuffle(buffer_size=1000)
specgram_train_ds = specgram_ds.take(num_train)
specgram_test_ds = specgram_ds.skip(num_train)
specgram_val_ds = specgram_test_ds.take(num_val)
specgram_test_ds = specgram_test_ds.skip(num_val)

for item in specgram_ds.take(1):
    pass

Also see this. Finally, note that using tf.print instead of print should get rid of these interleaving side effects.
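To see why the printed byte strings run together without involving TensorFlow at all, here is a minimal stdlib sketch of the same effect: several workers calling a mapping function concurrently can interleave their writes to stdout, while the mapped results themselves still come back in input order. (This uses a plain thread pool as an analogy for num_parallel_calls=tf.data.AUTOTUNE; the file names are made up.)

```python
from concurrent.futures import ThreadPoolExecutor

def load(path):
    # Stand-in for read_npy_file: each call prints its argument.
    # With several workers running at once, these print calls can
    # interleave on stdout, which is why the question's output shows
    # several paths mashed together on one line.
    print(path)
    return path.upper()

paths = ['./file_{}.npy'.format(i) for i in range(8)]

# Parallel mapping, analogous to num_parallel_calls=tf.data.AUTOTUNE:
# the order in which load() runs (and prints) is nondeterministic...
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(load, paths))

# ...but map() still returns results in input order, so the data itself
# is not corrupted -- only the printed side effects look scrambled.
assert parallel_results == [p.upper() for p in paths]
```

The same holds for the tf.data pipeline above: the dataset elements come out correctly; only the print output from concurrent read_npy_file calls gets jumbled. Setting num_parallel_calls=1 while debugging serializes the calls and the output.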
