Combine two Tensorflow Datasets with heterogeneous data


I have data with different dtypes and I would like to build a windowed dataset. Previously, I asked this question, where I dealt with homogeneous data (using one-hot encoding). If I have a dataframe with different dtypes, I need to use a dictionary, and the accepted solution that uses flat_map doesn't work (AttributeError: 'dict' object has no attribute 'batch'). For example, if I don't use flat_map:

import tensorflow as tf
import pandas as pd
import numpy as np

window_size_x = 3
window_size_y = 2
shift_size = 1

x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10)})
y = np.arange(10)

x = x[:-window_size_y]
y = y[window_size_x:]

ds_x = tf.data.Dataset.from_tensor_slices(dict(x)).window(window_size_x, shift=shift_size, drop_remainder=True)
ds_y = tf.data.Dataset.from_tensor_slices(y).window(window_size_y, shift=shift_size, drop_remainder=True)
dataset = tf.data.Dataset.zip((ds_x, ds_y))

for i, j in dataset.take(1):
  print(i, j)

Output:

{'col1': <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>, 'col2': <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>} <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>

When I feed this dataset into a model with multiple inputs, I get the following error:

TypeError: Inputs to a layer should be tensors. Got: <tensorflow.python.data.ops.dataset_ops._NestedVariant object at 0x7f4626fbfc10>

So, I need to transform my _NestedVariant data into tensors. Thank you!

CodePudding user response:

You can do this with a bit of a hack. It gets a bit messy when you try to zip() data in different structures (e.g. a dict of arrays (x) and a plain array (y)); I'm not sure it's possible (I got weird errors). So I'm collating both x and y into a single dict.

import tensorflow as tf
import pandas as pd
import numpy as np

window_size_x = 3
window_size_y = 2
shift_size = 1


x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10)})
y = np.arange(10)

x = x[:-window_size_y]
y = y[window_size_x:]

ds_x = tf.data.Dataset.from_tensor_slices(dict(x)).window(window_size_x, shift=shift_size, drop_remainder=True)
ds_y = tf.data.Dataset.from_tensor_slices(y).window(window_size_y, shift=shift_size, drop_remainder=True)
dataset = tf.data.Dataset.zip((ds_x, ds_y)).flat_map(
    # Zip the windowed sub-datasets back into a single tf.data.Dataset,
    # collating the x columns and y into one dict
    lambda window_x, window_y: tf.data.Dataset.zip(
        {**{k: v.batch(window_size_x) for k, v in window_x.items()},
         "y": window_y.batch(window_size_y)}
    )
)
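
As a quick sanity check (not part of the original answer), iterating over the result should now yield plain eager tensors collected in a single dict, with 'col1' and 'col2' windows of length 3 and a 'y' window of length 2:

for window in dataset.take(1):
    # Expect a dict of tensors: 'col1' -> (3,) strings, 'col2' -> (3,) ints, 'y' -> (2,) ints
    print(window)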

If you don't like having both x and y in the same dict, you can split them back apart with map():

dataset = dataset.map(lambda data_dict: ({k: v for k, v in data_dict.items() if k != "y"}, data_dict["y"]))
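
If you want to confirm the structure without iterating, the dataset's element_spec (a standard tf.data attribute) should now show a (features_dict, label) pair, which is the layout Keras expects:

# The features come back as a dict of TensorSpecs keyed by column name,
# paired with the TensorSpec for "y"
print(dataset.element_spec)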

CodePudding user response:

Maybe something like this:

import tensorflow as tf
import pandas as pd
import numpy as np

window_size_x = 3
window_size_y = 2
shift_size = 1

x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10)})
y = np.arange(10)

x = x[:-window_size_y]
y = y[window_size_x:]

ds_x = tf.data.Dataset.from_tensor_slices(dict(x)).window(
    window_size_x, shift=shift_size, drop_remainder=True).flat_map(
        lambda x: tf.data.Dataset.zip((x['col1'].batch(window_size_x),
                                       x['col2'].batch(window_size_x))))
ds_y = tf.data.Dataset.from_tensor_slices(y).window(
    window_size_y, shift=shift_size, drop_remainder=True).flat_map(
        lambda x: x.batch(window_size_y))
dataset = tf.data.Dataset.zip((ds_x, ds_y))

for i, j in dataset.take(1):
  print(i, j)

Output:

(<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'a', b'b', b'c'], dtype=object)>, <tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 2])>) tf.Tensor([3 4], shape=(2,), dtype=int64)
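
Since the original error appeared when feeding the dataset to a model with multiple inputs, here is a minimal sketch (an illustrative assumption, not part of the original answer) of a two-input Keras model that can consume this (col1, col2) / y structure; the StringLookup vocabulary, embedding size, layer sizes, and batch size are all made up.

# Minimal sketch of a two-input model for the ((col1, col2), y) elements above.
# The preprocessing and layer sizes are illustrative assumptions.
inp_col1 = tf.keras.Input(shape=(window_size_x,), dtype=tf.string)   # windowed string column
inp_col2 = tf.keras.Input(shape=(window_size_x,), dtype=tf.int64)    # windowed numeric column

# Index the strings and embed them (vocabulary assumed to be the letters used above)
lookup = tf.keras.layers.StringLookup(vocabulary=list('abcdefghij'))
embedded = tf.keras.layers.Embedding(lookup.vocabulary_size(), 4)(lookup(inp_col1))
embedded = tf.keras.layers.Flatten()(embedded)

# Cast the integer column to float so it can be concatenated with the embeddings
numeric = tf.keras.layers.Lambda(lambda t: tf.cast(t, tf.float32))(inp_col2)

merged = tf.keras.layers.Concatenate()([embedded, numeric])
output = tf.keras.layers.Dense(window_size_y)(merged)  # predict the 2-step target window

model = tf.keras.Model([inp_col1, inp_col2], output)
model.compile(optimizer='adam', loss='mse')

# Batch the windowed dataset before training; each element is ((col1, col2), y)
model.fit(dataset.batch(2), epochs=1)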