I have data with different dtypes and I would like to build a windowed dataset. Previously, I asked this question where I dealt with homogeneous data (used one-hot encoding). If I have a dataframe with different dtypes, I need to use a dictionary, and the accepted solution that uses flat_map doesn't work (AttributeError: 'dict' object has no attribute 'batch'). For example, if I don't use flat_map:
window_size_x = 3
window_size_y = 2
shift_size = 1
x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10)})
y = np.arange(10)
x = x[:-window_size_y]
y = y[window_size_x:]
ds_x = tf.data.Dataset.from_tensor_slices(dict(x)).window(window_size_x, shift=shift_size, drop_remainder=True)
ds_y = tf.data.Dataset.from_tensor_slices(y).window(window_size_y, shift=shift_size, drop_remainder=True)
dataset = tf.data.Dataset.zip((ds_x, ds_y))
for i, j in dataset.take(1):
    print(i, j)
Output:
{'col1': <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>, 'col2': <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>} <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
When I feed this dataset into a model with multiple inputs, I get the following error:
TypeError: Inputs to a layer should be tensors. Got: <tensorflow.python.data.ops.dataset_ops._NestedVariant object at 0x7f4626fbfc10>
So, I need to transform my _NestedVariant data into tensors. Thank you!
CodePudding user response:
You can do this with a bit of a hack. It gets messy when you try to zip() data in different structures (e.g. a dict of arrays (x) and a plain array (y)). I'm not sure if it's possible (I got weird errors), so I'm collating both x and y into a single dict.
import tensorflow as tf
import pandas as pd
import numpy as np
window_size_x = 3
window_size_y = 2
shift_size = 1
x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10)})
y = np.arange(10)
x = x[:-window_size_y]
y = y[window_size_x:]
ds_x = tf.data.Dataset.from_tensor_slices(dict(x)).window(window_size_x, shift=shift_size, drop_remainder=True)
ds_y = tf.data.Dataset.from_tensor_slices(y).window(window_size_y, shift=shift_size, drop_remainder=True)
dataset = tf.data.Dataset.zip((ds_x, ds_y)).flat_map(
    # zip the data in the dict to a tf.data.Dataset
    lambda window_x, window_y: tf.data.Dataset.zip(
        # Here we are collating x and y to a single dict
        {**{k: v.batch(window_size_x) for k, v in window_x.items()}, "y": window_y.batch(window_size_y)}
    )
)
If you don't like both x and y being in the same dict, you can split them apart again using map():
dataset = dataset.map(lambda data_dict: ({k: v for k, v in data_dict.items() if k != "y"}, data_dict["y"]))
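Putting the pieces of this answer together, the whole pipeline might look roughly like the sketch below. The helper names collate and split_features_label are made up for illustration, and a plain dict of arrays stands in for the DataFrame (already trimmed the same way as above), so treat it as one possible arrangement rather than the only way to do it.

```python
import numpy as np
import tensorflow as tf

window_size_x = 3
window_size_y = 2
shift_size = 1

# Same toy data as above, already trimmed so x and y line up.
# Assumption: a plain dict replaces the DataFrame to avoid needing pandas.
x = {"col1": list("abcdefgh"), "col2": np.arange(8)}
y = np.arange(3, 10)

ds_x = tf.data.Dataset.from_tensor_slices(x).window(
    window_size_x, shift=shift_size, drop_remainder=True)
ds_y = tf.data.Dataset.from_tensor_slices(y).window(
    window_size_y, shift=shift_size, drop_remainder=True)


def collate(window_x, window_y):
    # Hypothetical helper: every value in window_x (and window_y itself)
    # is a small dataset; batching it by the window size yields one tensor.
    merged = {k: v.batch(window_size_x) for k, v in window_x.items()}
    merged["y"] = window_y.batch(window_size_y)
    return tf.data.Dataset.zip(merged)


def split_features_label(data_dict):
    # Hypothetical helper: everything except "y" is treated as a model input.
    features = {k: v for k, v in data_dict.items() if k != "y"}
    return features, data_dict["y"]


dataset = (
    tf.data.Dataset.zip((ds_x, ds_y))
    .flat_map(collate)
    .map(split_features_label)
)

features, label = next(iter(dataset))
print(features, label)
```

Each element is then a (dict of feature tensors, label tensor) pair, which is the shape a Keras model with named inputs usually expects.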
CodePudding user response:
Maybe something like this:
import tensorflow as tf
import pandas as pd
import numpy as np
window_size_x = 3
window_size_y = 2
shift_size = 1
x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10)})
y = np.arange(10)
x = x[:-window_size_y]
y = y[window_size_x:]
ds_x = tf.data.Dataset.from_tensor_slices(dict(x)).window(window_size_x, shift=shift_size, drop_remainder=True).flat_map(lambda x: tf.data.Dataset.zip((x['col1'].batch(window_size_x), x['col2'].batch(window_size_x))))
ds_y = tf.data.Dataset.from_tensor_slices(y).window(window_size_y, shift=shift_size, drop_remainder=True).flat_map(lambda x: x.batch(window_size_y))
dataset = tf.data.Dataset.zip((ds_x, ds_y))
for i, j in dataset.take(1):
    print(i, j)
(<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'a', b'b', b'c'], dtype=object)>, <tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 2])>) tf.Tensor([3 4], shape=(2,), dtype=int64)
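To sanity-check whichever pipeline you end up with, it can help to compute the expected windows in plain Python first. The sketch below does not use TensorFlow at all; sliding_windows is a made-up helper that is only meant to imitate window(size, shift=..., drop_remainder=True), and zipping stops at the shorter side just like Dataset.zip.

```python
def sliding_windows(values, size, shift=1):
    """Hypothetical helper: full windows only, like drop_remainder=True."""
    return [list(values[i:i + size])
            for i in range(0, len(values) - size + 1, shift)]


window_size_x = 3
window_size_y = 2
shift_size = 1

columns = {"col1": list("abcdefghij"), "col2": list(range(10))}
targets = list(range(10))

# Same trimming as in the question: drop the tail of x and the head of y.
columns = {k: v[:-window_size_y] for k, v in columns.items()}
targets = targets[window_size_x:]

x_windows = {k: sliding_windows(v, window_size_x, shift_size)
             for k, v in columns.items()}
y_windows = sliding_windows(targets, window_size_y, shift_size)

# Pair the i-th feature window with the i-th target window.
n = min(len(y_windows), len(x_windows["col1"]))
expected = [({k: w[i] for k, w in x_windows.items()}, y_windows[i])
            for i in range(n)]

for features, label in expected:
    print(features, label)
```

The first pair printed should line up with the first element of the dataset above (features a, b, c / 0, 1, 2 and labels 3, 4), apart from strings being bytes on the TensorFlow side.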