Extract attributes from the original dataframe used to create a tensorflow dataset

Time:03-04

I have the following dataframe df:

             sales
2015-10-05  -0.462626
2015-10-06  -0.540147
2015-10-07  -0.450222
2015-10-08  -0.448672
2015-10-09  -0.451773
... ...
2019-10-16  -0.594413
2019-10-17  -0.620770
2019-10-18  -0.586660
2019-10-19  -0.586660
2019-10-20  -0.671934
11340 rows × 1 columns

which I turn into a tf.data.Dataset like so:

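import numpy as np
import tensorflow as tf

data = np.array(df)  # shape (11340, 1): one row per date, one column (sales)
ds = tf.keras.utils.timeseries_dataset_from_array(
    data=data,
    targets=None,
    sequence_length=4,
    sequence_stride=1,
    shuffle=False,
    batch_size=1,
)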

The dataset yields records that look like this:

print(next(iter(ds)))
tf.Tensor(
[[[-0.4626256 ]
  [-0.54014736]
  [-0.4502221 ]
  [-0.44867167]]], shape=(1, 4, 1), dtype=float32)

I use these records to train my ML model, but I need a way to find the dates corresponding to the values I fetch from the dataset. For the example record above, I want the dates of those consecutive values, which the dataframe shows are [2015-10-05, 2015-10-06, 2015-10-07, 2015-10-08]. Ideally, I would also like to retrieve other attributes if the dataframe has several columns. Is there a way to do this?

CodePudding user response:

You could try using another dataset as a lookup. That way you can add further attributes if needed:

import pandas as pd
import numpy as np
import tensorflow as tf

df = pd.DataFrame(data={'date': ['2015-10-05', '2015-10-06', '2015-10-07', '2015-10-08', '2015-10-09', '2019-10-16', '2019-10-17', '2019-10-18', '2019-10-19', '2019-10-20'],
                        'sales': [-0.462626, -0.540147, -0.450222, -0.448672, -0.451773, -0.594413, -0.620770, -0.586660, -0.586660, -0.671934]})


data = np.array(df['sales'])
ds = tf.keras.utils.timeseries_dataset_from_array(
    data=data,
    targets=None,
    sequence_length=4,
    sequence_stride=1,
    shuffle=False,
    batch_size=1,)

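# Build windows of 4 consecutive dates, shifted by 1, mirroring the windows in ds.
# The final batch(1) gives each element shape (1, 4), matching the elements of ds.
dates = tf.data.Dataset.from_tensor_slices(df['date'].to_numpy())
dates = dates.window(4, shift=1, stride=1, drop_remainder=True).flat_map(lambda w: w.batch(4)).batch(1)
# Pair each date window with the value window at the same position.
d = tf.data.Dataset.zip((dates, ds))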

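def lookup(tensor, dataset):
  # Keep only the (dates, values) pairs whose values equal the queried window.
  dataset = dataset.filter(lambda x, y: tf.reduce_all(tf.equal(y, tensor)))
  # Take the first match and decode its date strings from bytes.
  return [x.numpy().decode('utf-8') for x in list(dataset.map(lambda x, y: tf.squeeze(x, axis=0)))[0]]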

result = lookup(next(iter(ds)), d)
print(result)
['2015-10-05', '2015-10-06', '2015-10-07', '2015-10-08']
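
Since the dataset is built with shuffle=False, another option might be to skip the value comparison and look rows up by position instead. The sketch below assumes batch_size=1 and the default start_index and sampling_rate, so that the k-th element of ds covers rows k * sequence_stride up to k * sequence_stride + sequence_length - 1 of the dataframe. The helper name rows_for_window and the five-row stand-in frame (with the dates as the index) are made up for illustration:

```python
import pandas as pd

# Small stand-in for the dataframe in the question, with the dates as the index.
df = pd.DataFrame(
    {"sales": [-0.462626, -0.540147, -0.450222, -0.448672, -0.451773]},
    index=pd.to_datetime(
        ["2015-10-05", "2015-10-06", "2015-10-07", "2015-10-08", "2015-10-09"]
    ),
)

SEQUENCE_LENGTH = 4
SEQUENCE_STRIDE = 1


def rows_for_window(frame, k, sequence_length=SEQUENCE_LENGTH,
                    sequence_stride=SEQUENCE_STRIDE):
    """Return the rows of `frame` that make up the k-th window (0-based).

    Assumes the windows were produced in order (shuffle=False), one per batch.
    """
    start = k * sequence_stride
    stop = start + sequence_length
    if k < 0 or stop > len(frame):
        raise IndexError(f"window {k} is out of range")
    return frame.iloc[start:stop]


# Window 0 corresponds to the first record fetched from ds.
first = rows_for_window(df, 0)
first_dates = [d.strftime("%Y-%m-%d") for d in first.index]
print(first_dates)
# ['2015-10-05', '2015-10-06', '2015-10-07', '2015-10-08']
```

Because the helper returns whole rows, any other columns in the dataframe come along with the dates. While iterating, the window number can be tracked with enumerate(ds). This does not hold once the dataset is shuffled; in that case the windows would need to carry their own positions.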