How to stream multi-file (b, t, f)-shaped data into a TensorFlow Dataset


I have a large dataset that I want to load into a TensorFlow Dataset in order to train an LSTM. Because of its size, I want to stream the data rather than read it all into memory. I am struggling to read my data so that each sample i keeps its shape (ti, m), where ti is the number of time-steps of sample i and m is the number of features.

Sample code to replicate:

import numpy as np
import os

# One hundred samples, each with three features.
# The second dim is the number of time-steps per
# sample; it is randomized in a step below.
x = np.random.randn(100, 10, 3)
# One hundred {0, 1} labels
y = (np.random.rand(100) > 0.5) * 1
y = y.reshape((-1, 1))

# Save each sample in its own file, in a directory
# named after its label
os.makedirs('tmp_csv/0', exist_ok=True)
os.makedirs('tmp_csv/1', exist_ok=True)
for i in range(len(x)):
  cat = y[i][0]
  data = x[i]
  # Simulate a random length for each sample
  data = data[:np.random.randint(4, 10), :]
  fname = 'tmp_csv/{:.0f}/{:03.0f}.csv'.format(cat, i)
  np.savetxt(fname, data, delimiter=',')

Now I have one hundred CSV files, each containing a single sample of shape (ti, 3). How can I read these files back into a TensorFlow Dataset while preserving each sample's shape?

I tried serialization (but didn't know how to do it properly), flattening each sample into one row (but couldn't handle the variable row lengths or the reshaping), and plain make_csv_dataset. Here is my make_csv_dataset attempt:

import tensorflow as tf

ds = tf.data.experimental.make_csv_dataset(
  file_pattern="tmp_csv/*/*.csv",
  batch_size=10, num_epochs=1,
  num_parallel_reads=5,
  shuffle_buffer_size=10,
  header=False,
  column_names=['a', 'b', 'c']
)

for i in ds.take(1):
  print(i)

...but this results in each sample having shape (1, 3): every CSV row is read as its own sample, so the (ti, 3) grouping per file is lost.

CodePudding user response:

The problem is that make_csv_dataset interprets every row of every CSV file as a separate sample. Instead, you can list the files, map each one to its own CsvDataset, and stack its rows back into a single (ti, 3) tensor. You could try something like this, although I am not sure how efficient it is for your use case:

import tensorflow as tf
import numpy as np

# Same data as above, but saved flat under tmp_csv/
# with the label as the first character of the file
# name (e.g. tmp_csv/1042.csv) instead of in
# per-label subdirectories
x = np.random.randn(100, 10, 3)
y = (np.random.rand(100) > 0.5) * 1
y = y.reshape((-1, 1))

for i in range(len(x)):
  cat = y[i][0]
  data = x[i]
  # Simulate a random length for each sample
  data = data[:np.random.randint(4, 10), :]
  fname = 'tmp_csv/{:.0f}{:03.0f}.csv'.format(cat, i)
  np.savetxt(fname, data, delimiter=',')

def columns_to_tensor(data_from_one_csv):
  # Collect the rows of one file into a dynamically
  # sized TensorArray, then stack them into a single
  # (ti, 3) tensor
  ta = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
  for i, t in enumerate(data_from_one_csv):
    ta = ta.write(tf.cast(i, dtype=tf.int32), tf.stack([t[0], t[1], t[2]], axis=0))
  return ta.stack()

files = tf.data.Dataset.list_files("tmp_csv/*.csv")
# Each element of ds is a CsvDataset over the rows of one file
ds = files.map(lambda file: tf.data.experimental.CsvDataset(file, record_defaults=[tf.float32, tf.float32, tf.float32], header=False))
ds = ds.map(columns_to_tensor)
for i, j in enumerate(ds):
  print(i, j.shape)
0 (5, 3)
1 (9, 3)
2 (5, 3)
3 (6, 3)
4 (8, 3)
...
99 (9, 3)
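
Note that this pipeline yields only the features. For training you also need the labels, which in this flat layout live in the first character of each file name. As a sketch of one way to recover them (the helper load_sample below is my own addition, not part of the original answer), you can fold the CSV parsing and the label extraction into a single map, so that features and labels cannot be misaligned by shuffling:

def load_sample(path):
  # Parse one CSV file into a (ti, 3) feature tensor
  rows = tf.data.experimental.CsvDataset(
      path, record_defaults=[tf.float32, tf.float32, tf.float32],
      header=False)
  features = columns_to_tensor(rows)
  # 'tmp_csv/1042.csv' -> the first character of the
  # file name is the {0, 1} label
  fname = tf.strings.split(path, '/')[-1]
  label = tf.strings.to_number(
      tf.strings.substr(fname, 0, 1), out_type=tf.int32)
  return features, label

labeled_ds = files.map(load_sample)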

Afterwards, batch with your desired batch size. Since the samples have different lengths ti, plain ds.batch will fail on the ragged first dimension; use ds.padded_batch instead.
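
For example, a minimal sketch (the batch size of 10 and the zero-padding are arbitrary choices; a Keras LSTM can be told to skip the padded steps with a tf.keras.layers.Masking layer):

# Pad each (ti, 3) sample up to the longest ti in its batch
batched = ds.padded_batch(10, padded_shapes=tf.TensorShape([None, 3]))

for batch in batched.take(1):
  print(batch.shape)  # (10, longest_ti_in_batch, 3)

If you attached the labels as sketched above, pass a pair of padded shapes instead, e.g. labeled_ds.padded_batch(10, padded_shapes=(tf.TensorShape([None, 3]), tf.TensorShape([]))).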
