TensorFlow dataset with multi-dimensional Tensors from a CSV file

Is there a way to load a TensorFlow dataset with a multi-dimensional feature Tensor from a CSV file (or another input format), and if so, how?

For example, my CSV input looks like the following:

f1,  f2,  f3,                      label
0.1, 0.2, 0.1;0.2;0.3;1.1;1.2;1.3, 1
0.2, 0.3, 0.2;0.3;0.4;1.2;1.3;1.4, 0
0.3, 0.4, 0.3;0.4;0.5;1.3;1.4;1.5, 1

I'd like to load a dataset from such a file, e.g.

import tensorflow as tf

frames_csv_ds = tf.data.experimental.make_csv_dataset(
    'input.csv',
    header=False,
    column_names=['f1','f2','f3','label'],
    batch_size=5,
    label_name='label',
    num_epochs=1,
    ignore_errors=True,)

for batch, label in frames_csv_ds.take(1):
  for key, value in batch.items():
    print(f"{key:20s}: {value}")
  print()
  print(f"{'label':20s}: {label}")

To get the batch as:

f1 : [0.1   0.2   0.3  ]
f2 : [0.2   0.3   0.4  ]
f3 : [ [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]], [[0.2, 0.3, 0.4], [1.2, 1.3, 1.4]], [[0.3, 0.4, 0.5], [1.3, 1.4, 1.5]] ]
label : [1, 0, 1]

The snippet above is incomplete and doesn't work. Is there a way to get the dataset in the illustrated form? If so, can this also be done for arrays whose dimensions vary across the dataset?

CodePudding user response:

One way to do this is to read the file line by line with tf.data.TextLineDataset and parse each line yourself, splitting the f3 column on ";":

import tensorflow as tf

file_path = "data.csv"
# Read the file as raw text lines and skip the header row
dataset = tf.data.TextLineDataset(file_path).skip(1)

def parse_csv_line(line):
  # Split the line on commas; every column is read as a string
  fields = tf.io.decode_csv(line, record_defaults=[[""]] * 4)

  # Scalar columns: convert the strings to numbers
  f1 = tf.strings.to_number(fields[0], tf.float32)
  f2 = tf.strings.to_number(fields[1], tf.float32)
  # Array column: split on ";" first, then convert each piece
  f3 = tf.strings.to_number(tf.strings.split(fields[2], ";"), tf.float32)
  label = tf.strings.to_number(fields[3], tf.int32)

  return {"f1": f1, "f2": f2, "f3": f3, "label": label}

# batch() needs every f3 in a batch to have the same length
dataset = dataset.map(parse_csv_line).batch(5)

# Inspect the first batch
next(iter(dataset.take(1)))

{'f1': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.1, 0.2, 0.3], dtype=float32)>,
 'f2': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.2, 0.3, 0.4], dtype=float32)>,
 'f3': <tf.Tensor: shape=(3, 6), dtype=float32, numpy=
 array([[0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
        [0.2, 0.3, 0.4, 1.2, 1.3, 1.4],
        [0.3, 0.4, 0.5, 1.3, 1.4, 1.5]], dtype=float32)>,
 'label': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 0, 1], dtype=int32)>}
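
The output above has f3 as a flat (3, 6) tensor, and the label is still inside the feature dictionary. To get closer to the shape asked for in the question, a possible variation is sketched below. It makes several assumptions that are not in the original post: the values in f3 always come in groups of 3, the rows are fed from an in-memory list instead of a file so the example is self-contained, the third row is a made-up shorter row to show the variable-length case, and parse_line is just an illustrative name. Each f3 is reshaped to [rows, 3], and padded_batch pads shorter examples with zeros so rows of different lengths can share a batch.

```python
import tensorflow as tf

# Rows in the question's format. The last row is a hypothetical example with
# only one group of three values, to exercise the variable-length case.
lines = [
    "0.1, 0.2, 0.1;0.2;0.3;1.1;1.2;1.3, 1",
    "0.2, 0.3, 0.2;0.3;0.4;1.2;1.3;1.4, 0",
    "0.3, 0.4, 0.3;0.4;0.5, 1",
]

GROUP_SIZE = 3  # assumption: f3 values come in groups of three


def parse_line(line):
    # Read all four columns as strings so f3 can be split manually.
    f1, f2, f3, label = tf.io.decode_csv(line, record_defaults=[[""]] * 4)

    # Strip the padding spaces around each field before converting.
    f1 = tf.strings.to_number(tf.strings.strip(f1), tf.float32)
    f2 = tf.strings.to_number(tf.strings.strip(f2), tf.float32)
    label = tf.strings.to_number(tf.strings.strip(label), tf.int32)

    # Split f3 on ";" into a 1-D tensor, then fold it into [rows, GROUP_SIZE].
    f3 = tf.strings.to_number(
        tf.strings.split(tf.strings.strip(f3), ";"), tf.float32
    )
    f3 = tf.reshape(f3, [-1, GROUP_SIZE])

    # Return (features, label) so the dataset can be iterated as pairs.
    return {"f1": f1, "f2": f2, "f3": f3}, label


dataset = (
    tf.data.Dataset.from_tensor_slices(lines)
    .map(parse_line)
    # Pads every f3 in a batch with zeros up to the longest one in that batch.
    .padded_batch(5)
)

features, labels = next(iter(dataset))
for key, value in features.items():
    print(f"{key:20s}: {value.shape}")
print(f"{'label':20s}: {labels.numpy()}")
```

To read from a file instead, the from_tensor_slices(lines) call could be swapped for tf.data.TextLineDataset(file_path).skip(1) as in the answer above. If zero padding is not acceptable for the model, a ragged batching transformation such as tf.data.experimental.dense_to_ragged_batch may be worth a look, depending on the TensorFlow version in use.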