I am quite new to TensorFlow.
I have this dataset, which is available on Kaggle. I want to read only the files from 2018,
which are in the raw directory. I can list the files using TensorFlow in the following manner:
import tensorflow as tf
data_2018 = tf.data.Dataset.list_files("./raw/*2018*")
However, this does not load the data. I also want to choose which columns are loaded; for example, I would like to load columns [1, 3, 6, 8, 10].
How can I load the data from multiple CSV files and also choose the columns?
CodePudding user response:
Try using tf.data.experimental.make_csv_dataset:
import os
import pandas as pd
import tensorflow as tf
# Create dummy data
df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
                   'mask': ['red', 'purple'],
                   'weapon': ['sai', 'bo staff']})
# Make sure the target directory exists before writing
os.makedirs("/content/raw", exist_ok=True)
df.to_csv("/content/raw/2_2018_2.csv", index=False)
df.to_csv("/content/raw/2_2018_3.csv", index=False)
Load the CSV files and select specific columns:
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="/content/raw/*2018*",
    batch_size=2,
    num_epochs=1,
    select_columns=['name', 'mask'])
# Each element is a dict mapping column name to a batch of values
for x in dataset:
    print(x['name'], x['mask'])
tf.Tensor([b'Donatello' b'Raphael'], shape=(2,), dtype=string) tf.Tensor([b'purple' b'red'], shape=(2,), dtype=string)
tf.Tensor([b'Donatello' b'Raphael'], shape=(2,), dtype=string) tf.Tensor([b'purple' b'red'], shape=(2,), dtype=string)
tf.Tensor([b'Raphael' b'Raphael'], shape=(2,), dtype=string) tf.Tensor([b'red' b'red'], shape=(2,), dtype=string)
tf.Tensor([b'Donatello' b'Donatello'], shape=(2,), dtype=string) tf.Tensor([b'purple' b'purple'], shape=(2,), dtype=string)
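Since the question asks for columns by position rather than by name, it may help to know that select_columns also accepts a list of zero-based integer indices. The following is a minimal sketch under some assumptions: the temporary directory, the file names and the header c0..c3 are made up for illustration and are not the Kaggle data, and shuffle=False is set only to keep the read order predictable.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical sample files standing in for the real CSVs (assumption).
tmp_dir = tempfile.mkdtemp()
header = "c0,c1,c2,c3\n"
samples = [
    ("a_2018_1.csv", "1,2,3,4\n5,6,7,8\n"),
    ("b_2018_2.csv", "9,10,11,12\n13,14,15,16\n"),
    ("c_2017_1.csv", "0,0,0,0\n"),  # should not match the *2018* pattern
]
for file_name, rows in samples:
    with open(os.path.join(tmp_dir, file_name), "w") as f:
        f.write(header + rows)

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=os.path.join(tmp_dir, "*2018*"),
    batch_size=2,
    num_epochs=1,
    shuffle=False,
    select_columns=[1, 3],  # zero-based positions instead of names
)

# The batches are still keyed by the header names of the selected columns.
selected_names = []
c1_values = []
for batch in dataset:
    selected_names = list(batch.keys())
    c1_values.extend(batch["c1"].numpy().tolist())

print(selected_names)
print(c1_values)
```

For the columns in the question, this would presumably become select_columns=[1, 3, 6, 8, 10], provided every matched file shares the same header.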