How do you convert the pandas DataFrame to tensorflow.python.data.ops.dataset_ops.PrefetchDataset

Time:09-28

Given that I have the below Tensorflow Dataset:

import tensorflow_datasets as tfds
(raw_train_ds, raw_val_ds, raw_test_ds), info = tfds.load('ag_news_subset',
                                                          split=['train[:90%]',
                                                                 'train[-90%:]',
                                                                 'test'],
                                                          with_info=True)

The type of raw_train_ds is tensorflow.python.data.ops.dataset_ops.PrefetchDataset

I need to apply the remove_stop_words() function below to the description feature of the dataset, so I convert the dataset to a DataFrame with the following code:

train_sample_df = \
    tfds.as_dataframe(raw_train_ds.shuffle(batch_size),
                      ds_info=info)[['description', 'label']]

and I must apply remove_stop_words() to this dataframe as below:

def remove_stop_words(tweet):
    tweet = tweet.decode("utf-8")
    #print(tweet," ",type(tweet))
    stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at",
                 "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did",
                 "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have",
                 "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself",
                 "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's",
                 "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only",
                 "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd",
                 "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs",
                 "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're",
                 "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we",
                 "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's",
                 "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll",
                 "you're", "you've", "your", "yours", "yourself", "yourselves"]
    tweet = tweet.lower()
    words = tweet.split(' ')
    non_stop_words = [w for w in words if w not in stopwords]
    return (" ").join(non_stop_words)

train_sample_df['description'] = train_sample_df['description'].apply(lambda tweet: remove_stop_words(tweet) if tweet is not np.nan else tweet)
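A hedged sketch of the same step with two small changes: the stop words live in a set, so membership tests do not scan a list, and `Series.map(..., na_action="ignore")` skips missing values instead of comparing against `np.nan` by identity. The shortened `STOP_WORDS` set and the toy DataFrame are assumptions made for illustration, not part of the original code.

```python
import pandas as pd

# Hypothetical shortened list; substitute the full stop-word list from above.
STOP_WORDS = frozenset(["a", "an", "and", "the", "is", "of", "to", "in"])

def remove_stop_words(text):
    """Lower-case the text and drop stop words; accepts bytes or str."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    words = text.lower().split(" ")
    return " ".join(w for w in words if w not in STOP_WORDS)

# Toy stand-in for train_sample_df, including one missing description.
df = pd.DataFrame({"description": [b"The price of oil is rising", None],
                   "label": [2, 0]})

# na_action="ignore" leaves missing values untouched instead of calling the function.
df["description"] = df["description"].map(remove_stop_words, na_action="ignore")
print(df["description"].tolist())
```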

Finally, I need to convert train_sample_df back to a tensorflow.python.data.ops.dataset_ops.PrefetchDataset, but I don't know how to do it.

Any ideas?


Update:

Thanks to @AloneTogether, I have converted the pandas DataFrame to a PrefetchDataset using the code below:

raw_train_ds = tf.data.Dataset.from_tensor_slices((train_sample_df['description'], train_sample_df['label'])).prefetch(20)

But in the next step I need to run:

def convert_ds_to_tuple(sample):

    """ the original dataset is of the form of a dict
     {description: (), label: (), title: ()}

      TF's model.fit() method required datasets to be of the form
      A tf.data dataset that returns a tuple of (inputs, targets)"""

    print("type(sample['description'])  : ",type(sample['description']))
    return sample['description'], sample['label']


# converting all datasets from dicts to tuples
raw_train_ds = raw_train_ds.map(convert_ds_to_tuple).batch(batch_size)
raw_val_ds = raw_val_ds.map(convert_ds_to_tuple).batch(batch_size)
raw_test_ds = raw_test_ds.map(convert_ds_to_tuple).batch(batch_size)

and at the line raw_train_ds = raw_train_ds.map(convert_ds_to_tuple).batch(batch_size), I get the error:

TypeError: convert_ds_to_tuple() takes 1 positional argument but 2 were given

So I guess there is still something wrong with raw_train_ds, as I do not have the same problem with raw_val_ds.

Answer:

Try using tf.data.Dataset.from_tensor_slices and then do what you want:

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((train_sample_df['description'], train_sample_df['label'])).prefetch(10) # call batch, shuffle etc.
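One way to check what a dataset yields is its `element_spec`. The sketch below uses made-up sample values in place of the DataFrame columns (an assumption for illustration) and compares a dataset built from a tuple with one built from a dict, which is the structure the datasets from `tfds.load` have.

```python
import tensorflow as tf

# Made-up sample values standing in for the DataFrame columns.
descriptions = ["oil prices rise", "team wins final"]
labels = [2, 1]

# A tuple argument yields (description, label) tuple elements.
tuple_ds = tf.data.Dataset.from_tensor_slices((descriptions, labels))

# A dict argument yields dict elements, like the datasets from tfds.load.
dict_ds = tf.data.Dataset.from_tensor_slices(
    {"description": descriptions, "label": labels})

print(tuple_ds.element_spec)
print(dict_ds.element_spec)

# Dataset.map unpacks tuple elements into separate arguments,
# so the mapped function takes two parameters here.
swapped = tuple_ds.map(lambda text, label: (label, text))

# Dict elements arrive as a single argument.
as_tuple = dict_ds.map(lambda sample: (sample["description"], sample["label"]))
```

A function written for dict elements, such as `convert_ds_to_tuple(sample)`, therefore works on `dict_ds` but not on `tuple_ds`.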

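Regarding the TypeError in your update: the dataset built this way yields (description, label) tuples, so map passes two arguments to convert_ds_to_tuple, which expects a single dict. Either skip that conversion for the training dataset, or map the tuples back to dicts first: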

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((train_sample_df['description'], train_sample_df['label'])).prefetch(10) 
dataset = dataset.map(lambda x, y: {'description': x, 'label': y})

def convert_ds_to_tuple(sample):
    return sample['description'], sample['label']

dataset = dataset.map(convert_ds_to_tuple).batch(32)
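An alternative that avoids the DataFrame round trip is to apply the stop-word removal inside the tf.data pipeline with `tf.py_function`, which keeps the dict structure that `convert_ds_to_tuple` expects. This is only a sketch under assumptions: the shortened `STOP_WORDS` set, the helper names `_remove_stop_words_py` and `clean_description`, and the toy dataset are all hypothetical and stand in for the full list and `raw_train_ds`.

```python
import tensorflow as tf

# Hypothetical shortened list; substitute the full stop-word list from the question.
STOP_WORDS = frozenset(["a", "an", "and", "the", "is", "of", "to", "in"])

def _remove_stop_words_py(text):
    """Runs eagerly inside tf.py_function; `text` is a scalar string tensor."""
    words = text.numpy().decode("utf-8").lower().split(" ")
    return " ".join(w for w in words if w not in STOP_WORDS)

def clean_description(sample):
    """Return the sample dict with a stop-word-free 'description'."""
    cleaned = tf.py_function(
        _remove_stop_words_py, [sample["description"]], tf.string)
    cleaned.set_shape([])  # py_function does not keep static shape information
    return {**sample, "description": cleaned}

# Toy stand-in for raw_train_ds, which also yields dict elements.
toy_ds = tf.data.Dataset.from_tensor_slices(
    {"description": ["The price of oil is rising"], "label": [2]})

cleaned_ds = toy_ds.map(clean_description).prefetch(tf.data.AUTOTUNE)
```

Because the elements stay dicts, the same `convert_ds_to_tuple` can then be mapped over the training, validation and test datasets alike. The trade-off is that `tf.py_function` runs ordinary Python code, so this step will likely be slower than native TensorFlow string ops.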