I have data in a pd.DataFrame column that has the following format:
col
0 ['str1', 'str2', 'str3']
1 []
2 ['str1']
3 ['str20']
I am using the following code to construct a lookup layer:
lookup_layer = tf.keras.layers.StringLookup(max_tokens=335)
lookup_layer.adapt(df.col)
Which fails with:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
I also tried flattening the column into a single list, since the error suggested that the nested lists were the issue:
lookup_layer.adapt(itertools.chain(*df.col))
which resulted in:
AttributeError: 'str' object has no attribute 'shape'
I also tried various tf.cast/tf.convert_to_tensor calls, to no avail.
How can I convert my DataFrame string list column into something TensorFlow accepts?
CodePudding user response:
As an alternative, you can call tf.ragged.constant on your col pd.Series:
lookup_layer = tf.keras.layers.StringLookup(max_tokens=335)
lookup_layer.adapt(tf.ragged.constant(df.col))
CodePudding user response:
You have to flatten your list of string lists into a single list, and then your StringLookup layer should work:
import pandas as pd
import tensorflow as tf
import numpy as np
d = {'col': [['str1', 'str2', 'str3'], [], ['str1', 'str2', 'str3'], ['str1', 'str2', 'str3']]}
df = pd.DataFrame(data=d)
lookup_layer = tf.keras.layers.StringLookup(max_tokens=335)
flattened_data = sum(list(df.col), [])
lookup_layer.adapt(flattened_data)
print(lookup_layer.get_vocabulary())
['[UNK]', 'str3', 'str2', 'str1']
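Note that the printed vocabulary starts with the out-of-vocabulary token '[UNK]', and that max_tokens counts that slot as well. As a rough sanity check before adapting, you could count the distinct strings yourself. This is only a sketch: `rows` is a hypothetical stand-in for list(df.col), and `oov_slots = 1` assumes the layer's default of a single OOV index with no mask token.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for list(df.col) from the example above.
rows = [["str1", "str2", "str3"], [], ["str1", "str2", "str3"], ["str1", "str2", "str3"]]

# Count how often each string occurs across all rows.
counts = Counter(chain.from_iterable(rows))

max_tokens = 335  # same value as passed to StringLookup above
oov_slots = 1     # assumption: default single OOV token, no mask token

# True when every distinct string plus the OOV slot fits in the vocabulary.
fits = len(counts) + oov_slots <= max_tokens
print(len(counts), fits)
```

If `fits` is False, the least frequent strings would be expected to fall outside the vocabulary and map to '[UNK]'.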
Also check out this post on the performance of different list flattening methods.
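For larger columns, sum(..., []) builds a new list at every step, so a linear-time flatten may be preferable. Below is a minimal sketch using itertools.chain.from_iterable; `rows` is a hypothetical stand-in for list(df.col). The list(...) call matters because chain returns a lazy iterator, not a list.

```python
from itertools import chain

# Hypothetical stand-in for list(df.col) from the question.
rows = [["str1", "str2", "str3"], [], ["str1"], ["str20"]]

# Quadratic in the worst case: each + copies the accumulated list.
flattened_sum = sum(rows, [])

# Linear: chain walks each inner list once; list() materializes the result.
flattened_chain = list(chain.from_iterable(rows))

print(flattened_chain)
# ['str1', 'str2', 'str3', 'str1', 'str20']
```

Both variants produce the same plain list of strings, so either result can be passed to lookup_layer.adapt in place of flattened_data.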