Initialise Keras StringLookup with DataFrame list column

I have data in a pd.DataFrame column that has the following format:

   col
0  ['str1', 'str2', 'str3']
1  []
2  ['str1']
3  ['str20']

I am using the following code to construct a lookup layer:

lookup_layer = tf.keras.layers.StringLookup(max_tokens=335)
lookup_layer.adapt(df.col)

Which fails with:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

I also tried flattening the column into a single sequence, since the error suggested that the nested lists were the issue:

lookup_layer.adapt(itertools.chain(*df.col))

which resulted in:

AttributeError: 'str' object has no attribute 'shape'

I also tried various tf.cast/tf.convert_to_tensor calls, to no avail.

How can I convert my DataFrame string list column into something Tensorflow accepts?

CodePudding user response：

You can build a ragged tensor from the col Series with tf.ragged.constant, which supports rows of different lengths, and pass that to adapt:

lookup_layer = tf.keras.layers.StringLookup(max_tokens=335)
lookup_layer.adapt(tf.ragged.constant(df.col))

CodePudding user response：

You have to flatten your list of string lists into a single flat list of strings; the StringLookup layer can then adapt to it:

import pandas as pd
import tensorflow as tf

d = {'col': [['str1', 'str2', 'str3'], [], ['str1', 'str2', 'str3'], ['str1', 'str2', 'str3']]}
df = pd.DataFrame(data=d)

lookup_layer = tf.keras.layers.StringLookup(max_tokens=335)
flattened_data = sum(list(df.col), [])
lookup_layer.adapt(flattened_data)
print(lookup_layer.get_vocabulary())

This prints:

['[UNK]', 'str3', 'str2', 'str1']

Also check out this post on the performance of different list flattening methods.
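
On the flattening step itself, here is a minimal sketch of one linear-time alternative to sum(..., []), using only the standard library. The rows variable is a hypothetical stand-in for df.col.tolist(), so the snippet runs without pandas or TensorFlow installed.

```python
from itertools import chain

# Hypothetical stand-in for df.col.tolist(): one list of strings per row.
rows = [["str1", "str2", "str3"], [], ["str1"], ["str20"]]

# chain.from_iterable visits each inner list once, so the cost grows
# linearly with the total number of strings. sum(rows, []) instead builds
# a new, longer list at every step.
# list(...) turns the lazy iterator into a concrete flat list of strings.
flattened_data = list(chain.from_iterable(rows))

print(flattened_data)
# ['str1', 'str2', 'str3', 'str1', 'str20']
```

The resulting flattened_data can be passed to lookup_layer.adapt(flattened_data) exactly as in the answer above. The list(...) call is probably what the itertools.chain(*df.col) attempt in the question was missing: that attempt handed adapt a bare iterator rather than a list.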