How to use one-hot encoded data as Keras input

I'm trying to build a model that takes a one-hot encoded input alongside a numeric feature input.

I therefore created a minimal example, starting with this DataFrame:

import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.DataFrame(
{
  'val': [1, 12, 123, 512, 12],
  'cat': [["A", "B"], ["A"], ["C"], ["C", "A", "B"], ["A"]],
  'label': [0, 1, 1, 0, 0]
})

The categorical column is then multi-hot encoded, since each row can hold several categories:

cat_list = list(set([item.strip() for sublist in df['cat'] for item in sublist]))

index = tf.keras.layers.StringLookup(vocabulary=cat_list)
encoder = tf.keras.layers.CategoryEncoding(num_tokens=index.vocabulary_size(), output_mode="multi_hot")

# Look up the integer index of each category in a row, then multi-hot encode the row.
df['cat_encoded'] = df.cat.apply(lambda x: encoder(index(x)))

This creates eager tensors; a single row element looks like:

<tf.Tensor: shape=(4,), dtype=float32, numpy=array([0., 0., 1., 1.], dtype=float32)>
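
As a rough illustration of what the two layers compute, here is a pure-Python sketch. The helper multi_hot is hypothetical and not part of Keras. It assumes StringLookup's default of reserving index 0 for out-of-vocabulary tokens, which is why three categories give a vector of length 4. The vocabulary is sorted here, whereas list(set(...)) returns an arbitrary order, so the positions may differ from the tensor above.

```python
# Hypothetical helper (not a Keras API): imitates StringLookup followed by
# CategoryEncoding(output_mode="multi_hot") for a single row.
def multi_hot(tokens, vocabulary):
    # Assumption: index 0 is the out-of-vocabulary slot, known tokens start at 1.
    lookup = {token: i + 1 for i, token in enumerate(vocabulary)}
    vector = [0.0] * (len(vocabulary) + 1)
    for token in tokens:
        vector[lookup.get(token, 0)] = 1.0
    return vector

vocabulary = ["A", "B", "C"]  # sorted so that the positions are reproducible
rows = [["A", "B"], ["A"], ["C"], ["C", "A", "B"], ["A"]]

encoded = [multi_hot(row, vocabulary) for row in rows]
for row, vector in zip(rows, encoded):
    print(row, vector)
```

Every row ends up with the same fixed width, regardless of how many categories it lists.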

I then create a model. I know it doesn't make much sense as is, but it is enough to reproduce the problem:

name_input = tf.keras.layers.Input(shape=(), name="val", dtype=tf.int32)

cat_input = tf.keras.layers.Input(shape=(), name="cat", dtype=tf.int32)

raw_inputs = {
    "val": name_input, 
    "cat": cat_input, 
}
model_concat = tf.keras.layers.Concatenate(axis=-1)(
  [name_input, cat_input]
)

model = tf.keras.Model(inputs=raw_inputs, outputs=model_concat)

When I try to train the model, I can't find the right format for the input data. I call fit like this:

model.compile()
model.fit(
  x=[np.array(df['val']), np.vstack(df['cat_encoded'])],
  y=df['label'],
)

This fails with:

ValueError: Layer "model" expects 2 input(s), but it received 3 input tensors. Inputs received: [<tf.Tensor 'data:0' shape=(None,) dtype=string>, <tf.Tensor 'data_1:0' shape=(None,) dtype=string>, <tf.Tensor 'data_2:0' shape=(None, 664) dtype=float32>]

I also tried setting the shape of the categorical input, as well as using np.array for the categorical encoding, but neither helped.

CategoryEncoding inside the model

I also tried putting the categorical encoding inside the model. That does not seem to work because the number of categories per row is not fixed, and it fails with:

all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 2 and the array at index 1 has size 1
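
The wording of this message matches what NumPy raises when asked to stack rows of different lengths. A minimal sketch, assuming the raw category lists reach a stacking step before being encoded (the first two 'cat' entries are used here):

```python
import numpy as np

rows = [["A", "B"], ["A"]]  # first two 'cat' entries from the example

try:
    # Rows of length 2 and 1 cannot form a rectangular array.
    np.vstack(rows)
    stacked = True
except ValueError as err:
    stacked = False
    message = str(err)
    print(message)
```

Encoding each row to a fixed-width vector before stacking avoids the mismatch.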

How can I fix this and feed the one-hot encoded data into my Keras model?

Answer:

Maybe try something like this:

import pandas as pd
import tensorflow as tf
import numpy as np

df = pd.DataFrame(
{
  'val': [1, 12, 123, 512, 12], 
  'cat': [["A", "B"], ["A"], ["C"], ["C", "A", "B"], ["A"]],
  'label': [0,1,1,0,0]
})
cat_list = list(set([item.strip() for sublist in df['cat'] for item in sublist]))

index = tf.keras.layers.StringLookup(vocabulary=cat_list)
encoder = tf.keras.layers.CategoryEncoding(num_tokens=index.vocabulary_size(), output_mode="multi_hot")

# Look up the integer index of each category in a row, then multi-hot encode the row.
df['cat_encoded'] = df.cat.apply(lambda x: encoder(index(x)))

name_input = tf.keras.layers.Input(shape=(1, ), name="val", dtype=tf.float32)

cat_input = tf.keras.layers.Input(shape=(4, ), name="cat", dtype=tf.float32)

model_concat = tf.keras.layers.Concatenate(axis=-1)(
  [name_input, cat_input]
)

model = tf.keras.Model(inputs=[name_input, cat_input], outputs=model_concat)
model.compile(loss='mse')
model.fit(
  x=[np.array(df['val'])[..., None], np.vstack(df['cat_encoded'])],
  y=df['label'],
)

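Note that I changed the dtype of both Input layers to tf.float32 for now, because gradients cannot be calculated directly for integer inputs, and gave them explicit shapes: (1,) for val and (4,) for cat. The inputs are also passed to the model as a list, in the same order as the arrays given to fit. Your model is currently not doing much beyond concatenating its inputs. The expression np.array(df['val'])[..., None] adds a trailing dimension to val, turning its shape from (5,) into (5, 1), so it can be fed to your model.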
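
The shapes involved can be checked with NumPy alone. In this sketch, cat_rows is a hand-written stand-in for df['cat_encoded'] (assumed values, with index 0 as the out-of-vocabulary slot), and the final np.concatenate only imitates what the Concatenate layer produces:

```python
import numpy as np

val = np.array([1, 12, 123, 512, 12])

# Assumed stand-in for df['cat_encoded']: one length-4 multi-hot vector per row.
cat_rows = [
    np.array([0., 1., 1., 0.], dtype="float32"),
    np.array([0., 1., 0., 0.], dtype="float32"),
    np.array([0., 0., 0., 1.], dtype="float32"),
    np.array([0., 1., 1., 1.], dtype="float32"),
    np.array([0., 1., 0., 0.], dtype="float32"),
]

val_2d = val[..., None].astype("float32")  # (5,) -> (5, 1), matches Input(shape=(1,))
cat_2d = np.vstack(cat_rows)               # five (4,) vectors -> (5, 4), matches Input(shape=(4,))

# Imitates Concatenate(axis=-1): one numeric column plus four category columns.
features = np.concatenate([val_2d, cat_2d], axis=-1)
print(val_2d.shape, cat_2d.shape, features.shape)
```

Both arrays share the batch dimension of 5, which is what lets them be joined along the last axis.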