Merging text and image keras layers not working


Please be gentle with me. I'm trying to concatenate two inputs, one for images and one for text.

I'm not an expert and I'm new to the functional API, so it's hard for me to identify the problem here.

In the code below, I confirmed that I can train both the text_features and image_features models on their own, but when I try to train the end-to-end model it raises the error:

ValueError: Failed to find data adapter that can handle input: (<class 'dict'> containing {"<class 'str'>"} keys and {"<class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>"} values), <class 'NoneType'>

I can imagine that I'm facing a rather basic problem, but I couldn't find a simple example where both images and text are used, so I can't see where the mistake is.

I will copy the entire code I use and will try to comment on each step so this doesn't become a general debugging issue.

import matplotlib.pyplot as plt
import numpy as np
import os
import PIL
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import re
import string

First, I define a common batch size and seed for both the text and image datasets. Both my images and text files are saved in a single set of 25 folders.

Let's say folder one has a file called sample_1.png. It also has a file called sample_1.txt, which corresponds to the text associated with that image, stored as a single string (written out with json).
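So the directory layout I'm working with looks roughly like this (the class folder names below are just placeholders):

NEURAL/
    class_01/
        sample_1.png
        sample_1.txt
        sample_2.png
        sample_2.txt
    ...
    class_25/
        ...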

batch_size = 32
seed = 42

Then, I load the text data. Here, I try to follow this example: basic text classification without recurrent layers. The only difference is that my output is not binary.

raw_text_train_ds =tf.keras.utils.text_dataset_from_directory(
    'NEURAL', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='training', 
    seed=seed)
raw_text_val_ds = tf.keras.utils.text_dataset_from_directory(
    'NEURAL', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='validation', 
    seed=seed)

I follow the preprocessing steps of the referenced example, except that I've already cleaned my text of punctuation and the like beforehand.

max_features = 7000
sequence_length = 250    
vectorize_layer = layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)
train_text = raw_text_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label
text_train_ds = raw_text_train_ds.map(vectorize_text)
text_val_ds = raw_text_val_ds.map(vectorize_text)

Before applying the AUTOTUNE part of the mentioned example, I load the image dataset, trying to follow this example: image classification with a data augmentation layer

img_height = 180
img_width = 180
img_train_ds = tf.keras.utils.image_dataset_from_directory(
  'NEURAL',
  validation_split=0.2,
  subset="training",
  seed=seed,
  image_size=(img_height, img_width),
  batch_size=batch_size)
img_val_ds = tf.keras.utils.image_dataset_from_directory(
  'NEURAL',
  validation_split=0.2,
  subset="validation",
  seed=seed,
  image_size=(img_height, img_width),
  batch_size=batch_size)

I wonder if applying the following data augmentation layer is causing some sort of mismatch, but I don't think so. Again, I'm pretty sure my mistake is more basic than anything else.

data_augmentation = keras.Sequential(
  [
    layers.RandomRotation(0.04,
                         input_shape=(img_height,
                                      img_width,
                                      3)),
    layers.RandomZoom(0.1),
  ]
)

As both referenced examples recommend applying the following AUTOTUNE prefetching, I do it for both datasets at once.

AUTOTUNE = tf.data.AUTOTUNE

text_train_ds = text_train_ds.cache().prefetch(buffer_size=AUTOTUNE)
text_val_ds = text_val_ds.cache().prefetch(buffer_size=AUTOTUNE)
# test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
img_train_ds = img_train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
img_val_ds = img_val_ds.cache().prefetch(buffer_size=AUTOTUNE)

Here I define two models, as they are defined in the examples I tried to follow, adapting them to the functional API approach.

num_classes = 25 


text_input = keras.Input(shape=(None,), name="text")  
text_features = layers.Embedding(max_features + 1, 16)(text_input)
text_features = layers.Dropout(0.2)(text_features)
text_features = layers.GlobalAveragePooling1D()(text_features)
text_features = layers.Dropout(0.2)(text_features)
text_features = layers.Dense(32)(text_features)
text_features = keras.Model(text_input,text_features)

image_input = keras.Input(shape=(180, 180, 3),name="image")
image_features=data_augmentation(image_input)
image_features=layers.Rescaling(1./255)(image_features)
image_features=layers.Conv2D(16, 3, padding='same', activation='relu')(image_features)
image_features=layers.MaxPooling2D()(image_features)
image_features=layers.Conv2D(32, 3, padding='same', activation='relu')(image_features)
image_features= layers.MaxPooling2D()(image_features)
image_features=layers.Conv2D(64, 3, padding='same', activation='relu')(image_features)
image_features=layers.MaxPooling2D()(image_features)
image_features=layers.Dropout(0.2)(image_features)
image_features=layers.Flatten()(image_features)
image_features=layers.Dense(128, activation='relu')(image_features)
image_features=layers.Dense(32, activation='relu')(image_features)
image_features=keras.Model(image_input,image_features)

x = layers.concatenate([text_features.output, image_features.output])
category_pred = layers.Dense(num_classes, name="classes")(x)


model = keras.Model(
    inputs=[text_input, image_input],
    outputs=[category_pred],)
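As a sanity check (not part of my original script), printing the merged model confirms that both branches feed into the concatenate layer:

model.summary()  # should list the "text" and "image" inputs and the concatenate layer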

I tried different losses, metrics and optimizers, just to try my way out of the problem.

I feel like it's maybe a semantic problem, as the error suggests (remember, not an expert here) that the model doesn't understand what I'm trying to pass as input. But this is how inputs are passed in the examples I studied, so I'm lost.

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
epochs = 1
checkpoint_path = "training_3_test_ojo/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
model.fit(
    {'image':img_train_ds,
     'text':text_train_ds,
     },
    epochs=epochs,
    batch_size=32,)

The problem could be that I naively thought I could just load my two datasets independently and expect the model to find a way to combine them.

I'm not specifying the expected output for training, again assuming that the model will extract it from the inputs. But I tried specifying it and it didn't make any difference. I would also vote for this as the problem with my code. Not specifying the expected output works for the 'image classification' example I used, but I do realize it doesn't have to work for a model with multiple inputs.

I will appreciate any solution, guidance or reference.

CodePudding user response:

As per the documentation on fit(), if you pass a dictionary as x, its values need to be NumPy arrays or tensors (the keys just map input names to those arrays). You're passing tensorflow.python.data.ops.dataset_ops.PrefetchDataset objects as the values, and fit() has no data adapter for a dict of datasets, which is exactly what the error is telling you.
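One way around it is to build a single tf.data.Dataset that yields a dict of inputs together with the labels, for example by zipping your text and image datasets. The sketch below is only an illustration: it assumes both datasets were created from the same directory with the same seed, batch size and validation split so their batches stay aligned, and that the extra .shuffle(1000) applied to img_train_ds alone is removed (or replaced by shuffling the zipped dataset), since shuffling only one side breaks the pairing. The dict keys must match your Input names "text" and "image".

def merge_batches(text_batch, image_batch):
    # Each element of the zipped dataset is a pair of (features, labels) batches.
    text, text_labels = text_batch
    images, image_labels = image_batch
    # If the two datasets are aligned, the labels are identical; keep one copy as the target.
    return {"text": text, "image": images}, image_labels

train_ds = tf.data.Dataset.zip((text_train_ds, img_train_ds)).map(merge_batches)
val_ds = tf.data.Dataset.zip((text_val_ds, img_val_ds)).map(merge_batches)

model.fit(train_ds, validation_data=val_ds, epochs=epochs)

Alternatively, you can pull the raw arrays out of the datasets and call fit() with a dict of NumPy arrays as x plus a separate y, which is the form the dictionary input was designed for.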
