I used the FLAVA model example code from this link:
https://huggingface.co/docs/transformers/model_doc/flava#transformers.FlavaModel.forward.example
But I am getting the following error:
'FlavaModelOutput' object has no attribute 'contrastive_logits_per_image'
I tried using the FlavaForPreTraining model instead, so the updated code was:
from PIL import Image
import requests
from transformers import FlavaProcessor, FlavaForPreTraining

model = FlavaForPreTraining.from_pretrained("facebook/flava-full")
processor = FlavaProcessor.from_pretrained("facebook/flava-full")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat"],
    images=image,
    return_tensors="pt",
    padding=True,
    return_codebook_pixels=True,
)
inputs.update(
    {
        "input_ids_masked": inputs.input_ids,
    }
)

outputs = model(**inputs)
logits_per_image = outputs.contrastive_logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
but I'm still getting this error:
/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py:714: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
"The `device` argument is deprecated and will be removed in v5 of Transformers.", FutureWarning
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-44-bdb428b8184a> in <module>()
----> 1 outputs = model(**inputs)
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1129 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130 return forward_call(*input, **kwargs)
1131 # Do not call functions when jit is used
1132 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.7/dist-packages/transformers/models/flava/modeling_flava.py in forward(self, input_ids, input_ids_masked, pixel_values, codebook_pixel_values, attention_mask, token_type_ids, bool_masked_pos, position_ids, image_attention_mask, skip_unmasked_multimodal_encoder, mlm_labels, mim_labels, itm_labels, output_attentions, output_hidden_states, return_dict, return_loss)
1968 if mim_labels is not None:
1969 mim_labels = self._resize_to_2d(mim_labels)
-> 1970 bool_masked_pos = self._resize_to_2d(bool_masked_pos)
1971 mim_labels[bool_masked_pos.ne(True)] = self.ce_ignore_index
1972
/usr/local/lib/python3.7/dist-packages/transformers/models/flava/modeling_flava.py in _resize_to_2d(self, x)
1765
1766 def _resize_to_2d(self, x: torch.Tensor):
-> 1767 if x.dim() > 2:
1768 x = x.view(x.size(0), -1)
1769 return x
AttributeError: 'NoneType' object has no attribute 'dim'
Can anyone suggest what's going wrong?
CodePudding user response:
FLAVA's author here.
Can you please add the following arguments to your processor call:
return_codebook_pixels=True, return_image_mask=True
With return_image_mask=True, the processor also returns bool_masked_pos, which is the tensor that shows up as None in your traceback.
Here is an example colab if you want to see how to call the FLAVA model: https://colab.research.google.com/drive/1c3l4r4cEA5oXfq9uXhrJibddwRkcBxzP?usp=sharing#scrollTo=xtkrSjfhCdv-