I have trained a LayoutLMv2 model from Hugging Face, and when I try to run inference on a single image it raises a runtime error. The code for this is below:
query = '/Users/vaihabsaxena/Desktop/Newfolder/labeled/Others/Two.pdf26.png'
image = Image.open(query).convert("RGB")

# OCR + tokenize the page image and move the tensors to the model's device
encoded_inputs = processor(image, return_tensors="pt").to(device)
outputs = model(**encoded_inputs)

# turn the logits into per-label probabilities
preds = torch.softmax(outputs.logits, dim=1).tolist()[0]
pred_labels = {label: pred for label, pred in zip(label2idx.keys(), preds)}
pred_labels
The error occurs when I call model(**encoded_inputs). The processor comes directly from Hugging Face and is initialized, together with the other APIs, as follows:
feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)
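For context, what the processor returns for a page can be inspected like this (a small sketch reusing image and processor from above; the key names are those documented for LayoutLMv2Processor):

# The feature extractor runs OCR (apply_ocr=True by default), so the length
# of input_ids depends on how much text is detected in the image.
encoded_inputs = processor(image, return_tensors="pt")
print(encoded_inputs.keys())              # input_ids, token_type_ids, attention_mask, bbox, image
print(encoded_inputs["input_ids"].shape)  # [1, sequence_length]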
The model is defined and trained as follows:
model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=len(label2idx)
)
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3

for epoch in range(num_epochs):
    print("Epoch:", epoch)
    training_loss = 0.0
    training_correct = 0
    # put the model in training mode
    model.train()
    for batch in tqdm(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        training_loss = loss.item()
        predictions = outputs.logits.argmax(-1)
        training_correct = (predictions == batch['labels']).float().sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    print("Training Loss:", training_loss / batch["input_ids"].shape[0])
    training_accuracy = 100 * training_correct / len(train_data)
    print("Training accuracy:", training_accuracy.item())

    validation_loss = 0.0
    validation_correct = 0
    for batch in tqdm(valid_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        validation_loss = loss.item()
        predictions = outputs.logits.argmax(-1)
        validation_correct = (predictions == batch['labels']).float().sum()

    print("Validation Loss:", validation_loss / batch["input_ids"].shape[0])
    validation_accuracy = 100 * validation_correct / len(valid_data)
    print("Validation accuracy:", validation_accuracy.item())
The complete error trace:
RuntimeError Traceback (most recent call last)
/Users/vaihabsaxena/Desktop/Newfolder/pytorch.ipynb Cell 37 in <cell line: 4>()
2 image = Image.open(query).convert("RGB")
3 encoded_inputs = processor(image, return_tensors="pt").to(device)
----> 4 outputs = model(**encoded_inputs)
5 preds = torch.softmax(outputs.logits, dim=1).tolist()[0]
6 pred_labels = {label:pred for label, pred in zip(label2idx.keys(), preds)}
File ~/opt/anaconda3/envs/env_pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
1126 # If we don't have any hooks, we want to skip the rest of the logic in
1127 # this function, and just call forward.
1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1129 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130 return forward_call(*input, **kwargs)
1131 # Do not call functions when jit is used
1132 full_backward_hooks, non_full_backward_hooks = [], []
File ~/opt/anaconda3/envs/env_pytorch/lib/python3.9/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py:1071, in LayoutLMv2ForSequenceClassification.forward(self, input_ids, bbox, image, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
1061 visual_position_ids = torch.arange(0, visual_shape[1], dtype=torch.long, device=device).repeat(
1062 input_shape[0], 1
1063 )
1065 initial_image_embeddings = self.layoutlmv2._calc_img_embeddings(
1066 image=image,
1067 bbox=visual_bbox,
...
896 input_shape[0], 1
897 )
898 final_position_ids = torch.cat([position_ids, visual_position_ids], dim=1)
RuntimeError: The expanded size of the tensor (1011) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 1011]. Tensor sizes: [1, 512]
I have tried configuring the tokenizer to cut the input off at its maximum length, but then encoded_inputs ends up being NoneType, even though the image is still there. What is going wrong here?
CodePudding user response:
The error message tells you that the text extracted via OCR is longer (1011 tokens) than the underlying text model can handle (512 tokens). Depending on your task, you may be able to simply truncate the text by passing the tokenizer parameter truncation=True (together with a max_length) through the processor.
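A minimal sketch of what that could look like for the inference snippet above, assuming the same processor, model, device and label2idx from the question; truncation and max_length are standard tokenizer arguments that LayoutLMv2Processor forwards to its tokenizer:

# Re-encode the image, cutting the OCR'd text down to the 512-token limit
encoded_inputs = processor(
    image,
    truncation=True,
    max_length=512,
    return_tensors="pt",
).to(device)

outputs = model(**encoded_inputs)  # no size mismatch: input_ids is now at most 512 tokens long
preds = torch.softmax(outputs.logits, dim=1).tolist()[0]
pred_labels = {label: pred for label, pred in zip(label2idx.keys(), preds)}

Note that everything past the first 512 tokens of OCR text is simply dropped, so this only works well if the text that distinguishes your classes tends to appear early on the page; otherwise you would need some form of chunking or a sliding-window approach.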