Single image evaluation for pytorch resnet model on opencv frame and png file

Time:12-03

I have a video in .mp4 format: eval.mp4. I also have a fine-tuned PyTorch ResNet with which I would like to run inference on single frames read from the video, or on single .png files saved to disk.

My fine-tuned network trains and validates successfully on .png files that I load from disk and pass through the training/validation transforms. During inference, however, I would rather not write every frame of eval.mp4 to disk as a .png just to infer on it. Instead, I would like to transform each captured frame directly into the format the network expects.

My dataset classes / dataloaders look like:

# create total dataset, no transforms
class MouseDataset(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.mouse_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.mouse_frame)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # img_name is root_dir file_name
        img_name = os.path.join(self.root_dir,
                                self.mouse_frame.iloc[idx, 0])
        image = Image.open(img_name)
        coordinates = self.mouse_frame.iloc[idx, 1:]
        coordinates = np.array([coordinates])

        if self.transform:
            image = self.transform(image)
        
        return (image, coordinates)

# break total dataset into subsets for different transforms
class DatasetSubset(Dataset):
    def __init__(self, dataset, transform=None):
        self.dataset = dataset
        self.transform = transform
    
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, index):
        
        # get image
        image = self.dataset[index][0]
        # transform for input into nn
        if self.transform:
            image = image.convert('RGB')
            image = self.transform(image)
            image = image.to(torch.float)
            #image = torch.unsqueeze(image, 0)
        
        # get coordinates
        coordinates = self.dataset[index][1]
        # transform for input into nn
        coordinates = coordinates.astype('float').reshape(-1, 2)
        coordinates = torch.from_numpy(coordinates)
        coordinates = coordinates.to(torch.float)

        return (image, coordinates)

# create training / val split
train_split = 0.8
train_count = int(train_split * len(total_dataset))
val_count = int(len(total_dataset) - train_count)
train_subset, val_subset = torch.utils.data.random_split(total_dataset, [train_count, val_count])

# create training / val datasets
train_dataset = DatasetSubset(train_subset, transform = data_transforms['train'])
val_dataset = DatasetSubset(val_subset, transform = data_transforms['val'])

# create train / val dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, num_workers=num_workers)
dataloaders_dict = {}
dataloaders_dict['train'] = train_dataloader
dataloaders_dict['val'] = val_dataloader

My training vs. validation transforms (which are identical for testing purposes):

# Data augmentation and normalization for training
# Just normalization for validation
# required dimensions of input image
input_image_width = 224
input_image_height = 224

# mean and std of RGB pixel intensities
# ImageNet mean [0.485, 0.456, 0.406]
# ImageNet standard deviation [0.229, 0.224, 0.225]
model_mean = [0.485, 0.456, 0.406]
model_std = [0.229, 0.224, 0.225]

data_transforms = {
    'train': transforms.Compose([
        transforms.Resize((input_image_height, input_image_width)),
        transforms.ToTensor(),
        transforms.Normalize(model_mean, model_std)
    ]),
    'val': transforms.Compose([
        transforms.Resize((input_image_height, input_image_width)),
        transforms.ToTensor(),
        transforms.Normalize(model_mean, model_std)
    ]),
}
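
As a rough illustration of what the ToTensor and Normalize steps above compute (the Resize step is left out), here is a NumPy-only sketch. The helper name to_tensor_and_normalize is hypothetical and is not part of torchvision; it assumes an H x W x 3 uint8 array in RGB order.

```python
import numpy as np

# Same per-channel statistics as model_mean / model_std above (RGB order).
MODEL_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
MODEL_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_tensor_and_normalize(rgb_image: np.ndarray) -> np.ndarray:
    """Hypothetical helper mimicking ToTensor + Normalize on an H x W x 3 uint8 RGB array."""
    scaled = rgb_image.astype(np.float32) / 255.0    # ToTensor: [0, 255] -> [0.0, 1.0]
    normalized = (scaled - MODEL_MEAN) / MODEL_STD   # Normalize: per-channel (x - mean) / std
    return normalized.transpose(2, 0, 1)             # H x W x C -> C x H x W


# Synthetic mid-gray image standing in for a resized frame.
image = np.full((4, 4, 3), 128, dtype=np.uint8)
out = to_tensor_and_normalize(image)
print(out.shape)  # (3, 4, 4)
```

Because the mean and std differ per channel, the result depends on which channel sits at which index, so the channel order of the input matters.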

What I've tried is to read each frame from an OpenCV VideoCapture object, convert it to a PIL image (following this answer), and then run inference, but the result is very different from reading the frame, saving it as a .png, and then inferring on the .png.

The code that I am testing:

# Standard imports
import cv2
import numpy as np
import torch
import torchvision
from torchvision import models, transforms
from PIL import Image

# load best model for evaluation
BEST_PATH = 'resnet152_best.pt'
model_ft = torch.load(BEST_PATH)
#print(model_ft)
model_ft.eval()

# Data augmentation and normalization for training
# Just normalization for validation
# required dimensions of input image
input_image_width = 224
input_image_height = 224

# mean and std of RGB pixel intensities
# ImageNet mean [0.485, 0.456, 0.406]
# ImageNet standard deviation [0.229, 0.224, 0.225]
model_mean = [0.485, 0.456, 0.406]
model_std = [0.229, 0.224, 0.225]

data_transforms = {
    'train': transforms.Compose([
        transforms.Resize((input_image_height, input_image_width)),
        transforms.ToTensor(),
        transforms.Normalize(model_mean, model_std)
    ]),
    'val': transforms.Compose([
        transforms.Resize((input_image_height, input_image_width)),
        transforms.ToTensor(),
        transforms.Normalize(model_mean, model_std)
    ]),
}

# Read image
cap = cv2.VideoCapture('eval.mp4')
total_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)  # property id 7
cap.set(cv2.CAP_PROP_POS_FRAMES, 6840)            # property id 1: jump to frame 6840
ret, frame = cap.read()
cv2.imwrite('eval_6840.png', frame)
png_file = 'eval_6840.png'

# eval png
png_image = Image.open(png_file)
png_image = png_image.convert('RGB')
png_image = data_transforms['val'](png_image)
png_image = png_image.to(torch.float)
png_image = torch.unsqueeze(png_image, 0)
print(png_image.shape)
output = model_ft(png_image)
print(output)

# eval frame
vid_image = Image.fromarray(frame)
vid_image = vid_image.convert('RGB')
vid_image = data_transforms['val'](vid_image)
vid_image = vid_image.to(torch.float)
vid_image = torch.unsqueeze(vid_image, 0)
print(vid_image.shape)
output = model_ft(vid_image)
print(output)

This returns:

torch.Size([1, 3, 224, 224])
tensor([[ 0.0229, -0.0990]], grad_fn=<AddmmBackward0>)
torch.Size([1, 3, 224, 224])
tensor([[ 0.0797, -0.2219]], grad_fn=<AddmmBackward0>)

My questions are:

(1) Why is the opencv frame evaluation different from the png file evaluation? All of the transformations appear to be identical (including the RGB conversion per the comments).

(2) How can I make the frame evaluation identical to the png evaluation given that both images are captured from the exact same segment of the video?
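
One way to narrow down question (1) might be to compare the raw pixel arrays before any transform is applied. The sketch below uses a hypothetical helper, compare_channel_order, and synthetic arrays; with the code above, the inputs would be np.array(png_image) (after the RGB conversion) and frame.

```python
import numpy as np


def compare_channel_order(png_array: np.ndarray, frame_array: np.ndarray) -> str:
    """Hypothetical helper: classify how two H x W x 3 arrays relate to each other."""
    if png_array.shape != frame_array.shape:
        return "shape mismatch"
    if np.array_equal(png_array, frame_array):
        return "identical"
    # Reversing the last axis swaps the first and third channels.
    if np.array_equal(png_array, frame_array[:, :, ::-1]):
        return "channels reversed"
    return "different pixels"


# Synthetic stand-ins for the decoded png and the captured frame.
png_like = np.arange(48, dtype=np.uint8).reshape(4, 4, 3)
frame_like = png_like[:, :, ::-1]
print(compare_channel_order(png_like, frame_like))  # channels reversed
```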

CodePudding user response:

Here's a nice fun fact about OpenCV: it works in BGR space, rather than RGB.

This might be the reason why you get different results when processing png images (read via PIL.Image) vs. processing video frames (read via OpenCV).
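
A minimal sketch of the effect, using NumPy only and a synthetic array in place of a real frame. The helper bgr_to_rgb is hypothetical; with OpenCV available, cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) does the same reordering.

```python
import numpy as np


def bgr_to_rgb(frame: np.ndarray) -> np.ndarray:
    """Hypothetical helper: reverse the channel axis of an H x W x 3 array (BGR -> RGB)."""
    # ascontiguousarray turns the reversed view into a regular, contiguous copy.
    return np.ascontiguousarray(frame[:, :, ::-1])


# Synthetic 2 x 2 "frame" that is pure blue in BGR ordering.
bgr_frame = np.zeros((2, 2, 3), dtype=np.uint8)
bgr_frame[:, :, 0] = 255  # channel 0 is blue in BGR

rgb_frame = bgr_to_rgb(bgr_frame)
print(bgr_frame[0, 0].tolist())  # [255, 0, 0] -> would be read as pure red if treated as RGB
print(rgb_frame[0, 0].tolist())  # [0, 0, 255] -> pure blue in RGB, as intended
```

If a BGR array is handed to Image.fromarray unchanged, PIL treats the first channel as red, so the per-channel normalization is applied to the wrong channels.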

CodePudding user response:

Posting this answer here in case it helps anyone.

The issue is that: png_image = Image.open(png_file) creates an object of this type: PIL.PngImagePlugin.PngImageFile.

A vidcapture frame, however, creates an object of type: numpy.ndarray. And the conversion step: vid_image = Image.fromarray(frame) creates an object of type: PIL.Image.Image

I tried converting PIL.Image.Image object to a PIL.PngImagePlugin.PngImageFile and vice versa to make them comparable, but it does not seem possible using the PIL method convert. Others seem to have had this issue as well.

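So the solution was to convert back and forth between numpy.ndarray and PIL image types, so that both inputs go through the same PIL-based transforms that torchvision relies on. Note that the frame path below also converts BGR to RGB with cv2.cvtColor; since PngImageFile is a subclass of PIL.Image.Image, that color conversion is likely the step that actually makes the two predictions match. Probably not the most efficient method, but the end result is identical input objects and model predictions.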

For reference:

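# Assumes the imports, model_ft and data_transforms from the question's code.
# `device` was not defined in the original snippet; here it is assumed to be the model's device.
device = next(model_ft.parameters()).device

# Read image
cap = cv2.VideoCapture('eval.mp4')
total_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)  # property id 7
cap.set(cv2.CAP_PROP_POS_FRAMES, 6840)            # property id 1: jump to frame 6840
ret, frame = cap.read()
cv2.imwrite('eval_6840.png', frame)
png_file = 'eval_6840.png'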

# eval png
png_image = Image.open(png_file)
png_array = np.array(png_image)
png_image = Image.fromarray(png_array)
png_image = data_transforms['val'](png_image)
png_image = png_image.to(torch.float)
png_image = torch.unsqueeze(png_image, 0)
png_image = png_image.to(device)
output = model_ft(png_image)
print(output)

# eval frame
vid_array = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
vid_image = Image.fromarray(vid_array)
vid_image = data_transforms['val'](vid_image)
vid_image = vid_image.to(torch.float)
vid_image = torch.unsqueeze(vid_image, 0)
vid_image = vid_image.to(device)
output = model_ft(vid_image)
print(output)

Yields:

tensor([[ 0.0229, -0.0990]], grad_fn=<AddmmBackward0>)
tensor([[ 0.0229, -0.0990]], grad_fn=<AddmmBackward0>)