How to measure performance of image-conditioned GANs?


I'm trying to implement a GAN based on the pix2pix approach for an image-to-image translation task. I have started to obtain some results with my minor improvements, but now I wonder how to correctly compare my model with existing ones quantitatively.

While metrics such as the Frechet Inception Distance (FID) or the Inception Score are widely used for comparing unconditional GANs, they fail to capture the constraints and objectives imposed by the input image. What would be a good choice of metric here? At the moment I track SSIM and PSNR during training, but I'm aware that both correlate imperfectly with human perception, especially on GAN-generated results. I know there are metrics such as the Learned Perceptual Image Patch Similarity (LPIPS), but I don't see them widely used.
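For reference, this is roughly how I compute PSNR and SSIM on the validation pairs (a minimal sketch with scikit-image; the array names are placeholders, and older scikit-image versions use `multichannel=True` instead of `channel_axis`):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(fake: np.ndarray, real: np.ndarray) -> tuple[float, float]:
    """Score one generated image against its paired ground truth.

    Both arrays are expected as H x W x C floats in [0, 1].
    """
    psnr = peak_signal_noise_ratio(real, fake, data_range=1.0)
    ssim = structural_similarity(real, fake, data_range=1.0, channel_axis=-1)
    return psnr, ssim

# averaged over the validation set after every epoch, e.g.:
# scores = [evaluate_pair(g, t) for g, t in zip(generated_images, target_images)]
```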

What metric should I use to compare my model with others?

CodePudding user response:

You need a metric for a paired image-to-image translation task, so the 'distribution-based' metrics such as FID or the Inception Score may not be relevant. Full-reference metrics are the type that could be more useful: https://en.wikipedia.org/wiki/Image_quality#Objective_methods
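To make the distinction concrete, here is a toy sketch (invented names, not a real FID implementation): a full-reference metric scores each output against the ground truth for its own input, while a distribution-based metric only compares the two sets as a whole.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a - b) ** 2))

def full_reference_score(outputs, targets):
    """Paired: each output is compared with the target for the same input image."""
    return float(np.mean([mse(o, t) for o, t in zip(outputs, targets)]))

def distribution_score(outputs, targets):
    """Unpaired, FID-style idea (crude stand-in, not actual FID): only summary
    statistics of the two sets are compared, so per-image correspondence is ignored."""
    return mse(np.mean(outputs, axis=0), np.mean(targets, axis=0))
```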

The choice of metric really depends on the domain you are training on. Creating a metric that takes human perception into account is a difficult problem; it is still an open problem, and there is no single 'best' metric that works perfectly on all types of data.

LPIPS is a well-researched metric and can be treated as a 'standard' metric alongside PSNR and SSIM. The paper in which LPIPS was introduced, "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric", has around 3,000 citations, and LPIPS is widely used in image synthesis research papers. It is a good starting point if you are working with natural images.
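If you want to try it, here is a minimal sketch using the `lpips` package published by the paper's authors (check its README for the exact conventions of your version):

```python
import torch
import lpips

# AlexNet backbone; the authors suggest it when using LPIPS as a perceptual distance
loss_fn = lpips.LPIPS(net='alex')

# images as N x 3 x H x W tensors scaled to [-1, 1]
fake = torch.rand(1, 3, 256, 256) * 2 - 1
real = torch.rand(1, 3, 256, 256) * 2 - 1

distance = loss_fn(fake, real)  # lower = perceptually closer
print(distance.item())
```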

You may also check some other popular metrics here: https://github.com/photosynthesis-team/piq
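As a rough sketch of how piq exposes these metrics (double-check the repository's README for the current API):

```python
import torch
import piq

# batches of generated and ground-truth images, N x C x H x W in [0, 1]
fake = torch.rand(4, 3, 256, 256)
real = torch.rand(4, 3, 256, 256)

print(piq.psnr(fake, real, data_range=1.0))
print(piq.ssim(fake, real, data_range=1.0))
print(piq.LPIPS()(fake, real))  # class-based metrics are callable modules
```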
