How to deal with lots of duplicated data in DB?-CodePudding

I'm currently building a database for a photography project and I'm having troube to decide this:

There are about 3 million photographs and I'd like this to keep working even with 15 million. They are stored on disk and with paths in the DB. All of those need a caption, but captions are highly duplicated. On Average about 10 photos have exactly the same caption (same creator, same title, same date).

Is it better to create one model?

class Photo(models.Model):
  file = models.FileField()
  headline = models.CharField()
  caption = models.TextField()
  date = models.DateField()
  created_by = models.CharField()

Or should I make 2 models even if it means having to create copies manually when one photo out of a group gets a different caption afterwards?

class Photo(models.Model):
  file = models.FileField()
  metadata = models.ForeignKey('Metadata')

  def set_metadata(self, metadata):
    self.metadata = Metadata.models.get_or_create(metadata) 

class Metadata(models.Model):
  headline = models.CharField()
  caption = models.TextField()
  date = models.DateField()
  created_by = models.CharField()

The most common task will be to search for pictures based on their metadata. Is creating an extra model and reducing the db table by factor 10 worth it? Or does it just introduce unnessesary complications with no benefits in performance?

Thank you for your help!

CodePudding user response：

This is in line with Database normalization(separate model) vs denormalization(redundant data in the same model). The decision to choose either is entirely dependent on the business use case. For instance, some questions that might help in choosing the appropriate option:

Is it expected for the average photo count against a caption(On Average about 10 photos have exactly the same caption) to increase in the future?
If Caption/Metadata object is related to multiple photos, changing or updating metadata will be reflected across all the photos. Is this expected behavior for the business?
Would the captions be shown for all the pictures?
Does every picture need a caption?

You can have a quick read on denormalisation on https://www.geeksforgeeks.org/denormalization-in-databases/ to understand its pros and cons. Given your use-case, having 2 models is a better option. Though there will be additional technical efforts to make sure Picture-Caption relation is maintained accurately, the split model design will be scalable and can help in easy analytics in the future.