Home > Software design >  Store Subtitles in a Database
Store Subtitles in a Database

Time:09-26

I'm working on a project that uses AI to recognise the speech of an audio file. The output of this AI is a huge JSON object with tons of values. I'll remove some keys, and the final structure will look as follows.

{
    text: "<recognised text>",
    language: "<detected language>"
    segments: [
        {startTimestamp: "00:00:00", endTimestamp: "00:00:10", text: "<some text>"},
        {startTimestamp: "00:00:10", endTimestamp: "00:00:17", text: "<some text>"},
        {startTimestamp: "00:00:17", endTimestamp: "00:00:26", text: "<some text>"},
        { ... },
        { ... }
    ]
}

Now, I wish to store this new trimmed object in a SQL database because I wish to be able to edit it manually. I'll create a React application to edit segments, delete segments, etc. Additionally, I want to add this feature to the React application, where the information will be saved every 5 seconds using an AJAX call.

Now, I don't understand how I should store this object in the SQL database. Initially, I thought I would store the whole object as a string in a database. Whenever some change is made to the object, I'll send a JSON object from the React application, the backend will sanitize it and then replace the old stringified object in the database with the new sanitised string object. This way updating and deletion will happen with ease but there can be issues in case of searching. But I'm wondering if there are any better approaches to do this.

Could someone guide me on this?

Tech Stack

  • Frontend - React
  • Backend - Django 3.2.15
  • Database - PostgreSQL

Thank you

CodePudding user response:

Now, I don't understand how I should store this object in the SQL database. Initially, I thought I would store the whole object as a string in a database.

If the data has a clear structure, you should not store it as a JSON blob in a relational database. While relational databases have some support for JSON nowadays, it is still not very effective, and normally it means you can not effectively filter, aggregate, and manipulate data, nor can you check referential integrity.

You can work with two models that look like:

from django.db import models
from django.db.models import F, Q


class Subtitle(models.Model):
    text = models.CharField(max_length=128)
    language = models.CharField(max_length=128)


class Segment(models.Model):
    startTimestamp = models.DurationField()
    endTimestamp = models.DurationField()
    subtitle = models.ForeignKey(
        Subtitle, on_delete=models.CASCADE, related_name='segments'
    )
    text = models.CharField(max_length=512)

    class Meta:
        ordering = ('subtitle', 'startTimestamp', 'endTimestamp')
        constraints = [
            models.CheckConstraint(
                check=Q(startTimestamp__gt=F('endTimestamp')),
                name='start_before_end',
            )
        ]

This will also guarantee that the startTimestamp is before the endTimestamp for example, that these fields store durations (and not "foo" for example).

You can convert from and to JSON with serializers [drf-doc]:

from rest_framework import serializers


class SegmentSerializer(serializers.ModelSerializer):
    class Meta:
        model = Segment
        fields = ['startTimestamp', 'endTimestamp', 'text']


class SubtitleSerializer(serializers.ModelSerializer):
    segments = SegmentSerializer(many=True)

    class Meta:
        model = Subtitle
        fields = ['text', 'language', 'segments']
  • Related