What is the best way to handle DJANGO migration data with over 500k records for MYSQL


  • The first migration creates two new fields, action_duplicate and status_duplicate
  • The second migration copies the data from the action and status fields into the two newly created fields:
def remove_foreign_keys_from_user_request(apps, schema_editor):
    UserRequest = apps.get_model("users", "UserRequest")

    for request_initiated in UserRequest.objects.all().select_related("action", "status"):
        request_initiated.action_duplicate = request_initiated.action.name
        request_initiated.status_duplicate = request_initiated.status.name
        request_initiated.save()
  • The third migration is supposed to remove/delete the old action and status fields
  • The fourth migration should rename the new duplicate fields to the old, deleted field names

The point of all this is to remove the dependency on the action and status tables and avoid unnecessary database queries, since the status in particular will only ever be pending or completed.
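The four steps above can be sketched as Django migration operations. This is only an outline, assuming the app is labelled users; the defaults and grouping into files are illustrative, and remove_foreign_keys_from_user_request is the data-copying function shown above:

```python
# Illustrative sketch of the four migrations, condensed to their key operations.
# App label, model, and field names follow the question; defaults are assumptions.
from django.db import migrations, models

# Migration 1: add the duplicate columns
step_1 = [
    migrations.AddField('userrequest', 'action_duplicate',
                        models.CharField(max_length=50, default='ACTION_A')),
    migrations.AddField('userrequest', 'status_duplicate',
                        models.CharField(max_length=50, default='PENDING')),
]

# Migration 2: copy the data (remove_foreign_keys_from_user_request as defined
# in the question; noop makes the migration reversible without undoing the copy)
step_2 = [
    migrations.RunPython(remove_foreign_keys_from_user_request,
                         migrations.RunPython.noop),
]

# Migration 3: drop the old foreign keys
step_3 = [
    migrations.RemoveField('userrequest', 'action'),
    migrations.RemoveField('userrequest', 'status'),
]

# Migration 4: rename the duplicates back to the original names
step_4 = [
    migrations.RenameField('userrequest', 'action_duplicate', 'action'),
    migrations.RenameField('userrequest', 'status_duplicate', 'status'),
]
```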

My question is about the second migration. There are between 300k and 600k records, so I need a more efficient way to do this that doesn't consume all the available memory. Note: the database is MySQL.
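For context, if the per-row loop is kept, the usual memory-savers in Django are queryset.iterator(chunk_size=...) combined with bulk_update() in batches, so that only one batch of model instances is alive at a time. The batching itself is plain Python and can be sketched independently of Django (the Django usage in the comment is an illustration, not tested here):

```python
from itertools import islice

def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable,
    without materialising the whole iterable in memory."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical Django usage (sketch, assuming the models from the question):
#
# qs = UserRequest.objects.select_related("action", "status").iterator(chunk_size=2000)
# for batch in chunked(qs, 2000):
#     for req in batch:
#         req.action_duplicate = req.action.name
#         req.status_duplicate = req.status.name
#     UserRequest.objects.bulk_update(batch, ["action_duplicate", "status_duplicate"])
```

This replaces one save() (one UPDATE) per row with one bulk UPDATE per batch, and never holds more than one batch of instances in memory.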

A trimmed-down version of the UserRequest model

class UserRequest(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    reference = models.CharField(max_length=50, null=True, blank=True)
    requester = models.ForeignKey(User, blank=True, null=True, on_delete=models.CASCADE)
    action = models.ForeignKey(Action, on_delete=models.CASCADE)
    action_duplicate = models.CharField(
        max_length=50, choices=((ACTION_A, ACTION_A), (ACTION_B, ACTION_B)), default=ACTION_A
    )
    status = models.ForeignKey(ProcessingStatus, on_delete=models.CASCADE)
    status_duplicate = models.CharField(
        max_length=50,
        choices=((PENDING, PENDING), (PROCESSED, PROCESSED)),
        default=PENDING,
    )

CodePudding user response:

You can work with a Subquery expression [Django-doc] and do the update in bulk; this runs as a single UPDATE statement on the database side, so no rows are loaded into Python memory:

from django.db.models import OuterRef, Subquery

def remove_foreign_keys_from_user_request(apps, schema_editor):
    UserRequest = apps.get_model('users', 'UserRequest')
    Action = apps.get_model('users', 'Action')
    Status = apps.get_model('users', 'ProcessingStatus')
    UserRequest.objects.update(
        action_duplicate=Subquery(
            Action.objects.filter(
                pk=OuterRef('action_id')
            ).values('name')[:1]
        ),
        status_duplicate=Subquery(
            Status.objects.filter(
                pk=OuterRef('status_id')
            ).values('name')[:1]
        )
    )
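For completeness, the data migration wrapping this function might look like the sketch below. The dependency name is a placeholder, not a real migration from the question, and remove_foreign_keys_from_user_request is the function defined above:

```python
# Sketch of the data-migration file wrapping the bulk update above.
from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [
        ('users', '0002_add_duplicate_fields'),  # hypothetical previous migration
    ]

    operations = [
        migrations.RunPython(
            remove_foreign_keys_from_user_request,
            reverse_code=migrations.RunPython.noop,  # nothing to undo on reverse
        ),
    ]
```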

That being said, it looks like what you are doing is actually the opposite of database normalization [wiki]: usually, when data is duplicated, you introduce an extra model with one Action/Status row per value, which prevents storing the same value in action_duplicate/status_duplicate many times. Denormalizing like this makes the database larger and harder to maintain.


Note: normally a Django model is given a singular name, so UserRequest instead of UserRequests.
