Django Delete duplicates rows and keep the last using SQL query-CodePudding

I need to execute a SQL query that deletes the duplicated rows based on one column and keep the last record. Noting that it's a large table so Django ORM takes very long time so I need SQL query instead. the column name is customer_number and table name is pages_dataupload. I'm using sqlite.

Update: I tried this but it gives me no such column: row_num

    cursor = connection.cursor()
    cursor.execute(

    '''WITH cte AS (
        SELECT 
            id, 
            customer_number , 
            ROW_NUMBER() OVER (
                PARTITION BY 
                    id,
                    customer_number 

                ORDER BY 
                    id,
                    customer_number 
            ) row_num
        FROM 
            pages.dataupload
    )
    DELETE FROM pages_dataupload
    WHERE row_num > 1;

    '''
    )

CodePudding user response：

You can work with an Exists subquery ^[Django-doc] to determine efficiently if there is a younger DataUpload:

from django.db.models import Exists, OuterRef

DataUpload.objects.filter(Exists(
    DataUpload.objects.filter(
        pk__gt=OuterRef('pk'), customer_number=OuterRef('customer_number')
    )
)).delete()

This will thus check for each DataUpload if there exists a DataUpload with a larger primary key that has the same customer_number. If that is the case, we will remove that DataUpload.

CodePudding user response：

I have solved the problem with the below query, is there any way to reset the id field after removing the duplicate?

    cursor = connection.cursor()
    cursor.execute(

    ''' 
        DELETE FROM pages_dataupload WHERE id not in (
            SELECT Max(id) FROM pages_dataupload Group By Dial
            )


    '''
    )