I need to execute a SQL query that deletes the duplicated rows based on one column and keep the last record. Noting that it's a large table so Django ORM takes very long time so I need SQL query instead. the column name is customer_number and table name is pages_dataupload. I'm using sqlite.
Update: I tried this but it gives me no such column: row_num
cursor = connection.cursor()
cursor.execute(
'''WITH cte AS (
SELECT
id,
customer_number ,
ROW_NUMBER() OVER (
PARTITION BY
id,
customer_number
ORDER BY
id,
customer_number
) row_num
FROM
pages.dataupload
)
DELETE FROM pages_dataupload
WHERE row_num > 1;
'''
)
CodePudding user response:
You can work with an Exists
subquery [Django-doc] to determine efficiently if there is a younger DataUpload
:
from django.db.models import Exists, OuterRef
DataUpload.objects.filter(Exists(
DataUpload.objects.filter(
pk__gt=OuterRef('pk'), customer_number=OuterRef('customer_number')
)
)).delete()
This will thus check for each DataUpload
if there exists a DataUpload
with a larger primary key that has the same customer_number
. If that is the case, we will remove that DataUpload
.
CodePudding user response:
I have solved the problem with the below query, is there any way to reset the id field after removing the duplicate?
cursor = connection.cursor()
cursor.execute(
'''
DELETE FROM pages_dataupload WHERE id not in (
SELECT Max(id) FROM pages_dataupload Group By Dial
)
'''
)