I'm trying to find duplicate rows in a large database (300,000 records). Here's an example of how it looks:
| id | title | thedate |
|----|---------|------------|
| 1 | Title 1 | 2021-01-01 |
| 2 | Title 2 | 2020-12-24 |
| 3 | Title 3 | 2021-02-14 |
| 4 | Title 2 | 2021-05-01 |
| 5 | Title 1 | 2021-01-13 |
I found this excellent (i.e. fast) answer here: Find duplicate rows with PostgreSQL
-- adapted from @MatthewJ answering in https://stackoverflow.com/questions/14471179/find-duplicate-rows-with-postgresql/14471928#14471928
select * from (
SELECT id, title, TO_DATE(thedate,'YYYY-MM-DD'),
ROW_NUMBER() OVER(PARTITION BY title ORDER BY id asc) AS Row
FROM table1
) dups
where
dups.Row > 1
Which I'm trying to use as a base to solve my specific problem: I need to find duplicates according to column values like in the example, but only for records posted within 15 days of each other (the date of record insertion in the column "thedate" in my DB).
I reproduced it in this fiddle http://sqlfiddle.com/#!15/ae109/2, where id 5 (same title as id 1, and posted within 15 days of each other) should be the only acceptable answer.
How would I implement that condition in the query?
CodePudding user response:
With the LAG
function you can get the date from the previous row with the same title and then filter based on the time difference.
WITH with_prev AS (
SELECT
*,
LAG(thedate, 1) OVER (PARTITION BY title ORDER BY thedate) AS prev_date
FROM table1
)
SELECT id, title, thedate
FROM with_prev
WHERE thedate::timestamp - prev_date::timestamp < INTERVAL '15 days'
CodePudding user response:
You don't necessarily need window funtions for this, you an use a plain old self-join, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate < n.thedate
where n.thedate::date - p.thedate::date < 15
http://sqlfiddle.com/#!15/a3a73a/7
This has the advantage that it might use some of your indexes on the table, and also, you can decide if you want to use the data (i.e. the ID) of the previous row or the next row from each pair.
If your date column however is not unique, you'll need to be a little more specific in your join condition, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate <= n.thedate and p.id <> n.id
where n.thedate::date - p.thedate::date < 15