Home > Software engineering >  Finding duplicate rows where one column is different, based off a time period
Finding duplicate rows where one column is different, based off a time period

Time:12-17

I have a table of customers, that lists when we sent their due invoice and if they paid or not. If a customer did pay the invoice, it's "1" and if they didn't it's "0". I want to find all records where a customer paid in one year, but not the next. So for example, select all customers who paid between 1/1/2020 and 12/30/2020 but didn't pay between 1/1/2021 and 12/30/2021 (order = 1 in 2020 but order = 0 in 2021) based off the website.

ORDERS

WEBSITE      PAID       SENTDATE
one.com        1        1/1/2020
two.com        1        1/5/2020
one.com        0        1/1/2021
four.com       1        2/1/2021
two.com        1        1/5/2021
five.com       1        10/15/2021

From this list, only "one.com" would be selected, because two.com paid both years, four and five only have one transaction in only one year.

I can't wrap my head around how to do this, if I have to inner join and then where not exists? Usually I can figure this out but I'm stumped.

CodePudding user response:

If you just want to know the websites that fit the criteria, this will work. You could possibly do a YEAR() function on sentdate, but then you'd have to do a table scan rather than using a possible index on SentDate.

SELECT Website FROM myTable WHERE Paid=1 AND SentDate BETWEEN '2020-01-01' AND '2020-12-31' AND Website IN (SELECT Website FROM myTable WHERE Paid=0 AND SentDate BETWEEN '2021-01-01' AND '2021-12-31')

If you want to know the dates the invoices were sent, you can do an inner join

SELECT t1.Website, t1.SentDate AS PaidOn, t2.SentDate AS NotPaidOn
FROM myTable t1
INNER JOIN myTable t2 ON t1.Website = t2.Website
WHERE t1.Paid = 1 AND t1.SentDate BETWEEN '2020-01-01' AND '2020-12-31'
AND t2.Paid = 0 AND t2.SentDate '2021-01-01' AND '2021-12-31'

CodePudding user response:

The EXCEPT operator is one way of accomplishing this, along with a bit if date (year) shifting:

DECLARE @Orders TABLE (Website VARCHAR(50), Paid BIT, SentDate DATETIME)
INSERT @Orders
VALUES
    ('one.com', 1, '1/1/2020'),
    ('two.com', 1, '1/5/2020'),
    ('one.com', 0, '1/1/2021'),
    ('four.com', 1, '2/1/2021'),
    ('two.com', 1, '1/5/2021')

SELECT DISTINCT Year = YEAR(SentDate), WebSite FROM @Orders WHERE Paid = 1
    AND YEAR(SentDate) < (SELECT YEAR(MAX(SentDate)) FROM @Orders)
EXCEPT
SELECT DISTINCT Year = YEAR(SentDate) - 1, WebSite FROM @Orders WHERE Paid = 1

It pretty much reads: Show me everyone who paid in year Y, but exclude ("subtract") anyone who paid in the following year. Since everyone who paid in the latest year hasn't yet had a chance to pay in the future year, I added an extra condition to exclude those cases.

*** EDIT ***

After rereading, I realized I missed the explicit not-paid condition. In that case, logic similar to that posed by LeeG would do the trick.

SELECT Year = YEAR(O1.SentDate), O1.WebSite
FROM @Orders O1
JOIN @Orders O2
    ON O2.Website = O1.Website
    AND O2.Paid = 0
    AND YEAR(O2.SentDate) = YEAR(O1.SentDate)   1
WHERE O1.Paid = 1
ORDER BY YEAR(O1.SentDate) DESC, O1.WebSite

CodePudding user response:

You could also use the LEAD function, maybe like this (which handles websites that paid in one year and failed to pay in the next year in which they were invoiced, even if that is not the successive year.

WITH TwoYearView AS (
  SELECT
    WebSite,
    YEAR(SentDate) as Year,
    LEAD(YEAR(SentDate),1) OVER (PARTITION BY Website ORDER BY SentDate) as NextInvoicedYear,
    Paid,
    LEAD(Paid, 1) OVER (PARTITION BY Website ORDER BY SentDate) AS PaidNextYear
  FROM Orders
)
SELECT 
  WebSite,
  Year AS YearPaid,
  NextInvoicedYear 
FROM TwoYearView
WHERE Paid = 1 AND PaidNextYear = 0
  • Related