Home > Back-end >  Looping Through Dates in SQL? (Databricks)
Looping Through Dates in SQL? (Databricks)

Time:06-21

I am currently learning SQL and ran into a problem. Through my searches I have found that looping in SQL is a big no-no, so I was wondering if anyone could point me into the correct direction?

The dataframe looks like this:

Group ATP Date JTH Date
A 5/17/2022 6/17/2022
A 5/17/2022 Null
B 5/17/2022 Null
A 5/16/2022 6/16/2022
B 5/16/2022 6/16/2022
B 5/15/2022 6/17/2022
B 5/15/2022 Null
A 5/14/2022 6/1/2022
A 5/13/2022 Null
A 5/13/2022 6/1/2022
A 5/13/2022 6/5/2022

I am trying to make a query to pull this:

Date Group CountNo CountYes Ratio (No/Yes)
5/17/2022 A 1 1 1
5/17/2022 B 0 1 0
5/16/2022 A 1 0 Null
5/16/2022 B 2 1 2
5/14/2022 A 1 0 Null
5/13/2022 A 2 0 Null

This is what I currently created:

max(ATP_Date) as Date,
Group,
sum(
case
when ATP_Date < '2022-05-18'
and JTH_Date > '2022-05-18' then 1
else 0
END
) as CountNo,
sum(
case
when ATP_Date < '2022-05-18'
and JTH_Date IS null then 1
else 0
END
) as CountYes,
sum(
case
when ATP_date < '2022-05-18'
and JTH_Date > '2022-05-18' then 1
else 0
END
) / sum(
case
when ATP_Date < '2022-05-18'
and JTH_Date IS null then 1
else 0
END
) as ratio
from
dataframe

where group = "A"
GROUP BY group 

Which outputs this:

Date Group CountNo CountYes Ratio
5/17/2022 A 318 1064 0.3

This is what I want, but I need to do it for each date for ~ last 4 years, so it looks like the second table posted. I could manually edit the dates for each query, but that would take forever. This made me think of looping. I believe I would basically need to loop through the Select portion with dates, in order to get the output I want. If anyone has advice or could point in me in the correction direction, it would be greatly appreciated, thanks.

CodePudding user response:

I couldn't write the query as a comment, so posting it here. If it's not what you are expecting, let me know will delete this.

Assuming 2022-05-18 is constant (as per your sample)

Creating sample table

create or replace table dt_query 
(group string, atp_date date, jth_date date);
insert into dt_query values 
('A','2022-05-17','2022-06-17')
,('A','2022-05-17',NULL)
,('B','2022-05-17',NULL)
,('A','2022-05-16','2022-06-16')
,('B','2022-05-16','2022-06-16')
,('B','2022-05-16','2022-06-15')
,('B','2022-05-16',NULL)

Slightly modified your select statement

select
max(ATP_Date) as Date,
Group,
sum(
case
when ATP_Date < '2022-05-18'
and JTH_Date > '2022-05-18' then 1
else 0
END
) as CountNo,
sum(
case
when ATP_Date < '2022-05-18'
and JTH_Date IS null then 1
else 0
END
) as CountYes,
sum(
case
when ATP_date < '2022-05-18'
and JTH_Date > '2022-05-18' then 1
else 0
END
) / sum(
case
when ATP_Date < '2022-05-18'
and JTH_Date IS null then 1
else 0
END
) as ratio
from
dt_query
GROUP BY group,atp_date

The result matches with what you are expecting.

enter image description here

CodePudding user response:

To avoid using loops here you can include atp_date into the group by clause which will result in 1 row for each combination of that date plus the "grouping" column

It isn't clear why you compare to '2022-05-18' but this appears to be the day following the maximum date found in atp_date. So to avoid hardcoding, you could approach it by using a derived table of 1 row, cross joined to the data:

SELECT 
      ATP_Date
    , grouping
    , sum(CASE 
            WHEN ATP_Date < cj.max_dt AND JTH_Date > cj.max_dt
                THEN 1
            ELSE 0
            END) AS CountNo
    , sum(CASE 
            WHEN ATP_Date < cj.max_dt AND JTH_Date IS NULL
                THEN 1
            ELSE 0
            END) AS CountYes
    , sum(CASE 
            WHEN ATP_date < cj.max_dt AND JTH_Date > cj.max_dt
                THEN 1
            ELSE 0
            END)  
    / sum(CASE 
            WHEN ATP_Date < cj.max_dt AND JTH_Date IS NULL
                THEN 1
            ELSE NULL
            END) AS ratio
FROM dt_query
CROSS JOIN (select max(atp_date)   interval '1 day' max_dt from dt_query) AS cj
GROUP BY
      grouping
    , atp_date
ORDER BY
      atp_date DESC
    , grouping
atp_date   | grouping | countno | countyes | ratio
:--------- | :------- | ------: | -------: | ----:
2022-05-17 | A        |       1 |        1 |     1
2022-05-17 | B        |       0 |        1 |     0
2022-05-16 | A        |       1 |        0 |  null
2022-05-16 | B        |       2 |        1 |     2
2022-05-14 | A        |       1 |        0 |  null
2022-05-13 | A        |       2 |        1 |     2

db<>fiddle here

nb: to avoid issues with the term "group" I have used the column name "grouping" instead, and the example sql is written in postgres so there may be some syntax that needs alteration (e.g. the addition of 1 day). Also note that the ratio calcuation can result in a divide by zero error so instead of zero I used NULL.

  • Related