How to get distinct values when year, month, day of date are stored in different columns-CodePudding

I am using AWS Athena to query count of distinct values for a column for last 7 days.

The query is invoked by a lambda function which is invoked on every Sunday of the month and pulls data from last Sunday to this Saturday.

So, for example, If today is 11th September 2022, Sunday, then the lambda will try to query the table from 4th Sep '22, Sunday till 10th Sep '22, Saturday and the query looks like this.

SELECT
    col1,
    col2,
    COUNT(DISTINCT col3) AS distinctValues
FROM "dbName"."tbl"
WHERE year = '2022'
    AND month = '09'
    AND day IN ('04','05','06','07','08','09','10' )
GROUP BY 
    col1,
    col2;

year, month and day are different columns and therefore we have IN clause for day column.

Now the issue is, if the query has to be run on 4th September 2022, then two months have to be considered. The query has to be run to get data from 28th Aug '22, Sunday to 3rd Sep '22, Saturday.

I cannot run this query to get the data as it will not contain correct count of distinct values.

SELECT
    col1,
    col2,
    COUNT(DISTINCT col3) AS distinctValues
FROM "dbName"."tbl"
WHERE year = '2022'
    AND month IN ('08','09')
    AND day IN ('28','29','30','31','01','02','03' )
GROUP BY 
    col1,
    col2;

And I can also not process the results from two separate queries for Aug and Sep months because distinct values will not be correct.

What can be done here to get distinct values for date ranges spread between two months and considering the table schema that year, month and day are stored in different columns?

CodePudding user response：

So you should compare month and days to gether and for other month the ondition must be repeated separately:

SELECT
    col1,
    col2,
    COUNT(DISTINCT col3) AS distinctValues
FROM "dbName"."tbl"
WHERE year = '2022'
    AND (month = '08' AND day IN ('X', 'Y' /*prefered days*/)
        OR  (month = '09' AND day IN ('A', 'B')))
GROUP BY 
    col1,
    col2;

CodePudding user response：

Better you pass the StartDate and EndDate as a Date field and Combine year, month and day columns of table together to form a computed DATE using date_parse and then compare the range.

SELECT
    col1,
    col2,
    COUNT(DISTINCT col3) AS distinctValues
FROM "dbName"."tbl"
WHERE 
    date_parse(cast(year * 10000   month * 100   day as varchar(255)), '%Y%m%d') BETWEEN @StartDate and @EndDate
GROUP BY 
    col1,
    col2;

Please use date_parse method in AWS Athena to get date from Year, Month and Date. Please refer Create date from integers in separate fields in athena aws

CodePudding user response：

I would recommend converting the data to date and processing it as one (using date_diff and between for example). Possibly shortest (in terms of code) way would be using array_join (if every part is in correct format). Something along this lines:

SELECT col1,
       col2,
       COUNT(DISTINCT col3) AS distinctValues
FROM "dbName"."tbl"
WHERE date_diff('day', date(array_join(array[year, month, date], '-')), now())
    between 1 and 7
GROUP BY col1,
         col2;

Though if those fields are used to partition data possibly you will need to go with another approach for better performance.