How to compare and show if any changes in historical data in AWS Athena?-CodePudding

I have a table where historical data is stored. The sample structure of the table is like,

id	config	userid	team	createdon
1	"config-01"	"user-01"	"Dev"	"2022:08:28"
2	"config-02"	"user-01"	"Dev"	"2022:08:27"
3	"config-01"	"user-01"	"Dev"	"2022:08:26"
4	"config-01"	"user-01"	"QA"	"2022:08:25"
5	"config-01"	"user-01"	"QA"	"2022:08:24"
6	"config-20"	"user-01"	"QA"	"2022:08:23"
7	"config-01"	"user-01"	"QA"	"2022:08:22"
8	"config-01"	"user-01"	"Prod"	"2022:08:21"
9	"config-01"	"user-01"	"Prod"	"2022:08:20"
10	"config-11"	"user-02"	"Prod"	"2022:08:19"
11	"config-15"	"user-02"	"Prod"	"2022:08:18"

Now, I need to query data based on user-id, so first 9 rows will be taken for user-id "user-01". From these 9 rows I need to take latest two rows(based on createdon) of each team. So the table as of now looks like this,

id	config	userid	team	createdon
1	"config-01"	"user-01"	"Dev"	"2022:08:28"
2	"config-02"	"user-01"	"Dev"	"2022:08:27"
4	"config-01"	"user-01"	"QA"	"2022:08:28"
5	"config-01"	"user-01"	"QA"	"2022:08:27"
8	"config-01"	"user-01"	"Prod"	"2022:08:28"
9	"config-01"	"user-01"	"Prod"	"2022:08:27"

Now I need to compare two rows of each team against their config column and say if there is a change between them. So the final response I am looking for will be,

userid	team	configchanged
"user-01"	"Dev"	TRUE
"user-01"	"QA"	FALSE
"user-01"	"Prod"	FALSE

As of now I was able to reach at a point where I can see rows with same team together using this self join query,

select * from table t1, table t2
where t1.id <> t2.id and t1.team = t2.team and t1.userid = "user-01"

I need help on rest of the part.

Any help on generating a query to achieve this solution would be really helpful. Thank You

CodePudding user response：

Not sure why this is tagged both with MySQL and Presto, but here is approach for Presto/Trino.

You can use window functions (lag in particular to compare "current" and "previous" records) and group by. Here is a little bit more generic approach (without filtering, but that is easy to add):

-- sample data
with dataset (id, config, userid, team, createdon) as (
    values (1   , 'config-01', 'user-01', 'Dev',    '2022:08:28'),
        (2  , 'config-02', 'user-01', 'Dev',    '2022:08:27'),
        (3  , 'config-01', 'user-01', 'Dev',    '2022:08:26'),
        (4  , 'config-01', 'user-01', 'QA', '2022:08:25'),
        (5  , 'config-01', 'user-01', 'QA', '2022:08:24'),
        (6  , 'config-20', 'user-01', 'QA', '2022:08:23'),
        (7  , 'config-01', 'user-01', 'QA', '2022:08:22'),
        (8  , 'config-01', 'user-01', 'Prod','2022:08:21'),
        (9  , 'config-01', 'user-01', 'Prod','2022:08:20'),
        (10 , 'config-11', 'user-02', 'Prod','2022:08:19'),
        (11 , 'config-15', 'user-02', 'Prod','2022:08:18')
)

-- query
select userid, 
   team, 
   max_by(configchanged, createdon) configchanged -- get last change status
from (
   select *,
    config != coalesce(lag(config) over (partition by userid, team order by createdon), config) configchanged -- coalesce is used just in case there is only one row 
   from dataset
)
group by userid, team

Output:

userid	team	configchanged
user-01	QA	false
user-01	Prod	false
user-01	Dev	true
user-02	Prod	true