Home > Net >  How to compare and show if any changes in historical data in AWS Athena?
How to compare and show if any changes in historical data in AWS Athena?

Time:08-29

I have a table where historical data is stored. The sample structure of the table is like,

id config userid team createdon
1 "config-01" "user-01" "Dev" "2022:08:28"
2 "config-02" "user-01" "Dev" "2022:08:27"
3 "config-01" "user-01" "Dev" "2022:08:26"
4 "config-01" "user-01" "QA" "2022:08:25"
5 "config-01" "user-01" "QA" "2022:08:24"
6 "config-20" "user-01" "QA" "2022:08:23"
7 "config-01" "user-01" "QA" "2022:08:22"
8 "config-01" "user-01" "Prod" "2022:08:21"
9 "config-01" "user-01" "Prod" "2022:08:20"
10 "config-11" "user-02" "Prod" "2022:08:19"
11 "config-15" "user-02" "Prod" "2022:08:18"

Now, I need to query data based on user-id, so first 9 rows will be taken for user-id "user-01". From these 9 rows I need to take latest two rows(based on createdon) of each team. So the table as of now looks like this,

id config userid team createdon
1 "config-01" "user-01" "Dev" "2022:08:28"
2 "config-02" "user-01" "Dev" "2022:08:27"
4 "config-01" "user-01" "QA" "2022:08:28"
5 "config-01" "user-01" "QA" "2022:08:27"
8 "config-01" "user-01" "Prod" "2022:08:28"
9 "config-01" "user-01" "Prod" "2022:08:27"

Now I need to compare two rows of each team against their config column and say if there is a change between them. So the final response I am looking for will be,

userid team configchanged
"user-01" "Dev" TRUE
"user-01" "QA" FALSE
"user-01" "Prod" FALSE

As of now I was able to reach at a point where I can see rows with same team together using this self join query,

select * from table t1, table t2
where t1.id <> t2.id and t1.team = t2.team and t1.userid = "user-01"

I need help on rest of the part.

Any help on generating a query to achieve this solution would be really helpful. Thank You

CodePudding user response:

Not sure why this is tagged both with MySQL and Presto, but here is approach for Presto/Trino.

You can use window functions (lag in particular to compare "current" and "previous" records) and group by. Here is a little bit more generic approach (without filtering, but that is easy to add):

-- sample data
with dataset (id, config, userid, team, createdon) as (
    values (1   , 'config-01', 'user-01', 'Dev',    '2022:08:28'),
        (2  , 'config-02', 'user-01', 'Dev',    '2022:08:27'),
        (3  , 'config-01', 'user-01', 'Dev',    '2022:08:26'),
        (4  , 'config-01', 'user-01', 'QA', '2022:08:25'),
        (5  , 'config-01', 'user-01', 'QA', '2022:08:24'),
        (6  , 'config-20', 'user-01', 'QA', '2022:08:23'),
        (7  , 'config-01', 'user-01', 'QA', '2022:08:22'),
        (8  , 'config-01', 'user-01', 'Prod','2022:08:21'),
        (9  , 'config-01', 'user-01', 'Prod','2022:08:20'),
        (10 , 'config-11', 'user-02', 'Prod','2022:08:19'),
        (11 , 'config-15', 'user-02', 'Prod','2022:08:18')
)

-- query
select userid, 
   team, 
   max_by(configchanged, createdon) configchanged -- get last change status
from (
   select *,
    config != coalesce(lag(config) over (partition by userid, team order by createdon), config) configchanged -- coalesce is used just in case there is only one row 
   from dataset
)
group by userid, team

Output:

userid team configchanged
user-01 QA false
user-01 Prod false
user-01 Dev true
user-02 Prod true
  • Related