I have a table of time-series data like data in SQL which follows the following format:
Note: Each customer_ID can have multiple transaction_IDs
Customer_ID | transaction_ID | Timestamp | Value1 | Value2 |
---|---|---|---|---|
1 | 1 | 01/01/2022 17:00:00 | 1 | NULL |
1 | 1 | 01/01/2022 17:05:00 | NULL | Foo |
1 | 1 | 01/01/2022 17:10:00 | NULL | Bar |
1 | 1 | 01/01/2022 17:15:00 | 2 | NULL |
1 | 2 | 01/01/2022 17:20:00 | NULL | Wolf |
I want to create a view based on this data which follows the following format:
Customer_ID | transaction_ID | Timestamp | Value1 | Value2 |
---|---|---|---|---|
1 | 1 | 01/01/2022 17:00:00 | 1 | NULL |
1 | 1 | 01/01/2022 17:05:00 | 1 | Foo |
1 | 1 | 01/01/2022 17:10:00 | 1 | Bar |
1 | 1 | 01/01/2022 17:15:00 | 2 | Bar |
1 | 2 | 01/01/2022 17:20:00 | NULL | Wolf |
Essentially I want to "roll" the data up so that the Value1 and Value2 are the most recent value for that transaction_id at a given timestamp.
I have tried things like PARTITION BY OVER statements but it this concatenates the values into a list instead of giving the most recent value(for strings) or their sum(for numeric values).
SELECT * FROM
(SELECT
transaction_id,
timestamp,
STRING_AGG(Value1) OVER(PARTITION BY transaction_id) AS Value1,
STRING_AGG(Value2) OVER(PARTITION BY transaction_id) AS Value2
FROM Database;
CodePudding user response:
Traditional way (use subqueries as column expressions):
select
t1.Customer_Id
, t1.transaction_id
, t1.Timestamp
, (select t2.value1
from testable t2
where t2.Customer_Id=t1.Customer_Id
and t2.transaction_Id=T1.transaction_Id
and t2.TimeStamp=(select max(TimeStamp) as V1_TS
from testable t3
where t3.Customer_Id=t2.Customer_Id
and t3.transaction_Id=t2.transaction_Id
and t3.Value1 is not null
)
) LastVal1
, (select t2.value2
from testable t2
where t2.Customer_Id=t1.Customer_Id
and t2.transaction_Id=T1.transaction_Id
and t2.TimeStamp=(select max(TimeStamp) as V1_TS
from testable t3
where t3.Customer_Id=t2.Customer_Id
and t3.transaction_Id=t2.transaction_Id
and t3.Value2 is not null
)
) LastVal2
from testable t1
CodePudding user response:
To create a common value each row shares with the row the substitute should come from, you can first use the windowed version of sum()
calculating a cumulative sum. Use value1 IS NOT NULL
(or value2 IS NOT NULL
), which as a Boolean expression in numerical context will be implicitly casted to 0
or 1
in MySQL. (In other flavors of SQL (and also in MySQL itself) this can be done explicitly by using a CASE
expression.)
Then you can use this common value to partition by and use the first_value()
window function to get the substitute value which will be in the, be the order of the timestamps, first row of the rows sharing a common value, i.e. in the partition.
Something along the lines of:
SELECT customer_id,
transaction_id,
timestamp,
first_value(value1) OVER (PARTITION BY transaction_id,
cv1
ORDER BY timestamp ASC) AS value1,
first_value(value2) OVER (PARTITION BY transaction_id,
cv2
ORDER BY timestamp ASC) AS value2
FROM (SELECT customer_id,
transaction_id,
timestamp,
value1,
value2,
sum(value1 IS NOT NULL) OVER (PARTITION BY transaction_id
ORDER BY timestamp ASC) AS cv1,
sum(value2 IS NOT NULL) OVER (PARTITION BY transaction_id
ORDER BY timestamp ASC) AS cv2
FROM elbat) x;
(Untested because you failed to deliver the example as consumable DDL and DML as you should have.)
CodePudding user response:
An old method (also works in MySql 7.x), is to use variables.
select Customer_ID, transaction_ID, Timestamp, Value1, Value2 from ( select Customer_ID , Timestamp , case when (value1 is not null or transaction_ID != @transId) and @val1 := value1 then value1 else @val1 end as Value1 , case when (value2 is not null or transaction_ID != @transId) and @val2 := value2 then value2 else @val2 end as Value2 , @transId := transaction_ID as transaction_ID from your_table cross join (select @transId:=0, @val1:=0, @val2:='') vals order by Customer_ID, transaction_ID, Timestamp ) q
Customer_ID | transaction_ID | Timestamp | Value1 | Value2 |
---|---|---|---|---|
1 | 1 | 2022-01-01 17:00:00 | 1 | null |
1 | 1 | 2022-01-01 17:05:00 | 1 | Foo |
1 | 1 | 2022-01-01 17:10:00 | 1 | Bar |
1 | 1 | 2022-01-01 17:15:00 | 2 | Bar |
1 | 2 | 2022-01-01 17:20:00 | null | Wolf |
Side-note, the sort order in the sub-query matters in this method.
Another method is to use correlated sub-queries
select Customer_ID, transaction_ID, Timestamp
, coalesce(Value1, (select t2.Value1
from your_table t2
where t2.transaction_ID = t.transaction_ID
and t2.Value1 is not null
and t2.Timestamp < t.Timestamp
order by t2.Timestamp desc
limit 1
)) as Value1
, coalesce(Value2, (select t2.Value2
from your_table t2
where t2.transaction_ID = t.transaction_ID
and t2.Value2 is not null
and t2.Timestamp < t.Timestamp
order by t2.Timestamp desc
limit 1
)) as Value2
from your_table t
order by Customer_ID, transaction_ID, Timestamp;
Demo on db<>fiddle here