Home > Mobile >  Filter timeseries data based on multiple criteria
Filter timeseries data based on multiple criteria

Time:12-05

I have a table where I store timeseries data:

customer_id transaction_type transaction_date transaction_value
1 buy 2022-12-04 100.0
1 sell 2022-12-04 80.0
2 buy 2022-12-04 120.0
2 sell 2022-12-03 120.0
1 buy 2022-12-02 90.0
1 sell 2022-12-02 70.0
2 buy 2022-12-01 110.0
2 sell 2022-12-01 110.0

Number of customers and transaction types is not limited. Currently there are over 10,000 customers and over 600 transaction types. Dates of transactions between customers can be unique and will not always align based on any criteria (that's why I've tried using LATERAL JOIN — you'll see it later).

I want to filter those records to get customers IDs with the values of the transaction where any arbitrary condition is met. Number of those conditions in a query is not restricted to two — can be anything. For example:

Give me all customers who have a buy with value > $90 and a sale with value > 100$ as their latest transactions

The final query should return these two rows:

customer_id transaction_type transaction_date transaction_value
2 buy 2022-12-04 120$
2 sell 2022-12-03 120$

The closest I've came to what I need was by creating a materialized view cross-joining customer IDs and transaction_types:

customer_id transaction_type
1 buy
1 sell
2 buy
2 sell

And then running a LATERAL JOIN between table with transactions and customer_transactions materialized view:

SELECT *
  FROM customer_transactions
  JOIN LATERAL (
    SELECT *
      FROM transactions
     WHERE (transactions.customer_id = customer_transactions.customer_id)
       AND (transactions.transaction_type = customer_transactions.transaction_type)
       AND transactions.transaction_date <= '2022-12-04' -- this can change for filtering records back in time
     ORDER BY transactions.transaction_date DESC
     LIMIT 1
  ) transactions ON TRUE
 WHERE customer_transactions.transaction_type = 'buy'
   AND customer_transactions.transaction_value > 90

It seems to be working when one condition is specified. But as soon as subsequential conditions are introduced that's where things start falling apart for me; changing condition to:

 WHERE (customer_transactions.transaction_type = 'buy'
   AND customer_transactions.transaction_value > 90)
   AND (customer_transactions.transaction_type = 'sell'
   AND customer_transactions.transaction_value > 100)

is obviously not going to work as there is no row that satisfies both of these conditions.

Is it possible to achieve this using the aproach I took? If so what am I missing? Or maybe there is another way to solve that would be more appropriate?

CodePudding user response:

You could use a CTE with row_number and chech out the last transactios

WITH CTE as (SELECT
"customer_id", "transaction_type", "transaction_date",
  "transaction_value",
ROW_NUMBER() OVER(PARTITION BY "customer_id", "transaction_type" ORDER BY  "transaction_date" DESC) rn
FROM tab1)
SELECT "customer_id", "transaction_type", "transaction_date",
  "transaction_value" FROM CTE
  WHERE rn = 1 
  AND CASE WHEN "transaction_type" = 'buy' THEN ("transaction_value" > 90) 
WHEN "transaction_type" = 'sell' THEN ("transaction_value" > 100) 
ELSE FALSE END 
AND (SELECT COUNT(*) FROM CTE c1 
  WHERE c1."customer_id"= CTE."customer_id" and rn = 1
    AND CASE WHEN "transaction_type" = 'buy' THEN ("transaction_value" > 90) 
WHEN "transaction_type" = 'sell' THEN ("transaction_value" > 100) 
ELSE FALSE END ) = 2
customer_id transaction_type transaction_date transaction_value
2 buy 2022-12-04 120.0
2 sell 2022-12-03 120.0
SELECT 2

fiddle

CodePudding user response:

Use distinct on with custom order to select all the latest transactions per user according to your several criteria (hence the OR) - latest CTE, then count the number of result records per user using count as a window function - latest_with_count CTE - and finally pick these that have a count equal to the number of criteria.
This may be a bit abstract template but hopefully helps with the generic problem. The approach would work for any number of conditions.

with t as
(
 /*
  your query here with several conditions in DISJUNCTION (OR) here, i.e.
  WHERE (customer_transactions.transaction_type = 'buy'  AND customer_transactions.transaction_value > 90)
     OR (customer_transactions.transaction_type = 'sell' AND customer_transactions.transaction_value > 100)
 */
),
latest as 
(
 select distinct on (customer_id, transaction_type) *
 from t
 -- pick the latest per customer & type
 order by customer_id, transaction_type, transaction_date desc
),
latest_with_count as
(
 select *, count(*) over (partition by customer_id) cnt
 from latest
)
select * 
from latest_with_count
where cnt = 2 -- the number of criteria
  • Related