Home > Enterprise >  SQL - Counting users that have multiple transactions and have at least one transaction that has been
SQL - Counting users that have multiple transactions and have at least one transaction that has been

Time:02-14

Dataset Here is the task : Count users that have multiple transactions and have at least one transaction that has been made within 7 days interval of the other one.

Structure of dataset: Row, userId, orderId, date

Date is formatted as YYYY-MM-DDTHH:MM:SS Example: 2016-09-16T11:32:06

I have completed the first part (counting users with multiple transactions), but I do not know how to do the second part in the same query. I will be thankful for help.

Here is the console:

query = '''
SELECT COUNT(*) 
FROM 
  (SELECT userId FROM `dataset` GROUP BY userId HAVING COUNT(orderId) > 1)
'''

project_id = 'acdefg'

df = pd.io.gbq.read_gbq(query, project_id=project_id, dialect='standard')
display(df)

CodePudding user response:

To solve this issue you want to be able to compare each record to a previous record: when was the last order from the same user. This hints to the use of partitions and window functions, in this case LAG.

A possible way to solve the problem is to organise records per user and order them by orderDate and then for each record have a look at the record just above:

WITH intermediate_table AS (
    SELECT
        userId,
        orderDate,
        LAG(orderDate) 
            OVER (PARTITION BY userId ORDER BY orderDate) -- this is where we pick the orderDate of the record right above, once the orders are organized by userId and ordered by orderDate
    FROM `dataset.table`
)
    
SELECT userId
FROM intermediate_table
WHERE DATE_DIFF(orderDate, previous_order, DAY) <= 7
GROUP BY userId

Once orderDate and previous_order info are gathered in the same record, it's easy to compare them and see if there is less than 7 days between the two.

(GROUP BY is used for returning userIds only once in the resulting table)

CodePudding user response:

This may be what you need:

-- for each order calculate the days since that customer's last order
order_profiler AS (
  SELECT
    orderId,
    orderDate,
    custId,
    DATE_DIFF(orderDate, LAG(orderDate) OVER (PARTITION BY custId ORDER BY orderDate), day) AS order_latency_days,
  FROM
    `dataset.table` 
)

SELECT 
custId,
FROM order_profiler
WHERE order_latency_days <= 7
GROUP BY custId
  • Related