How to find difference between dates and find first purchase in an eCommerce database

Time:12-28

I am using Microsoft SQL Server Management Studio. I am trying to measure the customer retention rate of an eCommerce site.

For this, I need four values:

  • customer_id
  • order_purchase_timestamp
  • age_by_month
  • first_purchase

The values of age_by_month and first_purchase are not in my database. I want to calculate them.

In my database, I have customer_id and order_purchase_timestamp.

The first_purchase should be the earliest instance of order_purchase_timestamp. I only want the month and year.

The age_by_month should be the difference in months between first_purchase and order_purchase_timestamp.

I only want to measure the retention of the customer for each month, so if a customer makes two purchases in the same month, only the first should be shown.

The dates are between 2016-10-01 and 2018-09-30. The result should be ordered by order_purchase_timestamp.

An example:

customer_id   order_purchase_timestamp
1             2016-09-04
2             2016-09-05
3             2016-09-05
3             2016-09-15
1             2016-10-04

to

customer_id   first_purchase   age_by_month   order_purchase_timestamp
1             2016-09          0              2016-09-04
2             2016-09          0              2016-09-05
3             2016-09          0              2016-09-05
1             2016-09          1              2016-10-04

What I have done

SELECT 
    customer_id, order_purchase_timestamp
FROM
    orders
WHERE  
    (order_purchase_timestamp BETWEEN '2016-10-01' AND '2016-12-31') 
    OR (order_purchase_timestamp BETWEEN '2017-01-01' AND '2017-03-31') 
    OR (order_purchase_timestamp BETWEEN '2017-04-01' AND '2017-06-30') 
    OR (order_purchase_timestamp BETWEEN '2017-07-01' AND '2017-09-30') 
    OR (order_purchase_timestamp BETWEEN '2017-10-01' AND '2017-12-31') 
    OR (order_purchase_timestamp BETWEEN '2018-01-01' AND '2018-03-31') 
    OR (order_purchase_timestamp BETWEEN '2018-04-01' AND '2018-06-30') 
    OR (order_purchase_timestamp BETWEEN '2018-07-01' AND '2018-09-30')
ORDER BY 
    order_purchase_timestamp

Originally I was going to do it by quarters but I want to do it in months now.

CodePudding user response:

The following approach is designed to be relatively easy to understand. There are other ways (e.g., window functions) that may be marginally more efficient, but this one should be easy to maintain at your current SQL skill level.

Note that the SQL commands below build on one another (so the answer is at the end). To follow along, here is a db<>fiddle with the working.
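If you want to run the queries locally instead, a minimal setup using the sample rows from the question might look like the sketch below. The column types are an assumption (the question does not state them); datetime is used here only because the results further down show a time component.

```sql
-- Hypothetical setup: the column types are assumed, not taken from the question.
CREATE TABLE orders (
    customer_id              int      NOT NULL,
    order_purchase_timestamp datetime NOT NULL
);

-- Sample rows copied from the question's example.
INSERT INTO orders (customer_id, order_purchase_timestamp) VALUES
    (1, '20160904'),
    (2, '20160905'),
    (3, '20160905'),
    (3, '20160915'),
    (1, '20161004');
```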

It's based around a simple query (which we'll use as a sub-query) that finds the first order_purchase_timestamp for each customer.

SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id

The next piece is DATEDIFF, which finds the difference between two dates in a chosen unit (here, months).

Then you can use the above as a subquery to get the first date onto each row, and find the date difference, e.g.:

SELECT orders.customer_id, 
       orders.order_purchase_timestamp,
       first_purchases.first_purchase_date,
       DATEDIFF(month, first_purchases.first_purchase_date, orders.order_purchase_timestamp) AS age_by_month
FROM   orders
       INNER JOIN
          (SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
           FROM orders
           GROUP BY customer_id
          ) AS first_purchases ON orders.customer_id = first_purchases.customer_id
  • Note - DATEDIFF has a 'gotcha' that catches most people but works in your favour here: when comparing months, it ignores the day component and only counts the month boundaries crossed. For example, the difference in months between 1 Jan and 31 Jan is 0, whereas the difference between 31 Jan and 1 Feb is 1. However, I think this is actually what you want!
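
The month-boundary behaviour in the note can be checked on its own with a quick query. The dates are arbitrary illustrations, chosen only to match the year in the question's data.

```sql
-- DATEDIFF(month, ...) counts month boundaries crossed, not elapsed days.
SELECT DATEDIFF(month, '20160101', '20160131') AS same_month,       -- 0: both dates fall in January
       DATEDIFF(month, '20160131', '20160201') AS adjacent_months;  -- 1: one boundary crossed, one day apart
```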

The above, however, repeats when a customer has multiple purchases within the month (it has one row per purchase). Instead, we can GROUP BY to group by the month it's in, then only take the first purchase for that month.

A 'direct' approach to this would be to group on YEAR(orders.order_purchase_timestamp) AND MONTH(orders.order_purchase_timestamp). However, I use a little trick below - using EOMONTH which finds the last day of the month. EOMONTH returns the same date for any date in that month; therefore, we can group by that.
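
As a small sketch of both options: the first query shows EOMONTH collapsing dates in the same month onto one value, and the second shows the 'direct' YEAR/MONTH grouping against the orders table from the question. The column aliases are made up for illustration, and EOMONTH needs SQL Server 2012 or later.

```sql
-- EOMONTH maps every date in a month to that month's last day,
-- so it can act as a single grouping key per calendar month.
SELECT EOMONTH('20160905') AS a,   -- 2016-09-30
       EOMONTH('20160915') AS b,   -- 2016-09-30 (same key as a)
       EOMONTH('20161004') AS c;   -- 2016-10-31

-- The 'direct' alternative: group on the year and month parts separately.
SELECT customer_id,
       YEAR(order_purchase_timestamp)  AS purchase_year,
       MONTH(order_purchase_timestamp) AS purchase_month,
       MIN(order_purchase_timestamp)   AS first_purchase_in_month
FROM   orders
GROUP BY customer_id,
         YEAR(order_purchase_timestamp),
         MONTH(order_purchase_timestamp);
```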

Finally, you can add the WHERE clause and ORDER BY to get the results you asked for (between the two dates):

SELECT orders.customer_id, 
       MIN(orders.order_purchase_timestamp) AS order_purchase_timestamp,
       first_purchases.first_purchase_date,
       DATEDIFF(month, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)) AS age_by_month
FROM   orders
       INNER JOIN
          (SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
           FROM orders AS orders_ref
           GROUP BY customer_id
          ) AS first_purchases ON orders.customer_id = first_purchases.customer_id
WHERE  orders.order_purchase_timestamp BETWEEN '20161001' AND '20180930'
GROUP BY orders.customer_id, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)
ORDER BY order_purchase_timestamp;

Results - note they are different from yours because you wanted the earliest date to be 2016-10-01, so the September 2016 orders are filtered out.

customer_id   order_purchase_timestamp   first_purchase_date        age_by_month
1             2016-10-04 00:00:00.000    2016-09-04 00:00:00.000    1

Edit: Because someone else will do it like this otherwise! You can do this with a single read-through that will potentially run a little faster. It is also a bit shorter - but harder to understand imo.

The query below uses window functions to calculate both the customer's earliest purchase and the earliest purchase for each month (and uses DISTINCT rather than a GROUP BY). With that, it just does the DATEDIFF to calculate the difference.

WITH monthly_orders AS
    (SELECT DISTINCT orders.customer_id, 
           MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id, EOMONTH(orders.order_purchase_timestamp)) AS order_purchase_timestamp,
           MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id) AS first_purchase_date
     FROM   orders)
SELECT *, DATEDIFF(month, first_purchase_date, order_purchase_timestamp) AS age_by_month
FROM   monthly_orders
WHERE  order_purchase_timestamp BETWEEN '20161001' AND '20180930';

Note, however, that this gives one difference in the results. If a customer has two orders in a month and your lower date filter falls between the two (e.g., orders on 15 October and 20 October, with a minimum date of 16 October), the row won't be included, because the earliest purchase in the month is outside the filter range. The GROUP BY version filters before grouping, so it would return the 20 October order instead.

Also, with both of these queries, be aware of which date or datetime type you are using. If the column holds datetimes rather than plain dates, BETWEEN '20161001' AND '20180930' is not the same as >= '20161001' AND < '20181001': the BETWEEN form stops at midnight at the start of 30 September, so orders placed later that day are excluded.
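
A minimal sketch of that difference, using a throwaway table variable (the name @t and the two sample rows are made up for illustration, and the column is assumed to be datetime):

```sql
DECLARE @t TABLE (order_purchase_timestamp datetime);
INSERT INTO @t VALUES ('20180930 00:00:00'), ('20180930 14:30:00');

-- BETWEEN is inclusive, but '20180930' means 2018-09-30 00:00:00,
-- so the 14:30 order on the last day is missed.
SELECT COUNT(*) AS between_count
FROM   @t
WHERE  order_purchase_timestamp BETWEEN '20161001' AND '20180930';   -- 1

-- A half-open range keeps every moment of 30 September.
SELECT COUNT(*) AS half_open_count
FROM   @t
WHERE  order_purchase_timestamp >= '20161001'
  AND  order_purchase_timestamp <  '20181001';                       -- 2
```

If the column is a plain date, both filters return the same rows.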

CodePudding user response:

Here is a short query that achieves what you want (descriptions of the methods used are inline):

declare @test table (
    customer_id int,
    order_purchase_timestamp date
)
-- some test data
insert into @test values
(1, '2016-09-04'),
(2, '2016-09-05'),
(3, '2016-09-05'),
(3, '2016-09-15'),
(1, '2016-10-04');

select
    customer_id,
    -- takes care of correct display of first_purchase
    format(first_purchase, 'yyyy-MM') first_purchase,
    -- used to get the difference in months
    datediff(m, first_purchase, order_purchase_timestamp) age_by_month,
    order_purchase_timestamp
from (
    select 
        *,
        -- window function used to find min value for given column within group
        -- for each row
        min(order_purchase_timestamp) over (partition by customer_id) first_purchase
    from @test
) a
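
As written, the query above returns one row per purchase, so customer 3's second September order still appears. One possible way to keep only the first purchase per customer per month is to number the rows within each month and keep the first; this is a sketch rather than part of the original answer, and the rn column and the customer_id tie-breaker in the ORDER BY are additions for illustration.

```sql
declare @test table (
    customer_id int,
    order_purchase_timestamp date
);
insert into @test values
(1, '2016-09-04'),
(2, '2016-09-05'),
(3, '2016-09-05'),
(3, '2016-09-15'),
(1, '2016-10-04');

select
    customer_id,
    format(first_purchase, 'yyyy-MM') as first_purchase,
    datediff(month, first_purchase, order_purchase_timestamp) as age_by_month,
    order_purchase_timestamp
from (
    select
        *,
        -- earliest purchase overall for each customer
        min(order_purchase_timestamp) over (partition by customer_id) as first_purchase,
        -- numbers each customer's purchases within a calendar month, earliest first
        row_number() over (partition by customer_id, eomonth(order_purchase_timestamp)
                           order by order_purchase_timestamp) as rn
    from @test
) a
where rn = 1   -- keep only the first purchase in each month
order by order_purchase_timestamp, customer_id;
```

With the sample data this leaves four rows, matching the expected output in the question.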