How to calculate moving average of distinct value count in SQL?-CodePudding

I have the following table:

date       user_city 
2000-01-01 amsterdam
2000-01-01 copenhagen
2000-01-01 amsterdam
2000-01-01 vienna
2000-01-01 prague
2000-01-02 vienna
2000-01-02 amsterdam
2000-01-02 tokio
2000-01-03 copenhagen
2000-01-03 london
2000-01-03 prague
2000-01-03 amsterdam
...

Now I would like to have a 7-day rolling average of the number of the user_city values without repetition (so in case of 2000-01-01 the daily number is only 4, as amsterdam is represented two times). What is the relevant SQL (MySQL) query to get the result?

Expected outcome (in case of a 3-day rolling average fitted to the example above):

window_start_date avg
2001-01-01        3.6666
2001-01-04        ?
2001-01-07        ?

I was able to produce the code in Python, it may helps understanding my aim:

df = original_table.groupby("date")["user_city"].nunique()
df = df.reset_index()
df["rolling_avg"] = df["user_city"].rolling(7).mean()

CodePudding user response：

After you edited your question, this will work for your three day rolling average.

First you can create a helper table, which will hold the start and end date of the windows you want to calculate the average over. You can do this by using a recursive cte. It is called date_range in the query below and will look like this:

window_start_date	window_end_date
2000-01-01	2000-01-03
2000-01-04	2000-01-06
2000-01-07	2000-01-09
...	...

To get the correct amount of visits per day, you can use another cte counting the DISTINCT appearances of user_city and grouping by date. This is called daily_counts in the query below.

You can now LEFT JOIN the visits to the date range, using a conditional statement meaning the date of the visit is BETWEEN the window_start_date and window_end_date.

Lastly, GROUP BY the window_start_date and calculate the average number of visited cities.

WITH RECURSIVE date_range AS (
    SELECT '2000-01-01' as window_start_date, 
          '2000-01-01'   INTERVAL 2 day as window_end_date
    UNION ALL
    SELECT window_start_date   INTERVAL 3 day, 
           window_end_date   INTERVAL 3 day
    from date_range
    where window_start_date <= '2000-12-31'
), 

  daily_counts as (
    SELECT date, COUNT(DISTINCT user_city) AS daily_count
    FROM visits
    GROUP BY date
)

SELECT window_start_date, AVG(daily_count) AS avg
FROM date_range
LEFT JOIN daily_counts
  ON daily_counts.date BETWEEN window_start_date AND window_end_date
GROUP BY window_start_date

Added a dbfiddle which includes some more dates in the future. You can change the interval values in the date_range cte to get a 7 day rolling average.

CodePudding user response：

Here is a possible SQL query that you can use to calculate the 7-day rolling average of the number of unique user_city values:

SELECT date,
       AVG(num_cities) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_num_cities
FROM (
    SELECT date, COUNT(DISTINCT user_city) AS num_cities
    FROM mytable
    GROUP BY date
) t

This query first uses a subquery to calculate the number of unique user_city values for each date. Then, the outer query uses the AVG function with a window function to calculate the 7-day rolling average of this number. The window function uses the ROWS BETWEEN clause to specify a window of 7 days, starting from 6 days before the current row up to the current row.

The result of this query will be a table containing the date and the 7-day rolling average of the number of unique user_city values for each date. For example, for the data you provided, the result might look like this:

date        avg_num_cities
2000-01-01  4
2000-01-02  4
2000-01-03  3.5

Note that the first six rows in the result will not have a complete 7-day window, so the average will be calculated based on a smaller number of days.

I hope this helps!