I have the following table:
date user_city
2000-01-01 amsterdam
2000-01-01 copenhagen
2000-01-01 amsterdam
2000-01-01 vienna
2000-01-01 prague
2000-01-02 vienna
2000-01-02 amsterdam
2000-01-02 tokio
2000-01-03 copenhagen
2000-01-03 london
2000-01-03 prague
2000-01-03 amsterdam
...
Now I would like to have a 7-day rolling average of the number of the user_city
values without repetition (so in case of 2000-01-01
the daily number is only 4, as amsterdam
is represented two times). What is the relevant SQL (MySQL) query to get the result?
Expected outcome (in case of a 3-day rolling average fitted to the example above):
window_start_date avg
2001-01-01 3.6666
2001-01-04 ?
2001-01-07 ?
I was able to produce the code in Python, it may helps understanding my aim:
df = original_table.groupby("date")["user_city"].nunique()
df = df.reset_index()
df["rolling_avg"] = df["user_city"].rolling(7).mean()
CodePudding user response:
After you edited your question, this will work for your three day rolling average.
First you can create a helper table, which will hold the start and end date of the windows you want to calculate the average over. You can do this by using a recursive cte
. It is called date_range
in the query below and will look like this:
window_start_date | window_end_date |
---|---|
2000-01-01 | 2000-01-03 |
2000-01-04 | 2000-01-06 |
2000-01-07 | 2000-01-09 |
... | ... |
To get the correct amount of visits per day, you can use another cte counting the DISTINCT
appearances of user_city and grouping by date. This is called daily_counts
in the query below.
You can now LEFT JOIN
the visits to the date range, using a conditional statement meaning the date of the visit is BETWEEN
the window_start_date and window_end_date.
Lastly, GROUP BY
the window_start_date and calculate the average number of visited cities.
WITH RECURSIVE date_range AS (
SELECT '2000-01-01' as window_start_date,
'2000-01-01' INTERVAL 2 day as window_end_date
UNION ALL
SELECT window_start_date INTERVAL 3 day,
window_end_date INTERVAL 3 day
from date_range
where window_start_date <= '2000-12-31'
),
daily_counts as (
SELECT date, COUNT(DISTINCT user_city) AS daily_count
FROM visits
GROUP BY date
)
SELECT window_start_date, AVG(daily_count) AS avg
FROM date_range
LEFT JOIN daily_counts
ON daily_counts.date BETWEEN window_start_date AND window_end_date
GROUP BY window_start_date
Added a dbfiddle which includes some more dates in the future. You can change the interval values in the date_range cte to get a 7 day rolling average.
CodePudding user response:
Here is a possible SQL query that you can use to calculate the 7-day rolling average of the number of unique user_city values:
SELECT date,
AVG(num_cities) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_num_cities
FROM (
SELECT date, COUNT(DISTINCT user_city) AS num_cities
FROM mytable
GROUP BY date
) t
This query first uses a subquery to calculate the number of unique user_city values for each date. Then, the outer query uses the AVG function with a window function to calculate the 7-day rolling average of this number. The window function uses the ROWS BETWEEN clause to specify a window of 7 days, starting from 6 days before the current row up to the current row.
The result of this query will be a table containing the date and the 7-day rolling average of the number of unique user_city values for each date. For example, for the data you provided, the result might look like this:
date avg_num_cities
2000-01-01 4
2000-01-02 4
2000-01-03 3.5
Note that the first six rows in the result will not have a complete 7-day window, so the average will be calculated based on a smaller number of days.
I hope this helps!