Cumulative number of data frame rows with groupby and rolling average-CodePudding

as a part of a course I'm taking, I need to calculate the monthly cumulative sums of rocket launches and calculate month-to-month rolling averages, and finally show both results on a plot.

The dataframe contains Date, Launch site, Mission status and some other less important parameters. Date is formatted as YYYY-MM-DD, and the number of items per different year-month combinations varies.

The input data looks like this:

Organisation    Location    Date    Detail  Rocket_Status   Price   Mission_Status
4323    RVSN USSR   Site 1/5, Baikonur Cosmodrome, Kazakhstan   1957-10-04 19:28:00 00:00   Sputnik 8K71PS | Sputnik-1  StatusRetired   63.23   Success
4322    RVSN USSR   Site 1/5, Baikonur Cosmodrome, Kazakhstan   1957-11-03 02:30:00 00:00   Sputnik 8K71PS | Sputnik-2  StatusRetired   63.23   Success
4321    US Navy LC-18A, Cape Canaveral AFS, Florida, USA    1957-12-06 16:44:00 00:00   Vanguard | Vanguard TV3 StatusRetired   63.23   Failure

My approach to this is:

create a separate df using groupby and egg function:

launchdays = pd.DataFrame(spacerace.groupby(spacerace.loc[:,"Date"].dt.date).agg("size"))

This dataframe contains the following data:

            0
Date    
1957-10-04  1
1957-11-03  1
1957-12-06  1
1958-02-01  1
1958-02-05  1

change index to DateTimeIndex:

launchdays.index = pd.to_datetime(launchdays.index)

calculate the per month number of launches and month-to-month rolling average:

monthly_data = launchdays.resample("M").sum()
rolling = launchdays.rolling(30).mean().dropna()

Monthly data after resampling:

            0
Date    
1957-10-31  1
1957-11-30  1
1957-12-31  1
1958-01-31  0
1958-02-28  2

Rolling average:

            0
Date    
1958-12-18  1.03
1959-01-02  1.03
1959-02-17  1.03
1959-02-28  1.03
1959-04-13  1.03

This seems to work, and the outputs are, at least to my eye, realistic — however, I want to make sure if my approach is correct and/or if this can be done more elegantly.

Thanks!

Another edit:

I have fixed the part collecting the monthly data and calculating the rolling average:

monthly_data = launchdays.resample("30D").sum()
rolling = monthly_data.rolling(“30D").mean()

However, they both produce exactly the same output now, which is kind of expected... Perhaps I misunderstood something from this exercise? This task is defined as: "Chart the Number of Launches Month-on-Month until the Present. Which month has seen the highest number of launches in all time? Superimpose a rolling average on the month on month time series chart."

CodePudding user response：

I do not know if I understand your actual question, I am not a fan of debating about elegant vs inelegant solutions. If they work, they are good. If another solution is better depends on the way you compare different solutions to the same problem, e.g. requires less time, or less memory, or less lines of code, etc.

Coming back to your question, there is a difference between the rolling average and the resampling sum. The rolling average is a method to smooth your data in order to give the correct trend, see https://en.wikipedia.org/wiki/Moving_average. In contrast, the resample and sum method is a data aggregation on binned data, basically a histogram https://en.wikipedia.org/wiki/Histogram.

So if the question is about which month has the largest number of launches, you need to calculate the histogram and find the maximum.

The rolling average part in your exercise is not well defined, because it does not give a window size or at least gives more information why you should smooth the data. It should certainly be more than 30 days, because there are months with more than 30 days. I guess they mean something like a year (12 months) window, but this is pure speculation.

Edit: I think they mean something like this:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

np.random.seed(7)
pd.plotting.register_matplotlib_converters()

# create dummy data
dates = (
    pd.Timestamp('1970-01-01')  
    pd.to_timedelta(np.random.randint(0, 19000, 20000), unit='D')
)
success = np.random.random(len(dates)) > 0.05
df = pd.DataFrame({
    'date': dates,
    'success': success
})
df.sort_values('date', inplace=True)
df.reset_index(drop=True, inplace=True)

# create histogram
monthly = df['date'].groupby(
    df['date'].dt.to_period('M').dt.to_timestamp()
).count()
print(monthly.loc[monthly==monthly.max()])

# add zeros for months that do not appear in the original histogram
monthly_dense = monthly.resample('MS').sum()

# plot
fig, ax = plt.subplots()
ax.plot(monthly_dense.index, monthly_dense)
ax.plot(monthly_dense.index, monthly_dense.rolling(12).mean())
fig.autofmt_xdate()