as a part of a course I'm taking, I need to calculate the monthly cumulative sums of rocket launches and calculate month-to-month rolling averages, and finally show both results on a plot.
The dataframe contains Date
, Launch site
, Mission status
and some other less important parameters. Date is formatted as YYYY-MM-DD, and the number of items per different year-month combinations varies.
The input data looks like this:
Organisation Location Date Detail Rocket_Status Price Mission_Status
4323 RVSN USSR Site 1/5, Baikonur Cosmodrome, Kazakhstan 1957-10-04 19:28:00 00:00 Sputnik 8K71PS | Sputnik-1 StatusRetired 63.23 Success
4322 RVSN USSR Site 1/5, Baikonur Cosmodrome, Kazakhstan 1957-11-03 02:30:00 00:00 Sputnik 8K71PS | Sputnik-2 StatusRetired 63.23 Success
4321 US Navy LC-18A, Cape Canaveral AFS, Florida, USA 1957-12-06 16:44:00 00:00 Vanguard | Vanguard TV3 StatusRetired 63.23 Failure
My approach to this is:
- create a separate df using groupby and egg function:
launchdays = pd.DataFrame(spacerace.groupby(spacerace.loc[:,"Date"].dt.date).agg("size"))
This dataframe contains the following data:
0
Date
1957-10-04 1
1957-11-03 1
1957-12-06 1
1958-02-01 1
1958-02-05 1
- change index to
DateTimeIndex
:
launchdays.index = pd.to_datetime(launchdays.index)
- calculate the per month number of launches and month-to-month rolling average:
monthly_data = launchdays.resample("M").sum()
rolling = launchdays.rolling(30).mean().dropna()
Monthly data after resampling:
0
Date
1957-10-31 1
1957-11-30 1
1957-12-31 1
1958-01-31 0
1958-02-28 2
Rolling average:
0
Date
1958-12-18 1.03
1959-01-02 1.03
1959-02-17 1.03
1959-02-28 1.03
1959-04-13 1.03
This seems to work, and the outputs are, at least to my eye, realistic — however, I want to make sure if my approach is correct and/or if this can be done more elegantly.
Thanks!
Another edit:
I have fixed the part collecting the monthly data and calculating the rolling average:
monthly_data = launchdays.resample("30D").sum()
rolling = monthly_data.rolling(“30D").mean()
However, they both produce exactly the same output now, which is kind of expected... Perhaps I misunderstood something from this exercise? This task is defined as: "Chart the Number of Launches Month-on-Month until the Present. Which month has seen the highest number of launches in all time? Superimpose a rolling average on the month on month time series chart."
CodePudding user response:
I do not know if I understand your actual question, I am not a fan of debating about elegant vs inelegant solutions. If they work, they are good. If another solution is better depends on the way you compare different solutions to the same problem, e.g. requires less time, or less memory, or less lines of code, etc.
Coming back to your question, there is a difference between the rolling average and the resampling sum. The rolling average is a method to smooth your data in order to give the correct trend, see https://en.wikipedia.org/wiki/Moving_average. In contrast, the resample and sum method is a data aggregation on binned data, basically a histogram https://en.wikipedia.org/wiki/Histogram.
So if the question is about which month has the largest number of launches, you need to calculate the histogram and find the maximum.
The rolling average part in your exercise is not well defined, because it does not give a window size or at least gives more information why you should smooth the data. It should certainly be more than 30 days, because there are months with more than 30 days. I guess they mean something like a year (12 months) window, but this is pure speculation.
Edit: I think they mean something like this:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.random.seed(7)
pd.plotting.register_matplotlib_converters()
# create dummy data
dates = (
pd.Timestamp('1970-01-01')
pd.to_timedelta(np.random.randint(0, 19000, 20000), unit='D')
)
success = np.random.random(len(dates)) > 0.05
df = pd.DataFrame({
'date': dates,
'success': success
})
df.sort_values('date', inplace=True)
df.reset_index(drop=True, inplace=True)
# create histogram
monthly = df['date'].groupby(
df['date'].dt.to_period('M').dt.to_timestamp()
).count()
print(monthly.loc[monthly==monthly.max()])
# add zeros for months that do not appear in the original histogram
monthly_dense = monthly.resample('MS').sum()
# plot
fig, ax = plt.subplots()
ax.plot(monthly_dense.index, monthly_dense)
ax.plot(monthly_dense.index, monthly_dense.rolling(12).mean())
fig.autofmt_xdate()