I have monthly data for different age groups. How can I convert it to annual averages in Python?

(https://i.stack.imgur.com/58D8K.png)

import pandas as pd

df = pd.read_csv('1410001701eng.csv')
df.head()
df['date'] = pd.to_datetime(df['Age group'])
df['year'] = pd.DatetimeIndex(df['date']).year
monthly_year_avg = df.groupby('year')['VALUE'].mean()
print(monthly_year_avg)

This is my code. Could you give me a hint, or point me to a site with similar questions? I have monthly data from January 1978 to November 2022. How can I convert the monthly data for each age group to annual values by taking the average?

Or do you think I should calculate it one by one in Excel, since it is only 44 years?

Thank you very much! Much appreciated

I searched for similar questions on Reddit and Stack Overflow; they all used resample to get the result.

CodePudding user response：

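This should give you a new pandas DataFrame with the yearly means, one column per year. It steps through the columns 12 at a time, so it assumes df holds only the month columns, in order, starting from a January. Note that the if branch subtracts 1 from the time step to account for the missing December column in 2022.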

import numpy as np
import pandas as pd

new_df = pd.DataFrame() #create empty pandas dataframe
time_step = 12 #months per year
for i in np.arange(0, len(df.columns), time_step):
    new_header = df.columns[i][-2:] #last two characters of the column name (assumed to be the year)

    if new_header == str(22): #If the year is 2022
        sliced_for_mean = df.iloc[:, i:i + time_step - 1] #take one off from the last step (no December column)
        new_df[new_header] = sliced_for_mean.mean(axis=1) #means for each row appended to new_df

    else: #else do this
        sliced_for_mean = df.iloc[:, i:i + time_step] #sliced df to calculate mean for year
        new_df[new_header] = sliced_for_mean.mean(axis=1) #means for each row appended to new_df

print(new_df)
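
If the month columns do not line up in neat blocks of 12, a variation on the same idea is to group the columns by their year suffix instead of by position. This is only a minimal sketch: the frame `wide`, its age group labels and its "Mon-YY" column names are made up for illustration and are not taken from the asker's file.

```python
import pandas as pd

# Hypothetical wide table: one row per age group, one column per month ("Mon-YY").
wide = pd.DataFrame(
    {
        "Nov-21": [10.0, 20.0],
        "Dec-21": [12.0, 22.0],
        "Jan-22": [14.0, 24.0],
        "Feb-22": [16.0, 26.0],
        "Mar-22": [18.0, 28.0],
    },
    index=["15 to 24 years", "25 to 54 years"],
)

# Key every column by the last two characters of its name (assumed to be the year).
year_of_column = wide.columns.str[-2:].to_numpy()

# Transpose so months become rows, average the rows that share a year, transpose back.
yearly = wide.T.groupby(year_of_column).mean().T
print(yearly)
```

Because the grouping key comes from the column names, a short final year (such as 2022 without December) needs no special case.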

CodePudding user response：

Melt the month columns into a single "month" column and extract the year from it. Then aggregate by year:

import pandas as pd

df = pd.DataFrame(data=[
    ["M", "21-30", None, None, 15000, 21000, 22500, 21800, None, None, None],
    ["M", "31-40", 18000, 19200, 19000, None, None, 21800, 21500, 22300, 22000],
    ["M", "41-50", 22200, None, 15000, 21000, 22500, 21800, None, None, 22000],
], columns=["gender", "age_group", "Nov-20", "Dec-20", "Mar-21", "Apr-21", "May-21", "Jun-21", "Jan-22", "Feb-22", "Mar-22"])

# Missing months are replaced with 0, so they count as 0 in the averages below
df = df.fillna(0)

# Reshape from one column per month to one row per (gender, age_group, month)
df = df.melt(id_vars=["gender", "age_group"], value_vars=df.drop(["gender", "age_group"], axis=1).columns, var_name="month", value_name="value")

# "Nov-20" -> "20"
df["year"] = df["month"].str.split("-").str[1]

df = df.groupby(["gender", "age_group", "year"]).agg(avg=("value", "mean")).reset_index()

[Out]
   gender age_group year           avg
0       M     21-30   20      0.000000
1       M     21-30   21  20075.000000
2       M     21-30   22      0.000000
3       M     31-40   20  18600.000000
4       M     31-40   21  10200.000000
5       M     31-40   22  21933.333333
6       M     41-50   20  11100.000000
7       M     41-50   21  20075.000000
8       M     41-50   22   7333.333333
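
Note that filling the gaps with 0 pulls the averages down (for example, 10200 for 31-40 in year 21 above). If missing months should be ignored instead, a possible variation is to skip the fillna step and let mean() drop the NaN values. The sketch below uses a smaller made-up sample (`sample`, `long_df` and `yearly_avg` are illustrative names) and also parses the labels with pd.to_datetime to get four-digit years; with the `%y` format, 69 to 99 map to 1969 to 1999, so a range like 1978 to 2022 should parse as expected.

```python
import pandas as pd

# Hypothetical sample in the same wide layout; None marks a missing month.
sample = pd.DataFrame(
    data=[
        ["M", "21-30", None, 15000, 21000],
        ["M", "31-40", 18000, 19000, None],
    ],
    columns=["gender", "age_group", "Dec-20", "Mar-21", "Apr-21"],
)

# One row per (gender, age_group, month); all non-id columns are melted.
long_df = sample.melt(id_vars=["gender", "age_group"], var_name="month", value_name="value")

# Parse "Dec-20" style labels into dates so the full year is available.
long_df["year"] = pd.to_datetime(long_df["month"], format="%b-%y").dt.year

# mean() skips NaN by default; a year with no data at all comes out as NaN.
yearly_avg = (
    long_df.groupby(["gender", "age_group", "year"])["value"]
    .mean()
    .reset_index(name="avg")
)
print(yearly_avg)
```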