pandas computing new column as a average of other two conditions-CodePudding

So I have this dataset of temperatures. Each line describe the temperature in celsius measured by hour in a day.

So, I need to compute a new variable called avg_temp_ar_mensal which representsthe average temperature of a city in a month. City in this dataset is represented as estacao and month as mes.

I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:

df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()

The goal of this code is to store in a new column the average of the temperature of the city and month. But it doesn't work. If I try the following line of code:

df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()

It will works, but it is wrong. It will calculate for every city of the dataset and I don't want it because it will cause noise in my data. I need to separate each temperature based on month and city and then calculate the mean.

CodePudding user response：

You are calling a groupby on a single column when you are doing df2['temp_ar'].groupby(...). This doesn't make much sense since in a single column, there's nothing to group by.

Instead, you have to perform the groupby on all the columns you need. Also, make sure that the final output is a series and not a dataframe

df['new_column'] = df[['city_column', 'month_column', 'temp_column']].groupby(['city_column', 'month_column']).mean()['temp_column']

This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df

CodePudding user response：

The dataframe after groupby is smaller than the initial dataframe, that is why your code run into error.

There is two ways to solve this problem. The first one is using transform as:

df.groupby(['mes', 'estacao'])['temp_ar'].transform(lambda g: g.mean())

The second is to create a new dfn from groupby then merge back to df

dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left']