Home > Back-end >  find mean by grouping two columns
find mean by grouping two columns

Time:02-15

I had a dataset that looked like this:

Value         Type       mean
-1.975767     Weather   
-0.540979     Fruits
-2.359127     Fruits
-2.815604     Corona
-0.929755     Weather

I wanted to iterate through each row and calculate a mean value for each row above (only if the Type matches). Mean is calculated by:

sum of all values / number of observations

where number of observations will be the number of times a Type has occurred so far.

For example, in the first row, there's no "weather" row above so for weather n = 1. So the mean would be -1.975767 / 1 = -1.975767.

In the second row, there's no FRUITS row above it, so the mean will just be -0.540979/1 = -0.540979.

However, in the third row, when we scan all previous rows, we see that FRUITS has already occurred before this and hence, n = 2 for Fruits. So we should get the last's FRUIT's value and calculate a new mean. So here, the mean will be -0.540979 (-2.359127) divided by 2.

Value         Type       mean
-1.975767     Weather   -1.975767
-0.540979     Fruits    -0.540979
-2.359127     Fruits    (-0.540979 -2.359127)  / 2
-2.815604     Corona    -2.815604
-0.929755     Weather   (-1.975767 -0.929755) / 2

I used this to achieve this and it worked fine:

df['mean'] = df.groupby('type', as_index=False)['value'].expanding().mean().sort_index(level=1).droplevel(0)

However, now I want to do the same thing based on grouping of two cols Such that Country and Type both match.

Value         Type       mean    Country
-1.975767     Weather            Albania
-0.540979     Fruits             Brazil      --should be grouped
-2.359127     Fruits             Brazil      --should be grouped
-2.815604     Corona             Albania
-0.929755     Weather            China

I tried this:

df['mean'] = df.groupby([df.type,df.country], as_index=False)['value'].expanding().mean().sort_index(level=1).droplevel(0)

However, this gives me an error that:

TypeError: incompatible index of inserted column with frame index

even though its almost the same thing. What am I doing wrong?

CodePudding user response:

Try:

df["Mean"] = df.groupby(["Type", "Country"])["Value"].expanding().mean().droplevel([0,1]).sort_index()

>>> df
      Value     Type  Country      Mean
0 -1.975767  Weather  Albania -1.975767
1 -0.540979   Fruits   Brazil -0.540979
2 -2.359127   Fruits   Brazil -1.450053
3 -2.815604   Corona  Albania -2.815604
4 -0.929755  Weather    China -0.929755
Input df:
df = pd.DataFrame({"Value": [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755],
                   "Type": ["Weather", "Fruits", "Fruits", "Corona", "Weather"],
                   "Country": ["Albania", "Brazil", "Brazil", "Albania", "China"]})
  • Related