I had a dataset that looked like this:
Value Type mean
-1.975767 Weather
-0.540979 Fruits
-2.359127 Fruits
-2.815604 Corona
-0.929755 Weather
I wanted to iterate through each row and calculate a mean value for each row above (only if the Type matches). Mean is calculated by:
sum of all values / number of observations
where number of observations will be the number of times a Type has occurred so far.
For example, in the first row, there's no "weather" row above so for weather n = 1. So the mean would be -1.975767 / 1 = -1.975767.
In the second row, there's no FRUITS row above it, so the mean will just be -0.540979/1 = -0.540979.
However, in the third row, when we scan all previous rows, we see that FRUITS has already occurred before this and hence, n = 2 for Fruits. So we should get the last's FRUIT's value and calculate a new mean. So here, the mean will be -0.540979 (-2.359127) divided by 2.
Value Type mean
-1.975767 Weather -1.975767
-0.540979 Fruits -0.540979
-2.359127 Fruits (-0.540979 -2.359127) / 2
-2.815604 Corona -2.815604
-0.929755 Weather (-1.975767 -0.929755) / 2
I used this to achieve this and it worked fine:
df['mean'] = df.groupby('type', as_index=False)['value'].expanding().mean().sort_index(level=1).droplevel(0)
However, now I want to do the same thing based on grouping of two cols Such that Country and Type both match.
Value Type mean Country
-1.975767 Weather Albania
-0.540979 Fruits Brazil --should be grouped
-2.359127 Fruits Brazil --should be grouped
-2.815604 Corona Albania
-0.929755 Weather China
I tried this:
df['mean'] = df.groupby([df.type,df.country], as_index=False)['value'].expanding().mean().sort_index(level=1).droplevel(0)
However, this gives me an error that:
TypeError: incompatible index of inserted column with frame index
even though its almost the same thing. What am I doing wrong?
CodePudding user response:
Try:
df["Mean"] = df.groupby(["Type", "Country"])["Value"].expanding().mean().droplevel([0,1]).sort_index()
>>> df
Value Type Country Mean
0 -1.975767 Weather Albania -1.975767
1 -0.540979 Fruits Brazil -0.540979
2 -2.359127 Fruits Brazil -1.450053
3 -2.815604 Corona Albania -2.815604
4 -0.929755 Weather China -0.929755
Input df:
df = pd.DataFrame({"Value": [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755],
"Type": ["Weather", "Fruits", "Fruits", "Corona", "Weather"],
"Country": ["Albania", "Brazil", "Brazil", "Albania", "China"]})