I have a dataset that looks like this:
Value Type X_sq
-1.975767 Weather
-0.540979 Fruits
-2.359127 Fruits
-2.815604 Corona
-0.929755 Weather
I want to go through the rows and, for each row, calculate a running sum of squares of Value over that row and every row above it with the same Type. I want to put this value in the X_sq column.
For example, the first row has nothing above it, so its value is just (-1.975767 * -1.975767). In the second row, there is no Fruits row above it, so the value is just -0.540979 * -0.540979. In the third row, however, scanning the previous rows finds that Fruits is already there, so we should take the last Fruits row's X_sq value and add the new square to it.
Value Type X_sq
-1.975767 Weather -1.975767 * -1.975767 = x
-0.540979 Fruits -0.540979 * -0.540979 = y
-2.359127 Fruits y + (-2.359127 * -2.359127)
-2.815604 Corona -2.815604 * -2.815604
-0.929755 Weather x + (-0.929755 * -0.929755)
What would be an efficient way to do this?
def updateSS(X_sq, X_new):
    # add the square of the new value to the running sum of squares
    return X_sq + X_new**2
CodePudding user response:
Use:
df['X_sq'] = df['Value'].pow(2).groupby(df['Type']).cumsum()
print(df)
# Output
Value Type X_sq
0 -1.975767 Weather 3.903655
1 -0.540979 Fruits 0.292658
2 -2.359127 Fruits 5.858138
3 -2.815604 Corona 7.927626
4 -0.929755 Weather 4.768100
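To see what the one-liner computes, the same per-type running sum can be sketched in plain Python with a dictionary of running totals, which mirrors the updateSS idea from the question. This is only an illustration: the function name running_sum_of_squares and the plain-list inputs are assumptions made for the example, not part of pandas.

```python
from collections import defaultdict

def running_sum_of_squares(values, types):
    """For each row, sum the squares of all rows so far with the same type."""
    totals = defaultdict(float)  # running total per type, starts at 0.0
    result = []
    for value, kind in zip(values, types):
        totals[kind] += value ** 2   # same update as updateSS in the question
        result.append(totals[kind])
    return result

values = [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755]
types = ["Weather", "Fruits", "Fruits", "Corona", "Weather"]
x_sq = running_sum_of_squares(values, types)
print([round(v, 6) for v in x_sq])
```

On a large DataFrame the vectorized groupby/cumsum version should be preferred; the loop is here only to make the logic explicit.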
CodePudding user response:
If you only want to accumulate over consecutive rows of the same type, first build an id for each run of consecutive rows with the same Type:
groupid = (df['Type'] != df['Type'].shift()).cumsum()
You can then group the dataframe on it and again use cumsum on each group:
df['X_sq'] = df.groupby(groupid)['Value'].transform(lambda x: (x*x).cumsum())
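Note that the sum restarts whenever Type changes, so the last Weather row gets 0.864444 instead of the 4.768100 that accumulating over all previous Weather rows would give: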
Value Type X_sq
0 -1.975767 Weather 3.903655
1 -0.540979 Fruits 0.292658
2 -2.359127 Fruits 5.858138
3 -2.815604 Corona 7.927626
4 -0.929755 Weather 0.864444
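Because groupid increases every time Type changes, each run of equal types forms its own group. A plain-Python sketch with itertools.groupby, which likewise only merges adjacent equal keys, makes that reset visible. The function name consecutive_sum_of_squares and the list inputs are assumptions made for this illustration.

```python
from itertools import accumulate, groupby

def consecutive_sum_of_squares(values, types):
    """Running sum of squares that restarts whenever the type changes."""
    result = []
    rows = zip(types, values)
    # groupby only merges adjacent rows with equal keys,
    # like the shift/cumsum group id built above
    for _, run in groupby(rows, key=lambda row: row[0]):
        squares = (value ** 2 for _, value in run)
        result.extend(accumulate(squares))  # cumulative sum within the run
    return result

values = [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755]
types = ["Weather", "Fruits", "Fruits", "Corona", "Weather"]
run_sq = consecutive_sum_of_squares(values, types)
print([round(v, 6) for v in run_sq])
```

The two Fruits rows are adjacent, so they still accumulate; the second Weather row starts a new run and therefore holds only its own square.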