I'm trying to do something that seems completely non readable, and I was wondering if I could make it simplier ( And I bet I could, but can't find how ).
This is my code :
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df['D'] = [1,2,2,1,2]
df1 = df.query('B<1')[['A', 'D']].groupby('D').sum().reset_index()
df = df.set_index(['D'])
df1 = df1.set_index(['D'])
df2 = df1.join(df[['B', 'C']], how='inner', on=['D']).reset_index()
df2.loc[df2['B'] > 1, 'A'] = 0
df
A B C
D
1 0.702204 2.288548 -0.251334
2 -0.346842 -1.486899 -0.576246
2 -1.183607 -2.210152 0.409037
1 -0.884401 0.124899 1.719387
2 0.305400 0.988187 0.160168
df2
D A B C
0 1 0.000000 2.288548 -0.251334
1 1 -0.884401 0.124899 1.719387
2 2 -1.225049 -1.486899 -0.576246
3 2 -1.225049 -2.210152 0.409037
4 2 -1.225049 0.988187 0.160168
This is what I came up, and it is correct but not really readable.
I need:
- To query every line where B < 1
- Group by another column ( D )
- Sum a third column ( A )
- Put 0 in every line not matching query in 1
Is it possible to do that without that ugly code?
Thanks,
CodePudding user response:
First sort values by D
if need same order like df2
, then convert A
to 0
if greater like 1
, so possible use GroupBy.transform
with sum
for new Series with same size like original DataFrame, last set 0
if B
is greater like 1
and assign to column A
:
#query 4 use query 1, so possible reuse mask
m = df.B.lt(1)
df['A'] = df['A'].where(m, 0).groupby(df['D']).transform('sum').where(m,0)
print (df)
D A B C
0 1 0.000000 2.288548 -0.251334
1 2 -1.225049 -1.486899 -0.576246
2 2 -1.225049 -2.210152 0.409037
3 1 -0.884401 0.124899 1.719387
4 2 -1.225049 0.988187 0.160168
How it working:
print (df['A'].where(m, 0))
0 0.000000
1 -1.225049
2 -1.225049
3 -0.884401
4 -1.225049
Name: A, dtype: float64
print (df['A'].where(m, 0).groupby(df['D']).transform('sum'))
0 -0.884401
1 -3.675147
2 -3.675147
3 -0.884401
4 -3.675147
Name: A, dtype: float64
print (df['A'].where(m, 0).groupby(df['D'])
.transform('sum').where(m,0))
0 0.000000
1 -3.675147
2 -3.675147
3 -0.884401
4 -3.675147
Name: A, dtype: float64