I have a long list of columns with numerical values. Each column has an associated dummy variable. I want to create a new column with row-wise sums of all columns in my list for that the associated dummy equals 1.
Because of the length of the data, specifying conditions for each column individually is not feasible.
Example data:
df = pd.DataFrame(np.random.randint(0,100,size=(4, 3)), columns=('A', 'B', 'C'))
dummy_array = np.array([[1, 0, 0],[1, 1, 1], [0, 0, 0], [0, 1, 1]])
df[['A_dummy', 'B_dummy', 'C_dummy']] = pd.DataFrame(index=df.index,data = dummy_array)
print(df)
A B C A_dummy B_dummy C_dummy
0 33 56 75 1 0 0
1 12 99 49 1 1 1
2 7 92 84 0 0 0
3 95 19 79 0 1 1
desired solution:
A B C A_dummy B_dummy C_dummy ABC_sum
0 33 56 75 1 0 0 33
1 12 99 49 1 1 1 160
2 7 92 84 0 0 0 0 (or NaN)
3 95 19 79 0 1 1 98
I fail to come of up for a solution that would go through a list = ['A', 'B', 'C'] and do this. Thanks for the input.
The (embarassingly easy) solution for this, provided by @mozway:
abclist = ['A', 'B', 'C'] # I used a regex-filter to create this list
abc_dummylist = [item '_dummy' for item in abclist]
df['ABC_sum'] = (df[abclist ]*df[abc_dummylist ].values).sum(axis=1)
print(df)
A B C A_dummy B_dummy C_dummy ABC_sum
0 70 24 10 1 0 0 70
1 7 15 3 1 1 1 25
2 25 74 23 0 0 0 0
3 22 72 77 0 1 1 149
CodePudding user response:
You could do:
df['ABC_sum'] = (df[['A', 'B', 'C']]*df[['A_dummy', 'B_dummy', 'C_dummy']].values).sum(axis=1)
NB. if you have many columns, you could select them with filter
example:
A B C A_dummy B_dummy C_dummy ABC_sum
0 92 12 74 1 0 0 92
1 40 59 36 1 1 1 135
2 88 70 40 0 0 0 0
3 78 49 53 0 1 1 102