I have a data frame like this:
mydf = {'p1':[0.1, 0.2, 0.3], 'p2':[0.2, 0.1,0.3], 'p3':[0.1,0.9, 0.01], 'p4':[0.11, 0.2, 0.4], 'p5':[0.3, 0.1,0.5],
'w1':['cancel','hello', 'hi'], 'w2':['good','bad','ugly'], 'w3':['thanks','CUSTOM_MASK','great'],
'w4':['CUSTOM_MASK','CUSTOM_UNKNOWN', 'trible'],'w5':['CUSTOM_MASK','CUSTOM_MASK','job']}
df = pd.DataFrame(mydf)
So what I need to do is to sum up all values in column p1,p2,p3,p4,p5
if the correspondent values in w1,w2,w3,w4,w5
is not CUSTOM_MASK
or CUSTOM_UNKNOWN
.
So the result would be to add a column to the data frame like this: (0.1 0.2 0.1=0.4 is for the first row).
top_p
0.4
0.3
1.51
So my question is that is there any pandas way to do this?
What I have done so far is to loop through the rows and then columns and check the values (CUSTOM_MASK, CUSTOM_UNKNOWN)
and then sum it up if those values was not exist in the columns.
CodePudding user response:
You can use mask
. The idea is to create a boolean mask with the w
columns, and use it to filter the relevant w
columns and sum
:
df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
Output:
p1 p2 p3 p4 p5 w1 w2 w3 w4 w5 top_p
0 0.1 0.2 0.10 0.11 0.3 cancel good thanks CUSTOM_MASK CUSTOM_MASK 0.40
1 0.2 0.1 0.90 0.20 0.1 hello bad CUSTOM_MASK CUSTOM_UNKNOWN CUSTOM_MASK 0.30
2 0.3 0.3 0.01 0.40 0.5 hi ugly great trible job 1.51
Before sum
ming, the output of mask
looks like:
p1 p2 p3 p4 p5
0 0.1 0.2 0.10 NaN NaN
1 0.2 0.1 NaN NaN NaN
2 0.3 0.3 0.01 0.4 0.5
CodePudding user response:
Here's a way to do this using np.dot()
:
pCols, wCols = ['p' str(i 1) for i in range(5)], ['w' str(i 1)for i in range(5)]
mydf['top_p'] = mydf.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
We first prepare the two sets of column names p1,...,p5
and w1,...,w5
.
Then we use apply()
to take the dot product of the values in the pN
columns with the filtering criteria based on the wN
columns (namely include only contributions from pN
column values whose corresponding wN
column value is not in the list of excluded strings).
Output:
p1 p2 p3 p4 p5 w1 w2 w3 w4 w5 top_p
0 0.1 0.2 0.10 0.11 0.3 cancel good thanks CUSTOM_MASK CUSTOM_MASK 0.40
1 0.2 0.1 0.90 0.20 0.1 hello bad CUSTOM_MASK CUSTOM_UNKNOWN CUSTOM_MASK 0.30
2 0.3 0.3 0.01 0.40 0.5 hi ugly great trible job 1.51