Home > Software engineering >  How to sum over some columns based on condition in pandas
How to sum over some columns based on condition in pandas

Time:04-21

I have a data frame like this:

mydf = {'p1':[0.1, 0.2, 0.3], 'p2':[0.2, 0.1,0.3], 'p3':[0.1,0.9, 0.01], 'p4':[0.11, 0.2, 0.4], 'p5':[0.3, 0.1,0.5],
        'w1':['cancel','hello', 'hi'], 'w2':['good','bad','ugly'], 'w3':['thanks','CUSTOM_MASK','great'],
        'w4':['CUSTOM_MASK','CUSTOM_UNKNOWN', 'trible'],'w5':['CUSTOM_MASK','CUSTOM_MASK','job']}
df = pd.DataFrame(mydf)

So what I need to do is to sum up all values in column p1,p2,p3,p4,p5 if the correspondent values in w1,w2,w3,w4,w5 is not CUSTOM_MASK or CUSTOM_UNKNOWN.

So the result would be to add a column to the data frame like this: (0.1 0.2 0.1=0.4 is for the first row).

top_p
0.4
0.3
1.51

So my question is that is there any pandas way to do this?

What I have done so far is to loop through the rows and then columns and check the values (CUSTOM_MASK, CUSTOM_UNKNOWN) and then sum it up if those values was not exist in the columns.

CodePudding user response:

You can use mask. The idea is to create a boolean mask with the w columns, and use it to filter the relevant w columns and sum:

df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)

Output:

    p1   p2    p3    p4   p5      w1    w2           w3              w4           w5  top_p
0  0.1  0.2  0.10  0.11  0.3  cancel  good       thanks     CUSTOM_MASK  CUSTOM_MASK   0.40
1  0.2  0.1  0.90  0.20  0.1   hello   bad  CUSTOM_MASK  CUSTOM_UNKNOWN  CUSTOM_MASK   0.30
2  0.3  0.3  0.01  0.40  0.5      hi  ugly        great          trible          job   1.51

Before summing, the output of mask looks like:

    p1   p2    p3   p4   p5
0  0.1  0.2  0.10  NaN  NaN
1  0.2  0.1   NaN  NaN  NaN
2  0.3  0.3  0.01  0.4  0.5

CodePudding user response:

Here's a way to do this using np.dot():

pCols, wCols = ['p' str(i   1) for i in range(5)], ['w' str(i   1)for i in range(5)]
mydf['top_p'] = mydf.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)

We first prepare the two sets of column names p1,...,p5 and w1,...,w5.

Then we use apply() to take the dot product of the values in the pN columns with the filtering criteria based on the wN columns (namely include only contributions from pN column values whose corresponding wN column value is not in the list of excluded strings).

Output:

    p1   p2    p3    p4   p5      w1    w2           w3              w4           w5  top_p
0  0.1  0.2  0.10  0.11  0.3  cancel  good       thanks     CUSTOM_MASK  CUSTOM_MASK   0.40
1  0.2  0.1  0.90  0.20  0.1   hello   bad  CUSTOM_MASK  CUSTOM_UNKNOWN  CUSTOM_MASK   0.30
2  0.3  0.3  0.01  0.40  0.5      hi  ugly        great          trible          job   1.51
  • Related