Python Pandas Sumif by conditional column value-CodePudding

I have the following dataframe and I am trying to add a new column df['count'] that returns the count from each row based on the condition specified in df['E'].

df = pd.DataFrame({"A": [14, 4, 5, 4, 1],
                   "B": [5, 2, 54, 3, 2],
                   "C": [20, 20, 7, 3, 8],
                   "D": [14, 3, 6, 2, 6],
                   "E":[14, 2, 5, 3, 1]})

The output would look something like this:

    A   B   C   D   E  count
0  14   5  20  14  14      2
1   4   2  20   3   2      1
2   5  54   7   6   5      1
3   4   3   3   2   3      2
4   1   2   8   6   1      1

I think the function should look something like df['count'] = df[df == df['E']].count(axis=1). However, that is only returning 0s and includes df['E] in the output.

CodePudding user response：

You could use:

df['count'] = df.drop('E', axis=1).eq(df['E'], axis=0).sum(axis=1)

or:

# count all including E and remove 1
df['count'] = df.eq(df['E'], axis=0).sum(axis=1).sub(1)

output:

    A   B   C   D   E  count
0  14   5  20  14  14      2
1   4   2  20   3   2      1
2   5  54   7   6   5      1
3   4   3   3   2   3      2
4   1   2   8   6   1      1

note on the speed

Timing on 500k rows:

%%timeit
df.drop('E', axis=1).eq(df['E'], axis=0).sum(axis=1)
# 7.27 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.eq(df['E'], axis=0).sum(axis=1).sub(1)
# 5.88 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.apply(lambda x:sum(x["A":"D"]==x["E"]), axis=1)
# still running after 5 minutes… (takes ~10s on only 50k rows)

CodePudding user response：

An alternative could be:

df.loc[:,"count"]=df.apply(lambda x:sum(x["A":"D"]==x["E"]), axis=1)