I have the following dataframe and I am trying to add a new column df['count'] that returns the count from each row based on the condition specified in df['E'].
df = pd.DataFrame({"A": [14, 4, 5, 4, 1],
"B": [5, 2, 54, 3, 2],
"C": [20, 20, 7, 3, 8],
"D": [14, 3, 6, 2, 6],
"E":[14, 2, 5, 3, 1]})
The output would look something like this:
A B C D E count
0 14 5 20 14 14 2
1 4 2 20 3 2 1
2 5 54 7 6 5 1
3 4 3 3 2 3 2
4 1 2 8 6 1 1
I think the function should look something like df['count'] = df[df == df['E']].count(axis=1). However, that is only returning 0s and includes df['E] in the output.
CodePudding user response:
You could use:
df['count'] = df.drop('E', axis=1).eq(df['E'], axis=0).sum(axis=1)
or:
# count all including E and remove 1
df['count'] = df.eq(df['E'], axis=0).sum(axis=1).sub(1)
output:
A B C D E count
0 14 5 20 14 14 2
1 4 2 20 3 2 1
2 5 54 7 6 5 1
3 4 3 3 2 3 2
4 1 2 8 6 1 1
note on the speed
Timing on 500k rows:
%%timeit
df.drop('E', axis=1).eq(df['E'], axis=0).sum(axis=1)
# 7.27 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.eq(df['E'], axis=0).sum(axis=1).sub(1)
# 5.88 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.apply(lambda x:sum(x["A":"D"]==x["E"]), axis=1)
# still running after 5 minutes… (takes ~10s on only 50k rows)
CodePudding user response:
An alternative could be:
df.loc[:,"count"]=df.apply(lambda x:sum(x["A":"D"]==x["E"]), axis=1)