Home > Net >  Efficient way in Pandas to count occurrences of Series of values by row
Efficient way in Pandas to count occurrences of Series of values by row

Time:04-17

I have a large dataframe for which I want to count the number of occurrences of a series specific values (given by an external function) by row. For reproducibility let's assume the following simplified dataframe:

data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
df
   A  B  C  D  E
0  3  4  1  1  4
1  2  3  2  1  3
2  1  2  3  2  2
3  0  1  4  2  4

How can I count the number of occurrences of specific values (given by a series with the same size) by row?

Again for simplicity, let's assume this value_series is given by the max of each row.

values_series = df.max(axis=1)
0    4
1    3
2    3
3    4
dtype: int64

The solution I got to seems not very pythonic (e.g. I'm using iterrows(), which is slow):

max_count = []
for index, row in df.iterrows():
    max_count.append(row.value_counts()[values_series.loc[index]])
df_counts = pd.Series(max_count)

Is there any more efficient way to do this?

CodePudding user response:

We can compare the transposed df.T directly to the df.max series, thanks to broadcasting:

(df.T == df.max(axis=1)).sum()

# result
0    2
1    1
2    1
3    2
dtype: int64

(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)

CodePudding user response:

You can try

df.eq(df.max(1),axis=0).sum(1)
Out[361]: 
0    2
1    1
2    1
3    2
dtype: int64

CodePudding user response:

The perfect job for numpy broadcasting:

a = df.to_numpy()
b = values_series.to_numpy()[:, None]

(a == b).sum(axis=1)
  • Related