How to count duplicate rows in a pandas dataframe where the order of the column values is not important

Can the logic of "How to count duplicate rows in pandas dataframe?" be extended so that rows holding the same values in a different column order are also counted as duplicates of each other?

Imagine a dataframe like this:

    fruit1  fruit2
0   apple   banana
1   cherry  orange
3   apple   banana
4   banana  apple

We want to produce an output like this:

    fruit1  fruit2  occurences
0   apple   banana  3
1   cherry  orange  1

Is this possible?

CodePudding user response:

You can sort the values within each row with np.sort, assign the sorted values back to the dataframe, and then use value_counts():

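import numpy as np

# Sort the values within each row, so ('banana', 'apple') becomes ('apple', 'banana').
# Note: this overwrites df in place.
df.loc[:] = np.sort(df, axis=1)
# Count identical rows and name the count column.
out = df.value_counts().reset_index(name='occurences')
print(out)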

Output:

   fruit1  fruit2  occurences
0   apple  banana           3
1  cherry  orange           1
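
For reference, a self-contained sketch of the same idea that builds the sample dataframe from the question and leaves df untouched instead of overwriting it. The name sorted_df is only illustrative, and the sketch assumes the values in each row can be compared with one another (here they are all strings).

```python
import numpy as np
import pandas as pd

# Sample data from the question (the index skips 2, as in the original).
df = pd.DataFrame(
    {
        "fruit1": ["apple", "cherry", "apple", "banana"],
        "fruit2": ["banana", "orange", "banana", "apple"],
    },
    index=[0, 1, 3, 4],
)

# Sort the values within each row and store them in a new frame,
# so the original df keeps its column order.
sorted_df = pd.DataFrame(
    np.sort(df.to_numpy(), axis=1), columns=df.columns, index=df.index
)

# Count identical rows; value_counts() orders the result by count, descending.
out = sorted_df.value_counts().reset_index(name="occurences")
print(out)
```

Working on a copy is a design choice: it costs one extra frame but keeps the original column order available for later steps.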

CodePudding user response:

You could use np.sort along axis=1 to sort the values in your rows.

Then it's just the regular groupby.size():

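import numpy as np
import pandas as pd

fruit_cols = ['fruit1', 'fruit2']
# Sort the values within each row so the column order no longer matters.
df_sort = pd.DataFrame(np.sort(df.values, axis=1), columns=fruit_cols)
# Count the rows in each group of identical (sorted) values.
print(df_sort.groupby(fruit_cols, as_index=False).size())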

prints:

   fruit1  fruit2  size
0   apple  banana     3
1  cherry  orange     1
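
If the count column should be called occurences, as in the question, the groupby approach can be wrapped in a small helper. This is only a sketch: count_unordered_rows and its parameters are hypothetical names, not part of pandas, and it assumes the values in the chosen columns are mutually comparable.

```python
import numpy as np
import pandas as pd


def count_unordered_rows(df, cols, name="occurences"):
    """Count rows that hold the same values in `cols`, ignoring their order."""
    # Sort the values within each row (assumes they are mutually comparable).
    sorted_rows = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), columns=cols)
    # groupby(...).size() returns a Series; reset_index names the count column.
    return sorted_rows.groupby(cols).size().reset_index(name=name)


# Sample data from the question.
df = pd.DataFrame(
    {
        "fruit1": ["apple", "cherry", "apple", "banana"],
        "fruit2": ["banana", "orange", "banana", "apple"],
    },
    index=[0, 1, 3, 4],
)

result = count_unordered_rows(df, ["fruit1", "fruit2"])
print(result)
```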