I wonder if we can extend the logic of "How to count duplicate rows in pandas dataframe?" so that rows holding the same values in a different column order are also counted as duplicates.
Imagine a dataframe like this:
fruit1 fruit2
0 apple banana
1 cherry orange
3 apple banana
4 banana apple
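For anyone who wants to reproduce this, here is a minimal sketch that builds the frame. It assumes the index labels 0, 1, 3, 4 shown above are intentional and that both columns hold plain strings:

```python
import pandas as pd

# Example frame from the question; the index labels are copied as shown (assumption).
df = pd.DataFrame(
    {
        "fruit1": ["apple", "cherry", "apple", "banana"],
        "fruit2": ["banana", "orange", "banana", "apple"],
    },
    index=[0, 1, 3, 4],
)
print(df)
```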
We want to produce an output like this:
fruit1 fruit2 occurences
0 apple banana 3
1 cherry orange 1
Is this possible?
CodePudding user response:
You can re-assign the np.sort values directly, then use value_counts():
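import numpy as np
# Sort the values within each row (axis=1) so that column order no longer matters.
# Note: this overwrites the contents of df in place.
df.loc[:] = np.sort(df, axis=1)
# Count identical rows and name the count column 'occurences'.
out = df.value_counts().reset_index(name='occurences')
print(out)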
Output:
fruit1 fruit2 occurences
0 apple banana 3
1 cherry orange 1
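If the original frame should stay untouched, the same idea can probably be applied to a sorted copy instead. This is only a sketch: it rebuilds the example data itself, the name sorted_df is made up for illustration, and it assumes a pandas version that provides DataFrame.value_counts (1.1 or later):

```python
import numpy as np
import pandas as pd

# Example data rebuilt here so the snippet runs on its own.
df = pd.DataFrame(
    {
        "fruit1": ["apple", "cherry", "apple", "banana"],
        "fruit2": ["banana", "orange", "banana", "apple"],
    },
    index=[0, 1, 3, 4],
)

# Build a row-wise sorted copy (sorted_df is a hypothetical name); df is left as it was.
sorted_df = pd.DataFrame(
    np.sort(df.to_numpy(), axis=1), index=df.index, columns=df.columns
)

# Count identical rows in the sorted copy.
out = sorted_df.value_counts().reset_index(name="occurences")
print(out)
```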
CodePudding user response:
You could use np.sort along axis=1 to sort the values in your rows. Then it's just the regular groupby.size():
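import numpy as np
import pandas as pd
fruit_cols = ['fruit1', 'fruit2']
# Sort each row's values so that ('banana', 'apple') becomes ('apple', 'banana').
df_sort = pd.DataFrame(np.sort(df.values, axis=1), columns=fruit_cols)
print(df_sort.groupby(fruit_cols, as_index=False).size())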
prints:
fruit1 fruit2 size
0 apple banana 3
1 cherry orange 1
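The sort-then-group idea could also be wrapped in a small function so it works for any list of columns. A rough sketch follows; count_unordered_rows is a hypothetical helper name, and it assumes the values in the chosen columns are mutually comparable (for example all strings) with no missing values:

```python
import numpy as np
import pandas as pd


def count_unordered_rows(df, cols, name="occurences"):
    """Hypothetical helper: count rows that match on `cols` regardless of column order."""
    # Sort the values within each row so that column order is ignored.
    sorted_vals = np.sort(df[cols].to_numpy(), axis=1)
    df_sort = pd.DataFrame(sorted_vals, columns=cols)
    # Group identical sorted rows and count them.
    return df_sort.groupby(cols).size().reset_index(name=name)


# Example data rebuilt here so the snippet runs on its own.
df = pd.DataFrame(
    {
        "fruit1": ["apple", "cherry", "apple", "banana"],
        "fruit2": ["banana", "orange", "banana", "apple"],
    },
    index=[0, 1, 3, 4],
)
result = count_unordered_rows(df, ["fruit1", "fruit2"])
print(result)
```

After sorting, the column names no longer say which original column a value came from; they only label the first and second value of each sorted row.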