Pandas Dataframe, want to count values in 3 different colums, accounting for duplicate values-CodePudding

Have a pandas dataframe like this:

index id	col1	col2	col3
4	apple	orange	nan
2	orange	orange	nan
5	orange	a fruit	nan
3	apple	a fruit	a fruit
1	nan	nan	nan

To count how many times orange, apple, etc occurs in col1, i can simply use groupby and size(). However, if I wanted to count how many times orange, apple, 'a fruit' etc occur throughout the 3 columns, but make sure that i don't double count anything (like counting orange twice in row 2), how would I do this? is there a way to concatinate the columns and avoid the duplicates?

CodePudding user response：

You can melt, drop_duplicates and value_counts:

(df.melt('index id')
   .drop_duplicates(['index id', 'value'])
   ['value'].value_counts()
)

output:

orange     3
apple      2
a fruit    2
Name: value, dtype: int64

If index id is the index, you can use stack:

(df.stack().reset_index(name='value')
   .drop_duplicates(['index id', 'value'])
   ['value'].value_counts()
)

Another approach, with aggregation to set and explode:

(df.set_index('index id') # only if not the index already
   .agg(set, axis=1)
   .explode()
   .value_counts()
)

CodePudding user response：

A possible solution, which returns 3. It first checks whether there is at least one orange in each row and then sums up the True values:

df.apply(lambda x: x.eq('orange').any(), axis=1).sum()

Thanks to @It_is_Chris, this solution can be improved as follows:

(df[df.columns[1:4]] == 'orange').any(axis=1).sum()

In case the frequencies for all fruits are needed, a solution might be the following:

df.iloc[:,1:4].T.agg(pd.unique).explode().value_counts()

Output:

orange     3
apple      2
a fruit    2
dtype: int64