Pandas Dataframe, want to count values in 3 different columns, accounting for duplicate values


I have a pandas dataframe like this:

index id   col1     col2      col3      othercolumns ...
4          apple    orange    nan
2          orange   orange    nan
5          orange   a fruit   nan
3          apple    a fruit   a fruit
1          nan      nan       nan

To count how many times orange, apple, etc. occur in col1, I can simply use groupby and size(). However, if I wanted to count how many times orange, apple, 'a fruit', etc. occur across the three columns, while making sure I don't double count anything (like counting orange twice in row 2), how would I do this? Is there a way to concatenate the columns and avoid the duplicates?
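
For reference, here is a rough reconstruction of the frame (column names and the nan placement are read off the display above, so treat it as a sketch):

import pandas as pd
import numpy as np

# assumed reconstruction of the example frame shown above
df = pd.DataFrame({
    'index id': [4, 2, 5, 3, 1],
    'col1': ['apple', 'orange', 'orange', 'apple', np.nan],
    'col2': ['orange', 'orange', 'a fruit', 'a fruit', np.nan],
    'col3': [np.nan, np.nan, np.nan, 'a fruit', np.nan],
})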

CodePudding user response:

You can melt, drop_duplicates and value_counts:

(df.melt('index id')
   .drop_duplicates(['index id', 'value'])
   ['value'].value_counts()
)

output:

orange     3
apple      2
a fruit    2
Name: value, dtype: int64

If index id is the index, you can use stack:

(df.stack().reset_index(name='value')
   .drop_duplicates(['index id', 'value'])
   ['value'].value_counts()
)

Another approach, with aggregation to set and explode:

(df.set_index('index id') # only if not the index already
   .agg(set, axis=1)
   .explode()
   .value_counts()
)

CodePudding user response:

A possible solution, which returns 3 (the number of rows containing at least one orange). It first checks whether there is an orange anywhere in each row and then sums the True values:

df.apply(lambda x: x.eq('orange').any(), axis=1).sum()

Thanks to @It_is_Chris, this solution can be improved (a vectorized comparison restricted to the three fruit columns, avoiding apply) as follows:

(df[df.columns[1:4]] == 'orange').any(axis=1).sum()

In case the frequencies for all fruits are needed, a solution might be the following:

df.iloc[:,1:4].T.agg(pd.unique).explode().value_counts()

Output:

orange     3
apple      2
a fruit    2
dtype: int64