Have a pandas dataframe like this:
index id | col1 | col2 | col3 | othercolumns ... |
---|---|---|---|---|
4 | apple | orange | nan | |
2 | orange | orange | nan | |
5 | orange | a fruit | nan | |
3 | apple | a fruit | a fruit | |
1 | nan | nan | nan |
To count how many times orange, apple, etc occurs in col1, i can simply use groupby and size(). However, if I wanted to count how many times orange, apple, 'a fruit' etc occur throughout the 3 columns, but make sure that i don't double count anything (like counting orange twice in row 2), how would I do this? is there a way to concatinate the columns and avoid the duplicates?
CodePudding user response:
You can melt
, drop_duplicates
and value_counts
:
(df.melt('index id')
.drop_duplicates(['index id', 'value'])
['value'].value_counts()
)
output:
orange 3
apple 2
a fruit 2
Name: value, dtype: int64
If index id
is the index, you can use stack
:
(df.stack().reset_index(name='value')
.drop_duplicates(['index id', 'value'])
['value'].value_counts()
)
Another approach, with aggregation to set and explode
:
(df.set_index('index id') # only if not the index already
.agg(set, axis=1)
.explode()
.value_counts()
)
CodePudding user response:
A possible solution, which returns 3
. It first checks whether there is at least one orange in each row and then sums up the True
values:
df.apply(lambda x: x.eq('orange').any(), axis=1).sum()
Thanks to @It_is_Chris, this solution can be improved as follows:
(df[df.columns[1:4]] == 'orange').any(axis=1).sum()
In case the frequencies for all fruits are needed, a solution might be the following:
df.iloc[:,1:4].T.agg(pd.unique).explode().value_counts()
Output:
orange 3
apple 2
a fruit 2
dtype: int64