I have a dataframe like this:
col1 | col2 | col3 | col N |
---|---|---|---|
x | y | z | f |
y | x | z | f |
f | none | none | none |
z | y | x | f |
I need to count the rows that contain the same values, regardless of the order of those values.
That means that, in this case, the output should be something like this:
col1 | col2 | col3 | col N | freq |
---|---|---|---|---|
x | y | z | f | 3 |
f | none | none | none | 1 |
This is because, according to the input dataset, there are three rows that contain the same set of values (line 1, line 2, and line 4).
I tried to use the function "value_counts"; however, according to the documentation, this function only counts unique values.
N.B. The initial dataset contains over 200 columns.
Any solution?
Thanks
CodePudding user response:
You can aggregate each row into a single hashable object, depending on the exact logic (a frozenset, a sorted tuple, etc.), then count the values or perform a groupby:
I would use:
out = df.agg(frozenset, axis=1).value_counts()
# or, if NaN should be ignored
out = df.agg(lambda x: frozenset(x.dropna()), axis=1).value_counts()
N.B. If a row can contain repeated values and you want to take them into account, use sorted in place of frozenset (a sketch follows the output below).
Output:
(x, y, f, z) 3
(f) 1
dtype: int64
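For the sorted variant mentioned in the N.B., a minimal sketch (assuming None/NaN should still be dropped first, since sorted cannot compare None with strings):
# keep duplicates within a row by using a sorted tuple as the key
out = df.agg(lambda x: tuple(sorted(x.dropna())), axis=1).value_counts()
The keys are then sorted tuples such as ('f', 'x', 'y', 'z'), which, unlike a frozenset, preserve repeated values within a row.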
If you really want a DataFrame with one of the original rows:
g = df.groupby(df.agg(frozenset, axis=1), as_index=False)
out = g.first().assign(freq=g['col1'].count())
Output:
col1 col2 col3 colN freq
0 x y z f 3
1 f None None None 1
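As an alternative sketch (my own addition, not part of the answer above): map the frequency of each row key back onto the DataFrame, then drop the duplicated keys, keeping the original index:
# row-wise frozenset key, as above
key = df.agg(frozenset, axis=1)
# attach the frequency of each key and keep only the first occurrence of each key
out = df.assign(freq=key.map(key.value_counts())).loc[~key.duplicated()]
Here the result keeps the original row labels (0 and 2) instead of a reset 0..n index.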
CodePudding user response:
Example:
import pandas as pd

data = {'col1': {0: 'x', 1: 'y', 2: 'f', 3: 'z'},
        'col2': {0: 'y', 1: 'x', 2: None, 3: 'y'},
        'col3': {0: 'z', 1: 'z', 2: None, 3: 'x'},
        'col N': {0: 'f', 1: 'f', 2: None, 3: 'f'}}
df = pd.DataFrame(data)
Output (df):
col1 col2 col3 col N
0 x y z f
1 y x z f
2 f None None None
3 z y x f
Code:
# row-wise key: the row's values as a sorted tuple (missing values sort last)
grouper = df.apply(lambda x: tuple(x.sort_values()), axis=1)
# keep the first row of each group and attach the counts from value_counts
df.groupby(grouper).head(1).assign(count=grouper.value_counts().tolist())
output:
col1 col2 col3 col N count
0 x y z f 3
2 f None None None 1
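One caveat (my note, not part of the answer): head(1) keeps groups in order of first appearance, while value_counts() is sorted by descending frequency, so assigning the .tolist() only lines up when the most frequent group also appears first, as it happens to do here. A sketch that avoids relying on that ordering, reusing the same grouper:
# group size aligned to each row, so no ordering assumption is needed
counts = grouper.groupby(grouper).transform('size')
df.assign(count=counts).groupby(grouper).head(1)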