Home > Software design >  count lines equal with combinations in Python
count lines equal with combinations in Python

Time:11-15

I have a dataframe like this:

col1 col2 col3 col N
x y z f
y x z f
f none none none
z y x f

I need to count the rows that equal, regardless of their combinations.

It means that, in this case, the output shoud be something like this:

col1 col2 col3 col N freq
x y z f 3
f none none none 1

This bacause, according to the input dataset, there are three rows that have the same sequence (line 1, line 2, and line 4).

I tried to use the function "value_counts", however, according to the documentation, this function count only the unique values.

N.B. The initial dataset contain over 200 column.

Any solution?

Thanks

CodePudding user response:

You can aggregate the columns as a unique object depending on the exact logic (a frozenset, a sorted tuple, etc.), then count the values of perform a groupby:

I would use:

out = df.agg(frozenset, axis=1).value_counts()

# or, if NaN should be ignored
out = df.agg(lambda x: frozenset(x.dropna()), axis=1).value_counts()

NB. if you can have multiple values and what to consider those, use sorted in place of frozenset

Output:

(x, y, f, z)    3
(f)             1
dtype: int64

If you really want a DataFrame with one of the original rows:

g = df.groupby(df.agg(frozenset, axis=1), as_index=False)
out = g.first().assign(freq=g['col1'].count())

Output:

  col1  col2  col3  colN  freq
0    x     y     z     f     3
1    f  None  None  None     1

CodePudding user response:

example:

data = {'col1': {0: 'x', 1: 'y', 2: 'f', 3: 'z'},
        'col2': {0: 'y', 1: 'x', 2: None, 3: 'y'},
        'col3': {0: 'z', 1: 'z', 2: None, 3: 'x'},
        'col N': {0: 'f', 1: 'f', 2: None, 3: 'f'}}
df = pd.DataFrame(data)

output(df):

    col1    col2    col3    col N
0   x       y       z       f
1   y       x       z       f
2   f       None    None    None
3   z       y       x       f



Code:

grouper = df.apply(lambda x: tuple(x.sort_values()), axis=1)
df.groupby(grouper).head(1).assign(count=grouper.value_counts().tolist())

output:

    col1    col2    col3    col N   count
0   x       y       z       f       3
2   f       None    None    None    1
  • Related