Find the most common values for each row consisting of lists


I have a pd.DataFrame in which one column contains lists as its values. I want to create another column that contains only the most common value(s) from each list. Example dataframe:

    col_1
0   [1, 2, 3, 3]
1   [2, 2, 8, 8, 7]
2   [3, 4]

And the expected dataframe is

    col_1           col_2
0   [1, 2, 3, 3]    [3]
1   [2, 2, 8, 8, 7] [2, 8]
2   [3, 4]          [3, 4]

I tried to do

from statistics import mode
df['col_1'].apply(lambda x: mode(x)) 

But it is showing the most common list in that column.

I also tried to use the pandas mode function directly on that column, but that did not help either. Is there any way to find the most common value(s) per row?
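
For reference, statistics.mode can only ever return a single value per list, so that approach cannot produce [2, 8]. A minimal illustration (not part of the original question):

from statistics import mode

# mode() returns one value even when several values are tied for most common:
# Python 3.8+ silently returns the first tied value it encounters,
# Python < 3.8 raises StatisticsError on ties.
print(mode([2, 2, 8, 8, 7]))  # 2, even though 8 occurs just as often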

CodePudding user response:

Just use multimode from the statistics module:

from statistics import multimode

df['col_2'] = df['col_1'].apply(lambda x: multimode(x))
print(df)

             col_1   col_2
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]

CodePudding user response:

Use Series.mode - but it is slow:

df['new'] = df['col_1'].apply(lambda x: pd.Series(x).mode().tolist()) 
print (df)
             col_1     new
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]
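
To see why .tolist() is needed: for a single row, pd.Series(x).mode() returns a Series holding every tied value (a small illustrative snippet, not from the original answer):

s = pd.Series([2, 2, 8, 8, 7]).mode()
print(s)
0    2
1    8
dtype: int64

print(s.tolist())
[2, 8]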

Or use statistics.multimode if performance is important:

from statistics import multimode

df['col_2'] = df['col_1'].apply(multimode) 
print (df)
             col_1   col_2
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]

Performance:

# repeat the 3-row sample 1000 times -> 3000 rows
df = pd.concat([df] * 1000, ignore_index=True)

In [195]: %timeit (df['col_1'].explode().groupby(level=0).apply(lambda x: x.mode().tolist()))
537 ms ± 66.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [196]: %timeit df['col_1'].apply(lambda x: pd.Series(x).mode().tolist())
699 ms ± 77.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [197]: %timeit df['col_1'].apply(multimode)
13.5 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
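
The gap in those timings is mostly because pd.Series(x).mode() has to build a new Series for every row, while statistics.multimode is a small pure-Python helper built on collections.Counter. Roughly, as a sketch for intuition (not the exact stdlib source), it does something like:

from collections import Counter

def multimode_sketch(data):
    # keep every value whose count equals the maximum count,
    # preserving first-encountered order
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

print(multimode_sketch([2, 2, 8, 8, 7]))  # [2, 8]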

CodePudding user response:

Using mode per group:

df['col_2'] = (df['col_1']
               .explode()
               .groupby(level=0)
               .apply(lambda x: x.mode().tolist())
              )

output:

             col_1   col_2
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]
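
To see what the explode/groupby chain is doing, here is the intermediate exploded Series for the sample data: each list element becomes its own row with the original row index repeated, so groupby(level=0) can then take the mode per original row.

print(df['col_1'].explode())
0    1
0    2
0    3
0    3
1    2
1    2
1    8
1    8
1    7
2    3
2    4
Name: col_1, dtype: object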

CodePudding user response:

Try this:

import pandas as pd
from collections import Counter

col_1 = [[1, 2, 3, 3], [2, 2, 8, 8, 7], [3, 4]]
df = pd.DataFrame({'col_1': col_1})

def common(row):
    # count each value, then keep every value whose count equals the maximum
    c = Counter(row)
    c = pd.Series(c)
    return c[c == max(c)].index.values

df['col_2'] = df.col_1.map(common)

df

     col_1            col_2
0    [1, 2, 3, 3]     [3]
1    [2, 2, 8, 8, 7]  [2, 8]
2    [3, 4]           [3, 4]
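
Note that common() returns a NumPy array because of .index.values. If you prefer plain Python lists in col_2, matching the other answers, a small variant (an addition, not part of the original answer) is:

def common(row):
    c = pd.Series(Counter(row))
    # .tolist() gives a plain Python list instead of a NumPy array
    return c[c == c.max()].index.tolist()

df['col_2'] = df.col_1.map(common)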