Find the most common values for each row consisting of lists


I have a pd.DataFrame in which one column contains lists as its values. I want to create another column that contains only the most common value(s) from each list. Example dataframe:

    col_1
0   [1, 2, 3, 3]
1   [2, 2, 8, 8, 7]
2   [3, 4]

And the expected dataframe is

    col_1           col_2
0   [1, 2, 3, 3]    [3]
1   [2, 2, 8, 8, 7] [2, 8]
2   [3, 4]          [3, 4]

I tried to do

from statistics import mode
df['col_1'].apply(lambda x: mode(x)) 

But it is showing the most common list in that column.

I also tried to use the pandas mode function directly on that column, but that did not help either. Is there any way to find the most common value(s) per row?
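
For reference, statistics.mode can only ever return a single value per list, so that approach cannot produce [2, 8]. A minimal illustration (not part of the original question):

from statistics import mode

# mode() returns one value even when several values are tied for most common:
# Python 3.8+ silently returns the first tied value it encounters,
# Python < 3.8 raises StatisticsError on ties.
print(mode([2, 2, 8, 8, 7]))  # 2, even though 8 occurs just as often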

CodePudding user response:

Just use multimode from the statistics module:

from statistics import multimode

df['col_2'] = df['col_1'].apply(lambda x: multimode(x))
print(df)

             col_1   col_2
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]

CodePudding user response:

Use Series.mode - but it is slow:

df['new'] = df['col_1'].apply(lambda x: pd.Series(x).mode().tolist()) 
print (df)
             col_1     new
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]
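
To see why .tolist() is needed: for a single row, pd.Series(x).mode() returns a Series holding every tied value (a small illustrative snippet, not from the original answer):

s = pd.Series([2, 2, 8, 8, 7]).mode()
print(s)
0    2
1    8
dtype: int64

print(s.tolist())
[2, 8]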

Or use statistics.multimode if performance is important:

from statistics import multimode

df['col_2'] = df['col_1'].apply(multimode) 
print (df)
             col_1   col_2
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]

Performance:

# repeat the 3-row sample 1000 times -> 3000 rows
df = pd.concat([df] * 1000, ignore_index=True)

In [195]: %timeit (df['col_1'].explode().groupby(level=0).apply(lambda x: x.mode().tolist()))
537 ms ± 66.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [196]: %timeit df['col_1'].apply(lambda x: pd.Series(x).mode().tolist())
699 ms ± 77.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [197]: %timeit df['col_1'].apply(multimode)
13.5 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
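
The gap in those timings is mostly because pd.Series(x).mode() has to build a new Series for every row, while statistics.multimode is a small pure-Python helper built on collections.Counter. Roughly, as a sketch for intuition (not the exact stdlib source), it does something like:

from collections import Counter

def multimode_sketch(data):
    # keep every value whose count equals the maximum count,
    # preserving first-encountered order
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

print(multimode_sketch([2, 2, 8, 8, 7]))  # [2, 8]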

CodePudding user response:

Using mode per group:

df['col_2'] = (df['col_1']
               .explode()
               .groupby(level=0)
               .apply(lambda x: x.mode().tolist())
              )

output:

             col_1   col_2
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]
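
To see what the explode/groupby chain is doing, here is the intermediate exploded Series for the sample data: each list element becomes its own row with the original row index repeated, so groupby(level=0) can then take the mode per original row.

print(df['col_1'].explode())
0    1
0    2
0    3
0    3
1    2
1    2
1    8
1    8
1    7
2    3
2    4
Name: col_1, dtype: object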

CodePudding user response:

Try this:

import pandas as pd
from collections import Counter

col_1 = [[1, 2, 3, 3], [2, 2, 8, 8, 7], [3, 4]]
df = pd.DataFrame({'col_1': col_1})

def common(row):
    # count each value, then keep every value whose count equals the maximum
    c = Counter(row)
    c = pd.Series(c)
    return c[c == max(c)].index.values

df['col_2'] = df.col_1.map(common)

df

     col_1            col_2
0    [1, 2, 3, 3]     [3]
1    [2, 2, 8, 8, 7]  [2, 8]
2    [3, 4]           [3, 4]
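
Note that common() returns a NumPy array because of .index.values. If you prefer plain Python lists in col_2, matching the other answers, a small variant (an addition, not part of the original answer) is:

def common(row):
    c = pd.Series(Counter(row))
    # .tolist() gives a plain Python list instead of a NumPy array
    return c[c == c.max()].index.tolist()

df['col_2'] = df.col_1.map(common)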