I have pd.DataFrame
in which one column contains lists
as the values. I want to create another column which consist only the most common value from that column.
Example dataframe:
col_1
0 [1, 2, 3, 3]
1 [2, 2, 8, 8, 7]
2 [3, 4]
And the expected dataframe is
col_1 col_2
0 [1, 2, 3, 3] [3]
1 [2, 2, 8, 8, 7] [2, 8]
2 [3, 4] [3, 4]
I tried to do
from statistics import mode
df['col_1'].apply(lambda x: mode(x))
But it is showing the most common list in that column.
I also tried to use pandas mode
function directly on that column, it also did not help. Is there any way to find the most common value(s)?
CodePudding user response:
Or just use multimode
from the statistics module.
df['col_2'] = df['col_1'].apply(lambda x: multimode(x))
col1 col2
0 [1, 2, 3, 3] [3]
1 [2, 2, 8, 8, 7] [2, 8]
2 [3, 4] [3, 4]
CodePudding user response:
Use Series.mode
- but it is slow:
df['new'] = df['col_1'].apply(lambda x: pd.Series(x).mode().tolist())
print (df)
col_1 new
0 [1, 2, 3, 3] [3]
1 [2, 2, 8, 8, 7] [2, 8]
2 [3, 4] [3, 4]
Or use statistics.multimode
if performance is important:
from statistics import multimode
df['col_2'] = df['col_1'].apply(multimode)
print (df)
col_1 col_2
0 [1, 2, 3, 3] [3]
1 [2, 2, 8, 8, 7] [2, 8]
2 [3, 4] [3, 4]
Performance:
#[3000 rows x 4 columns]
df = pd.concat([df] * 1000, ignore_index=True)
In [195]: %timeit (df['col_1'].explode().groupby(level=0).apply(lambda x: x.mode().tolist()))
537 ms ± 66.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [196]: %timeit df['col_1'].apply(lambda x: pd.Series(x).mode().tolist())
699 ms ± 77.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [197]: %timeit df['col_1'].apply(multimode)
13.5 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
CodePudding user response:
Using mode
per group:
df['col_2'] = (df['col_1']
.explode()
.groupby(level=0)
.apply(lambda x: x.mode().tolist())
)
output:
col_1 col_2
0 [1, 2, 3, 3] [3]
1 [2, 2, 8, 8, 7] [2, 8]
2 [3, 4] [3, 4]
CodePudding user response:
Try this..
from collections import Counter
col_1 = [[1, 2, 3, 3],[2, 2, 8, 8, 7],[3, 4]]
df = pd.DataFrame({'col_1':col_1})
def common(row):
c = Counter(row)
c = pd.Series(c)
return c[c==max(c)].index.values
df['col_2'] = df.col_1.map(common)
df
col_1 col_2
0 [1, 2, 3, 3] [3]
1 [2, 2, 8, 8, 7] [2, 8]
2 [3, 4] [3, 4]