Consider the following dataframe:
a b c d
0 2 PM 6 PM 4 PM 5 AM
1 2 PM 1 PM 10 PM 12 AM
2 10 AM 2 AM 9 AM 2 PM
3 8 AM 9 AM 10 AM 11 AM
How is it possible to identify the most frequent PM/AM value per row and replace the whole row with that?
Output:
a b c d
0 2 PM 6 PM 4 PM 5 PM
1 2 PM 1 PM 10 PM 12 PM
2 10 AM 2 AM 9 AM 2 AM
3 8 AM 9 AM 10 AM 11 AM
So in the first row we had 3 PMs and 1 AM, so everything becomes a PM and so on.
I've tried something like this to find the most frequent value per row, but I don't know how to successfully replace per row with that value.
df[['a', 'b', 'c', 'd']].applymap(lambda row: row.split(' ')[1]).mode(axis=1)[0]
CodePudding user response:
It is easier to do this in long form (1D) rather than in 2D. We can stack
and str.split
and expand into a DataFrame, then groupby transform
to get the mode
per group. Lastly join
together and unstack
to go back to wide format:
new_df = df.stack().str.split(' ', expand=True)
new_df[1] = new_df.groupby(level=0)[1].transform(lambda s: s.mode()[0])
new_df = new_df.apply(' '.join, axis=1).unstack(level=1)
new_df
:
a b c d
0 2 PM 6 PM 4 PM 5 PM
1 2 PM 1 PM 10 PM 12 PM
2 10 AM 2 AM 9 AM 2 AM
3 8 AM 9 AM 10 AM 11 AM
Steps:
new_df = df.stack().str.split(' ', expand=True)
0 1
0 a 2 PM
b 6 PM
c 4 PM
d 5 AM
1 a 2 PM
b 1 PM
c 10 PM
d 12 AM
2 a 10 AM
b 2 AM
c 9 AM
d 2 PM
3 a 8 AM
b 9 AM
c 10 AM
d 11 AM
groupby transform
relative tolevel=0
and replace column 1 with themode
for each group
new_df[1] = new_df.groupby(level=0)[1].transform(lambda s: s.mode()[0])
0 1
0 a 2 PM
b 6 PM
c 4 PM
d 5 AM
1 a 2 PM
b 1 PM
c 10 PM
d 12 AM
2 a 10 AM
b 2 AM
c 9 AM
d 2 PM
3 a 8 AM
b 9 AM
c 10 AM
d 11 AM
join
each row back together andunstack
to get back to wide format:
new_df = new_df.apply(' '.join, axis=1).unstack(level=1)
a b c d
0 2 PM 6 PM 4 PM 5 PM
1 2 PM 1 PM 10 PM 12 PM
2 10 AM 2 AM 9 AM 2 AM
3 8 AM 9 AM 10 AM 11 AM
Setup used:
import pandas as pd
df = pd.DataFrame({
'a': ['2 PM', '2 PM', '10 AM', '8 AM'],
'b': ['6 PM', '1 PM', '2 AM', '9 AM'],
'c': ['4 PM', '10 PM', '9 AM', '10 AM'],
'd': ['5 AM', '12 AM', '2 PM', '11 AM']
})
CodePudding user response:
One liner:
df.apply(lambda x: x.str.extract('^(. ?) (?P<x>. ?)$').assign(x=lambda x: x['x'].mode()[0]).apply(list, axis=1).str.join(' '), axis=1)
Explanation:
# Call the specified function (if axis=0, call the func for each column; if axis=1 (which it does here), call the func for each row)
df.apply(
# Custom function execute for each row:
lambda x:
x
# Split each item of the row into two columns: one for the number and one for the AM/PM (?P<x> causes the column to be named "x" which is needed for later processing)
.str.extract('^(. ?) (?P<x>. ?)$')
# Set all values of the "x" column to the most common value
.assign(x=lambda x:
x['x'].mode()[0]
)
# str.extract created two columns - now join these two columns back together into one column with lists
.apply(list, axis=1)
# Join the lists together back into the original string format
.str.join(' '),
# Execute the above function for each row (as opposed to axis=0 which would execute for each column)
axis=1
)