How to replace all row values according to the most frequent value?-CodePudding

Consider the following dataframe:

     a       b        c       d
0   2 PM    6 PM    4 PM    5 AM
1   2 PM    1 PM    10 PM   12 AM
2   10 AM   2 AM    9 AM    2 PM
3   8 AM    9 AM    10 AM   11 AM

How is it possible to identify the most frequent PM/AM value per row and replace the whole row with that?

Output:

     a       b        c       d
0   2 PM    6 PM    4 PM    5 PM
1   2 PM    1 PM    10 PM   12 PM
2   10 AM   2 AM    9 AM    2 AM
3   8 AM    9 AM    10 AM   11 AM

So in the first row we had 3 PMs and 1 AM, so everything becomes a PM and so on.

I've tried something like this to find the most frequent value per row, but I don't know how to successfully replace per row with that value.

df[['a', 'b', 'c', 'd']].applymap(lambda row: row.split(' ')[1]).mode(axis=1)[0]

CodePudding user response：

It is easier to do this in long form (1D) rather than in 2D. We can stack and str.split and expand into a DataFrame, then groupby transform to get the mode per group. Lastly join together and unstack to go back to wide format:

new_df = df.stack().str.split(' ', expand=True)
new_df[1] = new_df.groupby(level=0)[1].transform(lambda s: s.mode()[0])
new_df = new_df.apply(' '.join, axis=1).unstack(level=1)

new_df:

       a     b      c      d
0   2 PM  6 PM   4 PM   5 PM
1   2 PM  1 PM  10 PM  12 PM
2  10 AM  2 AM   9 AM   2 AM
3   8 AM  9 AM  10 AM  11 AM

Steps:

stack str.split to get 2 columns of data

new_df = df.stack().str.split(' ', expand=True)

      0   1
0 a   2  PM
  b   6  PM
  c   4  PM
  d   5  AM
1 a   2  PM
  b   1  PM
  c  10  PM
  d  12  AM
2 a  10  AM
  b   2  AM
  c   9  AM
  d   2  PM
3 a   8  AM
  b   9  AM
  c  10  AM
  d  11  AM

groupby transform relative to level=0 and replace column 1 with the mode for each group

new_df[1] = new_df.groupby(level=0)[1].transform(lambda s: s.mode()[0])

      0   1
0 a   2  PM
  b   6  PM
  c   4  PM
  d   5  AM
1 a   2  PM
  b   1  PM
  c  10  PM
  d  12  AM
2 a  10  AM
  b   2  AM
  c   9  AM
  d   2  PM
3 a   8  AM
  b   9  AM
  c  10  AM
  d  11  AM

join each row back together and unstack to get back to wide format:

new_df = new_df.apply(' '.join, axis=1).unstack(level=1)

       a     b      c      d
0   2 PM  6 PM   4 PM   5 PM
1   2 PM  1 PM  10 PM  12 PM
2  10 AM  2 AM   9 AM   2 AM
3   8 AM  9 AM  10 AM  11 AM

Setup used:

import pandas as pd

df = pd.DataFrame({
    'a': ['2 PM', '2 PM', '10 AM', '8 AM'],
    'b': ['6 PM', '1 PM', '2 AM', '9 AM'],
    'c': ['4 PM', '10 PM', '9 AM', '10 AM'],
    'd': ['5 AM', '12 AM', '2 PM', '11 AM']
})

CodePudding user response：

One liner:

df.apply(lambda x: x.str.extract('^(. ?) (?P<x>. ?)$').assign(x=lambda x: x['x'].mode()[0]).apply(list, axis=1).str.join(' '), axis=1)

Explanation:

# Call the specified function (if axis=0, call the func for each column; if axis=1 (which it does here), call the func for each row)
df.apply(
    # Custom function execute for each row:
    lambda x:
        x
            # Split each item of the row into two columns: one for the number and one for the AM/PM (?P<x> causes the column to be named "x" which is needed for later processing)
            .str.extract('^(. ?) (?P<x>. ?)$')
            # Set all values of the "x" column to the most common value
            .assign(x=lambda x:
                x['x'].mode()[0]
            )
            # str.extract created two columns - now join these two columns back together into one column with lists
            .apply(list, axis=1)
            # Join the lists together back into the original string format
            .str.join(' '),
    
    # Execute the above function for each row (as opposed to axis=0 which would execute for each column)
    axis=1
)