How to merge multiple columns of a dataframe using regex？-CodePudding

I have a df which as following

import pandas as pd

df = pd.DataFrame(
    {'number_C1_E1': ['1', '2', None, None, '5', '6', '7', '8'],
     'fruit_C11_E1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
     'name_C111_E1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
     'number_C2_E2': [None, None, '3', None, None, None, None, None],
     'fruit_C22_E2': [None, None, 'blueberry', None, None, None, None, None],
     'name_C222_E2': [None, None, 'anthony', None, None, None, None, None],
     'number_C3_E1': [None, None, '3', '4', None, None, None, None],
     'fruit_C33_E1': [None, None, 'blueberry', 'strawberry', None, None, None, None],
     'name_C333_E1': [None, None, 'anthony', 'terry', None, None, None, None],
     }
)

Here what I want to do is combine those columns and we have two rules:

If a column removes _C{0~9} or _C{0~9}{0~9} or _C{0~9}{0~9}{0~9} is equal to another column, these two columns can be combined.

Let's take number_C1_E1 number_C2_E2 number_C3_E1 as an example, here number_C1_E1 and number_C3_E1 can be combined because they are both number_E1 after removing _C{0~9}.

The two combined columns should get rid of the None values.

The desired result is

  number_C1_1_E1 fruit_C11_1_E1 name_C111_1_E1 number_C2_1_E2 fruit_C22_1_E2 name_C222_1_E2
0              1          apple            tom           None           None           None
1              2         banana          jerry           None           None           None
2              3      blueberry        anthony              3      blueberry        anthony
3              4     strawberry          terry           None           None           None
4              5     watermelon           paul           None           None           None
5              6          peach         edward           None           None           None
6              7         orange         reggie           None           None           None
7              8          lemon       nicholas           None           None           None

Anyone has a good solution?

CodePudding user response：

Use the same as your previous question, but also compute a renamer for your columns:

group = df.columns.str.replace(r'_C\d ', '', regex=True)

names = df.columns.to_series().groupby(group).first()

out = (df.groupby(group, axis=1, sort=False).first()
         .rename(columns=names)
       )

Output:

  number_C1_E1 fruit_C11_E1 name_C111_E1 number_C2_E2 fruit_C22_E2 name_C222_E2
0            1        apple          tom         None         None         None
1            2       banana        jerry         None         None         None
2            3    blueberry      anthony            3    blueberry      anthony
3            4   strawberry        terry         None         None         None
4            5   watermelon         paul         None         None         None
5            6        peach       edward         None         None         None
6            7       orange       reggie         None         None         None
7            8        lemon     nicholas         None         None         None

CodePudding user response：

This builds on @mozway's idea, and tries to avoid a double groupby:

box = df.columns.str.split(r"C\d ", expand=True)
box = df.groupby(box, axis=1)
columns = [grp.columns[0] for _, grp in box]
frame = box.first()
frame.columns = columns
frame
  fruit_C11_E1 fruit_C22_E2 name_C111_E1 name_C222_E2 number_C1_E1 number_C2_E2
0        apple         None          tom         None            1         None
1       banana         None        jerry         None            2         None
2    blueberry    blueberry      anthony      anthony            3            3
3   strawberry         None        terry         None            4         None
4   watermelon         None         paul         None            5         None
5        peach         None       edward         None            6         None
6       orange         None       reggie         None            7         None
7        lemon         None     nicholas         None            8         None