I have a df
which as following
import pandas as pd
df = pd.DataFrame(
{'number_C1_E1': ['1', '2', None, None, '5', '6', '7', '8'],
'fruit_C11_E1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
'name_C111_E1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
'number_C2_E2': [None, None, '3', None, None, None, None, None],
'fruit_C22_E2': [None, None, 'blueberry', None, None, None, None, None],
'name_C222_E2': [None, None, 'anthony', None, None, None, None, None],
'number_C3_E1': [None, None, '3', '4', None, None, None, None],
'fruit_C33_E1': [None, None, 'blueberry', 'strawberry', None, None, None, None],
'name_C333_E1': [None, None, 'anthony', 'terry', None, None, None, None],
}
)
Here what I want to do is combine those columns and we have two rules:
- If a column removes
_C{0~9}
or_C{0~9}{0~9}
or_C{0~9}{0~9}{0~9}
is equal to another column, these two columns can be combined.
Let's take
number_C1_E1
number_C2_E2
number_C3_E1
as an example, herenumber_C1_E1
andnumber_C3_E1
can be combined because they are bothnumber_E1
afterremoving _C{0~9}
.
- The two combined columns should get rid of the
None
values.
The desired result is
number_C1_1_E1 fruit_C11_1_E1 name_C111_1_E1 number_C2_1_E2 fruit_C22_1_E2 name_C222_1_E2
0 1 apple tom None None None
1 2 banana jerry None None None
2 3 blueberry anthony 3 blueberry anthony
3 4 strawberry terry None None None
4 5 watermelon paul None None None
5 6 peach edward None None None
6 7 orange reggie None None None
7 8 lemon nicholas None None None
Anyone has a good solution?
CodePudding user response:
Use the same as your previous question, but also compute a renamer for your columns:
group = df.columns.str.replace(r'_C\d ', '', regex=True)
names = df.columns.to_series().groupby(group).first()
out = (df.groupby(group, axis=1, sort=False).first()
.rename(columns=names)
)
Output:
number_C1_E1 fruit_C11_E1 name_C111_E1 number_C2_E2 fruit_C22_E2 name_C222_E2
0 1 apple tom None None None
1 2 banana jerry None None None
2 3 blueberry anthony 3 blueberry anthony
3 4 strawberry terry None None None
4 5 watermelon paul None None None
5 6 peach edward None None None
6 7 orange reggie None None None
7 8 lemon nicholas None None None
CodePudding user response:
This builds on @mozway's idea, and tries to avoid a double groupby:
box = df.columns.str.split(r"C\d ", expand=True)
box = df.groupby(box, axis=1)
columns = [grp.columns[0] for _, grp in box]
frame = box.first()
frame.columns = columns
frame
fruit_C11_E1 fruit_C22_E2 name_C111_E1 name_C222_E2 number_C1_E1 number_C2_E2
0 apple None tom None 1 None
1 banana None jerry None 2 None
2 blueberry blueberry anthony anthony 3 3
3 strawberry None terry None 4 None
4 watermelon None paul None 5 None
5 peach None edward None 6 None
6 orange None reggie None 7 None
7 lemon None nicholas None 8 None