I have a df
as following
import pandas as pd
df = pd.DataFrame(
{'number_1': ['1', '2', None, None, '5', '6', '7', '8'],
'fruit_1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
'name_1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
'number_2': [None, None, '3', None, None, None, None, None],
'fruit_2': [None, None, 'blueberry', None, None, None, None, None],
'name_2': [None, None, 'anthony', None, None, None, None, None],
'number_3': [None, None, '3', '4', None, None, None, None],
'fruit_3': [None, None, 'blueberry', 'strawberry', None, None, None, None],
'name_3': [None, None, 'anthony', 'terry', None, None, None, None],
}
)
Here what I'd like to do is:
- find columns which has the same item.
name_1
,name_2
,name_3
for example. - combine the columns to get rid of the
None
values.
The desired result is
number fruit name
0 1 apple tom
1 2 banana jerry
2 3 blueberry anthony
3 4 strawberry terry
4 5 watermelon paul
5 6 peach edward
6 7 orange reggie
7 8 lemon nicholas
Here is how I do it.
# Get the first column
merge_df = pd.DataFrame(df.iloc[:, 0])
merge_df.columns = [merge_df.columns[0].split('_')[0]]
item_list = [column_list[0].split('_')[0]]
column_list = df.columns.to_list()
for i in range(len(column_list)):
for j in range(i 1, len(column_list)):
first_item = column_list[i].split('_')[0]
second_item = column_list[j].split('_')[0]
# change series name
df_series = df.iloc[:, j]
df_series.name = second_item
if first_item != second_item and second_item not in item_list:
merge_df = pd.concat([merge_df, df_series], axis=1)
item_list.append(column_list[j].split('_')[0])
if first_item == second_item:
# combine df and series
if second_item in merge_df.columns:
merge_df = merge_df.assign(
**{f'{second_item}': merge_df[second_item].combine(df_series,
lambda x, y: x if x is not None else y)})
print(merge_df)
Problem is it is very slow if df
has multiple columns.
Anyone has an advice to optimize this?
Edit:
The accepted answer has given a perfect way to use a regex. Here I had a more complicated issue which is similar to this. I put it here instead of creating a new answer.
Here the df
is
import pandas as pd
df = pd.DataFrame(
{'number_C1_E1': ['1', '2', None, None, '5', '6', '7', '8'],
'fruit_C11_E1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
'name_C111_E1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
'number_C2_E2': [None, None, '3', None, None, None, None, None],
'fruit_C22_E2': [None, None, 'blueberry', None, None, None, None, None],
'name_C222_E2': [None, None, 'anthony', None, None, None, None, None],
'number_C3_E1': [None, None, '3', '4', None, None, None, None],
'fruit_C33_E1': [None, None, 'blueberry', 'strawberry', None, None, None, None],
'name_C333_E1': [None, None, 'anthony', 'terry', None, None, None, None],
}
)
Here the rule is: if a column removes _C{0~9}
or _C{0~9}{0~9}
or _C{0~9}{0~9}{0~9}
is equal to another column, these two columns can be combined. Let's take number_C1_E1
number_C2_E2
number_C3_E1
as an example, here number_C1_E1
and number_C3_E1
can be combined because they are both number_E1
after removing _C{0~9}
. In this way, the desired result is
number_E1 fruit_E1 name_E1 number_E2 fruit_E2 name_E2
0 1 apple tom None None None
1 2 banana jerry None None None
2 3 blueberry anthony 3 blueberry anthony
3 4 strawberry terry None None None
4 5 watermelon paul None None None
5 6 peach edward None None None
6 7 orange reggie None None None
7 8 lemon nicholas None None None
CodePudding user response:
extract
the first word of the column names with a regex and groupby.first
on columns:
out = df.groupby(df.columns.str.extract('([^_] )', expand=False),
axis=1, sort=False).first()
Output:
number fruit name
0 1 apple tom
1 2 banana jerry
2 3 blueberry anthony
3 4 strawberry terry
4 5 watermelon paul
5 6 peach edward
6 7 orange reggie
7 8 lemon nicholas
Second example: use the same logic with str.replace
to remove the internal part
# remove internal _xxx_
out = df.groupby(df.columns.str.replace(r'_[^_] (?=_)', '', regex=True),
axis=1, sort=False).first()
# remove second to last xxx
out = df.groupby(df.columns.str.replace(r'(_[^_] )(?=_[^_] $)', '', regex=True),
axis=1, sort=False).first()
Output:
number_E1 fruit_E1 name_E1 number_E2 fruit_E2 name_E2
0 1 apple tom None None None
1 2 banana jerry None None None
2 3 blueberry anthony 3 blueberry anthony
3 4 strawberry terry None None None
4 5 watermelon paul None None None
5 6 peach edward None None None
6 7 orange reggie None None None
7 8 lemon nicholas None None None