How to merge dataframe faster?

I have a df as follows:

import pandas as pd

df = pd.DataFrame(
    {'number_1': ['1', '2', None, None, '5', '6', '7', '8'],
     'fruit_1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
     'name_1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
     'number_2': [None, None, '3', None, None, None, None, None],
     'fruit_2': [None, None, 'blueberry', None, None, None, None, None],
     'name_2': [None, None, 'anthony', None, None, None, None, None],
     'number_3': [None, None, '3', '4', None, None, None, None],
     'fruit_3': [None, None, 'blueberry', 'strawberry', None, None, None, None],
     'name_3': [None, None, 'anthony', 'terry', None, None, None, None],
     }
)

Here is what I'd like to do:

  1. Find the columns which share the same prefix, e.g. name_1, name_2, name_3 (see the short sketch after this list).
  2. Combine those columns to get rid of the None values.
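
For example, the grouping in step 1 can be sketched by mapping each column to its prefix (just to illustrate the intent):

# Map each column to the part of its name before the underscore
prefixes = {col: col.split('_')[0] for col in df.columns}
# {'number_1': 'number', 'fruit_1': 'fruit', 'name_1': 'name', ..., 'name_3': 'name'}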

The desired result is

  number       fruit      name
0      1       apple       tom
1      2      banana     jerry
2      3   blueberry   anthony
3      4  strawberry     terry
4      5  watermelon      paul
5      6       peach    edward
6      7      orange    reggie
7      8       lemon  nicholas

Here is how I do it.

# Get the list of column names and start with the first column
column_list = df.columns.to_list()
merge_df = pd.DataFrame(df.iloc[:, 0])
merge_df.columns = [merge_df.columns[0].split('_')[0]]
item_list = [column_list[0].split('_')[0]]

for i in range(len(column_list)):
    for j in range(i + 1, len(column_list)):
        first_item = column_list[i].split('_')[0]
        second_item = column_list[j].split('_')[0]
        # change series name
        df_series = df.iloc[:, j]
        df_series.name = second_item
        if first_item != second_item and second_item not in item_list:
            merge_df = pd.concat([merge_df, df_series], axis=1)
            item_list.append(column_list[j].split('_')[0])
        if first_item == second_item:
            # combine df and series
            if second_item in merge_df.columns:
                merge_df = merge_df.assign(
                    **{f'{second_item}': merge_df[second_item].combine(df_series,
                                                                       lambda x, y: x if x is not None else y)})

print(merge_df)

The problem is that this is very slow when the df has many columns.

Does anyone have advice on how to optimize this?


Edit:

The accepted answer gives a perfect way to do this with a regex. I have a more complicated issue which is similar to this one, so I am putting it here instead of creating a new question.

Here the df is

import pandas as pd

df = pd.DataFrame(
    {'number_C1_E1': ['1', '2', None, None, '5', '6', '7', '8'],
     'fruit_C11_E1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
     'name_C111_E1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
     'number_C2_E2': [None, None, '3', None, None, None, None, None],
     'fruit_C22_E2': [None, None, 'blueberry', None, None, None, None, None],
     'name_C222_E2': [None, None, 'anthony', None, None, None, None, None],
     'number_C3_E1': [None, None, '3', '4', None, None, None, None],
     'fruit_C33_E1': [None, None, 'blueberry', 'strawberry', None, None, None, None],
     'name_C333_E1': [None, None, 'anthony', 'terry', None, None, None, None],
     }
)

Here the rule is: if a column name becomes equal to another column name after removing _C{0~9}, _C{0~9}{0~9}, or _C{0~9}{0~9}{0~9}, the two columns can be combined. Take number_C1_E1, number_C2_E2 and number_C3_E1 as an example: number_C1_E1 and number_C3_E1 can be combined because they both become number_E1 after removing _C{0~9}. In this way, the desired result is

  number_E1    fruit_E1   name_E1 number_E2   fruit_E2  name_E2
0         1       apple       tom      None       None     None
1         2      banana     jerry      None       None     None
2         3   blueberry   anthony         3  blueberry  anthony
3         4  strawberry     terry      None       None     None
4         5  watermelon      paul      None       None     None
5         6       peach    edward      None       None     None
6         7      orange    reggie      None       None     None
7         8       lemon  nicholas      None       None     None
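
For reference, applying that rule gives the following merge keys (a small sketch; I assume the removable block is always _C followed by one to three digits, as in the rule above):

import re

# Strip the _C<1-3 digits> block that sits between two underscores
keys = {col: re.sub(r'_C\d{1,3}(?=_)', '', col) for col in df.columns}
# {'number_C1_E1': 'number_E1', 'fruit_C11_E1': 'fruit_E1', 'name_C111_E1': 'name_E1',
#  'number_C2_E2': 'number_E2', 'fruit_C22_E2': 'fruit_E2', 'name_C222_E2': 'name_E2',
#  'number_C3_E1': 'number_E1', 'fruit_C33_E1': 'fruit_E1', 'name_C333_E1': 'name_E1'}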

CodePudding user response:

Extract the first word of the column names with a regex and use groupby.first on the columns:

out = df.groupby(df.columns.str.extract('([^_]+)', expand=False),
                 axis=1, sort=False).first()

Output:

  number       fruit      name
0      1       apple       tom
1      2      banana     jerry
2      3   blueberry   anthony
3      4  strawberry     terry
4      5  watermelon      paul
5      6       peach    edward
6      7      orange    reggie
7      8       lemon  nicholas
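
Note: DataFrame.groupby(..., axis=1) is deprecated in newer pandas versions. If you want to avoid the deprecation warning, an equivalent sketch is to transpose, group the labels, and transpose back:

# Same grouping without groupby(axis=1): transpose, group the labels, transpose back
keys = df.columns.str.extract('([^_]+)', expand=False)
out = df.T.groupby(keys, sort=False).first().T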

Second example: use the same logic with str.replace to remove the internal part

# remove internal _xxx_
out = df.groupby(df.columns.str.replace(r'_[^_]+(?=_)', '', regex=True),
                 axis=1, sort=False).first()

# remove second to last xxx
out = df.groupby(df.columns.str.replace(r'(_[^_]+)(?=_[^_]+$)', '', regex=True),
                 axis=1, sort=False).first()

Output:

  number_E1    fruit_E1   name_E1 number_E2   fruit_E2  name_E2
0         1       apple       tom      None       None     None
1         2      banana     jerry      None       None     None
2         3   blueberry   anthony         3  blueberry  anthony
3         4  strawberry     terry      None       None     None
4         5  watermelon      paul      None       None     None
5         6       peach    edward      None       None     None
6         7      orange    reggie      None       None     None
7         8       lemon  nicholas      None       None     None