I'm looking for a method to merge rows that are duplicated on some key columns with pandas, concatenating the values of several columns into a new one. Here is the code I have so far:
import pandas as pd
df = pd.DataFrame([
    {"col1": 1, "col2": 2, "col3": 3, "col4": 4, "col5": 5},
    {"col1": 1, "col2": 2, "col3": 3, "col4": 6, "col5": 7},
    {"col1": 1, "col2": 2, "col3": 3, "col4": 8, "col5": 8},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 101},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 102},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 100},
])

def f(x, y, z):
    return list({x, y, z})
new_df_rows = []
# !!!
# should merge all rows sharing the same (col1, col2, col3) into one row,
# with a new field containing the set of all values from "col3", "col4" and "col5"
# the following works but relies on a lot of hacks
# !!!
df_duplicated = df[df.duplicated(["col1", "col2", "col3"], keep=False)]
df_duplicated_groupby = df_duplicated.groupby(["col1", "col2", "col3"])
group_names = df_duplicated_groupby.groups.keys()
for group_name in group_names:
    group = df_duplicated_groupby.get_group(group_name)
    print(group_name)
    print(group)
    new_col = list({x for l in [f(row[0], row[1], row[2]) for row in group[["col3", "col4", "col5"]].to_numpy()] for x in l})
    new_df_rows.append({
        "col1": group_name[0],
        "col2": group_name[1],
        "col3": group_name[2],
        "new_col": new_col,
    })
new_df = pd.DataFrame(new_df_rows)
print(new_df.to_string())
Expected results:
col1 col2 col3 new_col
0 1 2 3 [3, 4, 5, 6, 7, 8]
1 1 2 10 [10, 100, 101, 102]
Is there a more concise and/or faster method with pandas to achieve that?
CodePudding user response:
You can do this with groupby, apply and numpy.unique:
import numpy as np

new_df = (df.groupby([df.col1, df.col2, df.col3])[['col3', 'col4', 'col5']]
            .apply(lambda x: np.unique(x).tolist())
            .rename('new_col')
            .reset_index())
Result:
col1 col2 col3 new_col
0 1 2 3 [3, 4, 5, 6, 7, 8]
1 1 2 10 [10, 100, 101, 102]
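Note that np.unique also sorts the values, which is why the lists come out ordered, and that this groups every row rather than only the rows flagged by df.duplicated(..., keep=False) as in the question (here every row belongs to a duplicated group, so the result is the same). If you do want to merge only the duplicated groups, a minimal sketch of the same approach run on the filtered frame, assuming the df from the question:

# keep only rows duplicated on the key columns, then apply the same groupby
dup = df[df.duplicated(['col1', 'col2', 'col3'], keep=False)]
new_df = (dup.groupby([dup.col1, dup.col2, dup.col3])[['col3', 'col4', 'col5']]
             .apply(lambda g: np.unique(g).tolist())
             .rename('new_col')
             .reset_index())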
CodePudding user response:
First create newcol, holding each row's values from col3, col4 and col5 as a list:
df['newcol'] = df[['col3','col4','col5']].values.tolist()
col1 col2 col3 col4 col5 newcol
0 1 2 3 4 5 [3, 4, 5]
1 1 2 3 6 7 [3, 6, 7]
2 1 2 3 8 8 [3, 8, 8]
3 1 2 10 100 101 [10, 100, 101]
4 1 2 10 100 102 [10, 100, 102]
5 1 2 10 100 100 [10, 100, 100]
Then group by the first three columns and aggregate newcol with sum, which concatenates the lists:
newdf = df.groupby(['col1','col2','col3']).agg({'newcol': 'sum'}).reset_index()
Finally, remove the duplicates within each list:
newdf['newcol'] = newdf['newcol'].apply(lambda x:list(set(x)))
Output:
col1 col2 col3 newcol
0 1 2 3 [3, 4, 5, 6, 7, 8]
1 1 2 10 [10, 100, 101, 102]
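The three steps can also be chained into a single pipeline; a minimal sketch, assuming the same df as in the question (sorted(set(...)) replaces list(set(...)) so the order of the merged values is deterministic):

# build the list column, concatenate the lists per group via sum, then deduplicate
newdf = (df.assign(newcol=df[['col3', 'col4', 'col5']].values.tolist())
           .groupby(['col1', 'col2', 'col3'], as_index=False)['newcol']
           .sum())
newdf['newcol'] = newdf['newcol'].apply(lambda x: sorted(set(x)))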