I'm looking for a method to merge rows that are duplicated on some key columns with pandas, concatenating the values of several columns into a new one. Here is the code I have so far:
import pandas as pd
df = pd.DataFrame([
    {"col1": 1, "col2": 2, "col3": 3, "col4": 4, "col5": 5},
    {"col1": 1, "col2": 2, "col3": 3, "col4": 6, "col5": 7},
    {"col1": 1, "col2": 2, "col3": 3, "col4": 8, "col5": 8},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 101},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 102},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 100},
])

def f(x, y, z):
    return list({x, y, z})
new_df_rows = []
# !!!
# should merge all rows sharing the same (col1, col2, col3) into one row,
# with a new field containing the set of all values from "col3", "col4" and "col5"
# the following works but relies on a lot of hacks
# !!!
df_duplicated = df[df.duplicated(["col1", "col2", "col3"], keep=False)]
df_duplicated_groupby = df_duplicated.groupby(["col1", "col2", "col3"])
group_names = df_duplicated_groupby.groups.keys()
for group_name in group_names:
    group = df_duplicated_groupby.get_group(group_name)
    print(group_name)
    print(group)
    new_col = list({x for l in [f(row[0], row[1], row[2]) for row in group[["col3", "col4", "col5"]].to_numpy()] for x in l})
    new_df_rows.append({
        "col1": group_name[0],
        "col2": group_name[1],
        "col3": group_name[2],
        "new_col": new_col,
    })
new_df = pd.DataFrame(new_df_rows)
print(new_df.to_string())
Expected results:
col1 col2 col3 new_col
0 1 2 3 [3, 4, 5, 6, 7, 8]
1 1 2 10 [10, 100, 101, 102]
Is there a more concise and/or faster method with pandas to achieve that?
CodePudding user response:
You can do this with groupby, apply and numpy.unique:
import numpy as np

new_df = (df.groupby([df.col1, df.col2, df.col3])[['col3', 'col4', 'col5']]
            .apply(lambda x: np.unique(x).tolist())
            .rename('new_col')
            .reset_index())
Result:
col1 col2 col3 new_col
0 1 2 3 [3, 4, 5, 6, 7, 8]
1 1 2 10 [10, 100, 101, 102]
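Note that np.unique also sorts the values, which is why the lists come out ordered, and that this groups every row rather than only the rows flagged by df.duplicated(..., keep=False) as in the question (here every row belongs to a duplicated group, so the result is the same). If you do want to merge only the duplicated groups, a minimal sketch of the same approach run on the filtered frame, assuming the df from the question:

# keep only rows duplicated on the key columns, then apply the same groupby
dup = df[df.duplicated(['col1', 'col2', 'col3'], keep=False)]
new_df = (dup.groupby([dup.col1, dup.col2, dup.col3])[['col3', 'col4', 'col5']]
             .apply(lambda g: np.unique(g).tolist())
             .rename('new_col')
             .reset_index())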
CodePudding user response:
First create newcol, holding each row's values from col3, col4 and col5 as a list:
df['newcol'] = df[['col3','col4','col5']].values.tolist()
col1 col2 col3 col4 col5 newcol
0 1 2 3 4 5 [3, 4, 5]
1 1 2 3 6 7 [3, 6, 7]
2 1 2 3 8 8 [3, 8, 8]
3 1 2 10 100 101 [10, 100, 101]
4 1 2 10 100 102 [10, 100, 102]
5 1 2 10 100 100 [10, 100, 100]
Then group by the first three columns and aggregate newcol with sum, which concatenates the lists:
newdf = df.groupby(['col1','col2','col3']).agg({'newcol': 'sum'}).reset_index()
Finally, remove the duplicates within each list:
newdf['newcol'] = newdf['newcol'].apply(lambda x:list(set(x)))
Output:
col1 col2 col3 newcol
0 1 2 3 [3, 4, 5, 6, 7, 8]
1 1 2 10 [10, 100, 101, 102]
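The three steps can also be chained into a single pipeline; a minimal sketch, assuming the same df as in the question (sorted(set(...)) replaces list(set(...)) so the order of the merged values is deterministic):

# build the list column, concatenate the lists per group via sum, then deduplicate
newdf = (df.assign(newcol=df[['col3', 'col4', 'col5']].values.tolist())
           .groupby(['col1', 'col2', 'col3'], as_index=False)['newcol']
           .sum())
newdf['newcol'] = newdf['newcol'].apply(lambda x: sorted(set(x)))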