I currently have a m
x n
dataframe in pandas that has an id
column, where there can be multiple rows with the same id
. I need to merge all the rows with the same id
down into a single row that is longer than n
, say length a
.
I have been using df.groupby("id").apply(custom_function)
, where def custom_function(...)
takes a dataframe (all rows of the same id
) and merges it down into a single row of length a>n
. The final output will have no duplicate id
values.
However, this has been very inefficient, and I cannot figure out how to make this faster.
Some columns within each id
group are constant, while other "target" columns will vary between the rows in the group. custom_function
essentially merges rows by leaving the constant column values unchanged, while adding additional columns for the ones that vary, using a duration
column value to iterate and create those new columns.
Example. Before:
id const_field_1 const_field_2 duration target_field_1 target_field_2
0 5321 a b 1min d1 x1
1 5321 a b 2min d2 x2
2 5321 a b 5min d3 x3
...
After:
id const_field_1 const_field_2 target_field_1_1min target_field_1_2min target_field_1_5min ...
0 5321 a b d1 d2 d3 ...
...
I'm not sure if any of the optimized pandas agg()
or transform()
functions work here.
CodePudding user response:
You can use pivot_table
to accomplish this however you will get duplicates for the const_field
columns that need to be removed afterward.
df = df.pivot_table(index='id',
columns='duration',
values=['const_field_1', 'const_field_2', 'target_field_1', 'target_field_2'],
aggfunc='first')
Intermediate result:
const_field_1 const_field_2 target_field_1 target_field_2
duration 1min 2min 5min 1min 2min 5min 1min 2min 5min 1min 2min 5min
id
5321 a a a b b b d1 d2 d3 x1 x2 x3
Post-processing for the column names and removing duplicates:
const_cols = ['const_field_1', 'const_field_2']
df.columns = ['_'.join(col) if col[0] not in const_cols else col[0] for col in df.columns.values]
df = df.loc[:, ~df.columns.duplicated()]
df = df.reset_index()
Result:
id const_field_1 const_field_2 target_field_1_1min target_field_1_2min target_field_1_5min target_field_2_1min target_field_2_2min target_field_2_5min
0 5321 a b d1 d2 d3 x1 x2 x3