More efficient way for creating new dataframe using groupby() in pandas?

Time:08-19

I currently have an m x n dataframe in pandas with an id column, where multiple rows can share the same id. I need to merge all rows with the same id into a single, wider row of some length a > n.

I have been using df.groupby("id").apply(custom_function), where custom_function takes a dataframe (all rows sharing one id) and merges it into a single row of length a > n. The final output has no duplicate id values.

However, this approach is very slow, and I cannot figure out how to speed it up.

Some columns within each id group are constant, while other "target" columns vary between the rows in the group. custom_function keeps the constant columns unchanged and spreads each varying column into additional columns, one per value of a duration column.

Example. Before:

   id    const_field_1   const_field_2  duration  target_field_1  target_field_2
0  5321  a               b              1min      d1              x1
1  5321  a               b              2min      d2              x2
2  5321  a               b              5min      d3              x3
  ...

After:

   id    const_field_1   const_field_2  target_field_1_1min  target_field_1_2min  target_field_1_5min ...
0  5321  a               b              d1                   d2                   d3                  ...
...
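To make the question concrete, here is a minimal sketch of the slow groupby-apply approach described above, using the sample data from the example. The body of custom_function is an assumption (the question does not show it), reconstructed from the description:

```python
import pandas as pd

# Sample data matching the example above
df = pd.DataFrame({
    "id": [5321, 5321, 5321],
    "const_field_1": ["a", "a", "a"],
    "const_field_2": ["b", "b", "b"],
    "duration": ["1min", "2min", "5min"],
    "target_field_1": ["d1", "d2", "d3"],
    "target_field_2": ["x1", "x2", "x3"],
})

def custom_function(group):
    # Keep the constant columns from the first row of the group
    row = group.iloc[0][["const_field_1", "const_field_2"]].copy()
    # Spread each target column into one new column per duration value
    for target in ["target_field_1", "target_field_2"]:
        for _, r in group.iterrows():
            row[f"{target}_{r['duration']}"] = r[target]
    return row

result = df.groupby("id").apply(custom_function).reset_index()
```

The per-group Python loop (iterrows inside apply) is what makes this slow on large frames.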

I'm not sure if any of the optimized pandas agg() or transform() functions work here.

CodePudding user response:

You can use pivot_table to accomplish this; however, the const_field columns will be duplicated once per duration value and need to be removed afterward.

df = df.pivot_table(index='id',
                    columns='duration',
                    values=['const_field_1', 'const_field_2', 'target_field_1', 'target_field_2'],
                    aggfunc='first')

Intermediate result:

          const_field_1   const_field_2   target_field_1  target_field_2          
duration  1min 2min 5min  1min 2min 5min  1min 2min 5min  1min 2min 5min
id                                                                                                        
5321         a    a    a     b    b    b    d1   d2   d3    x1   x2   x3

Post-processing for the column names and removing duplicates:

const_cols = ['const_field_1', 'const_field_2']
# Flatten the MultiIndex columns, keeping the constant columns' original names
df.columns = ['_'.join(col) if col[0] not in const_cols else col[0]
              for col in df.columns.values]
# Drop the duplicated constant columns and restore id as a regular column
df = df.loc[:, ~df.columns.duplicated()]
df = df.reset_index()

Result:

     id const_field_1 const_field_2 target_field_1_1min target_field_1_2min target_field_1_5min target_field_2_1min target_field_2_2min target_field_2_5min
0  5321             a             b                  d1                  d2                  d3                  x1                  x2                  x3
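Putting the two steps together as one self-contained script on the question's sample data (a minimal sketch; the sample DataFrame construction is an assumption based on the tables above):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [5321, 5321, 5321],
    "const_field_1": ["a", "a", "a"],
    "const_field_2": ["b", "b", "b"],
    "duration": ["1min", "2min", "5min"],
    "target_field_1": ["d1", "d2", "d3"],
    "target_field_2": ["x1", "x2", "x3"],
})

# Pivot: one row per id, one column per (field, duration) pair
out = df.pivot_table(
    index="id",
    columns="duration",
    values=["const_field_1", "const_field_2",
            "target_field_1", "target_field_2"],
    aggfunc="first",
)

# Flatten the MultiIndex columns and drop the duplicated constant columns
const_cols = ["const_field_1", "const_field_2"]
out.columns = ["_".join(col) if col[0] not in const_cols else col[0]
               for col in out.columns.values]
out = out.loc[:, ~out.columns.duplicated()].reset_index()
```

Since pivot_table is a single vectorized reshape, this avoids the per-group Python loop entirely.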