Home > Enterprise >  Pandas: Merge multiple rows into one row
Pandas: Merge multiple rows into one row

Time:12-06

I have a dataframe like below:

     ID    TYPE      SN      Notes
0    01                      Lorem Ipsum
1    02    apple     aa11    Dummy text
2    02    banana    ab12    Dummy text
3    03    orange    ad04    Random text
4    04                      Latin words
5    05    apple     ac03    Randomised words
6    05    banana    ac04    Randomised words
7    05    orange    aa41    Randomised words
8    05    cherry    af12    Randomised words
9    06    apple     aa32    Dolorem Ipsum

There are rows with same ID and same values (e.g. Notes column) except TYPE and SN columns, which are sometimes empty and sometimes not.

I want to change my existing dataframe to group these rows into one row like below:

     ID   TYPE_1   TYPE_2   TYPE_3   TYPE_4   SN_1   SN_2   SN_3   SN_4   Count   Notes
0    01                                                                   0       Lorem Ipsum
1    02   apple    banana                     aa11   ab12                 2       Dummy text
2    03   orange                              ad04                        1       Random text
3    04                                                                   0       Latin words
4    05   apple    banana   orange   cherry   ac03   ac04   aa41   af12   4       Randomised words
5    06   apple                               aa32                        1       Dolorem Ipsum

I should group the dataframe by ID, but then what? There can be any number of rows with same ID in different dataframes so I can't know maximum number of rows with same ID and create these columns beforehand. How can I achieve this?

CodePudding user response:

Assuming the empty cells are NaN, you can use:

g = df.groupby('ID')
out = (df
 .assign(col=g.cumcount().add(1).astype(str),
         Count=g['TYPE'].transform('count')
        )
 .pivot(index=['ID', 'Notes', 'Count'], columns='col')
 .pipe(lambda d: d.set_axis(d.columns.map('_'.join), axis=1))
 .reset_index()
)

Output:

   ID             Notes  Count  TYPE_1  TYPE_2  TYPE_3  TYPE_4  SN_1  SN_2  SN_3  SN_4
0   1       Lorem Ipsum      0     NaN     NaN     NaN     NaN   NaN   NaN   NaN   NaN
1   2        Dummy text      2   apple  banana     NaN     NaN  aa11  ab12   NaN   NaN
2   3       Random text      1  orange     NaN     NaN     NaN  ad04   NaN   NaN   NaN
3   4       Latin words      0     NaN     NaN     NaN     NaN   NaN   NaN   NaN   NaN
4   5  Randomised words      4   apple  banana  orange  cherry  ac03  ac04  aa41  af12
5   6     Dolorem Ipsum      1   apple     NaN     NaN     NaN  aa32   NaN   NaN   NaN

CodePudding user response:

You can aggregate by '|' and then use str.split to get multiple columns and then concat the final result together:

out = (
    df.groupby("ID")
    .agg(
        TYPE=("TYPE", "|".join),
        SN=("SN", "|".join),
        Notes=("Notes", "first"),
        Count=("ID", "count"),
    )
    .reset_index()
)
types = (
    out["TYPE"]
    .str.split("|", expand=True)
    .fillna("")
    .rename(lambda x: f"TYPE_{x 1}", axis=1)
)
sns = (
    out["SN"]
    .str.split("|", expand=True)
    .fillna("")
    .rename(lambda x: f"SN_{x 1}", axis=1)
)
out = pd.concat([out.drop(["TYPE", "SN"], axis=1), types, sns], axis=1)

print(out):

   ID             Notes  Count  TYPE_1  TYPE_2  TYPE_3  TYPE_4  SN_1  SN_2  \
0  01       Lorem Ipsum      1                                               
1  02        Dummy text      2   apple  banana                  aa11  ab12   
2  03       Random text      1  orange                          ad04         
3  04       Latin words      1                                               
4  05  Randomised words      4   apple  banana  orange  cherry  ac03  ac04   
5  06     Dolorem Ipsum      1   apple                          aa32         

   SN_3  SN_4  
0              
1              
2              
3              
4  aa41  af12  
5     

     
  • Related