Home > Enterprise >  Aggregate and concatenate multiple columns
Aggregate and concatenate multiple columns

Time:12-01

I want to groupby my dataframe and concatenate the values/strings from the other columns together.

   Year Letter  Number  Note   Text
0  2022      a       1     8     hi
1  2022      b       1     7  hello
2  2022      a       1     6    bye
3  2022      b       3     5    joe

To this:

             Column
Year Letter              
2022 a            1|8|hi; 1|6|bye
     b            1|7|hello; 3|5|joe

I tried some things with groupby, apply() and agg() but I can't get it work:

df.groupby(['Year', 'Letter']).agg(lambda x: '|'.join(x))

Output:

                  Text
Year Letter           
2022 a          hi|bye
     b       hello|joe

CodePudding user response:

You can first join values per rows converted to strings by DataFrame.astype and DataFrame.agg and then aggregate join in GroupBy.agg:

df1 = (df.assign(Text= df[['Number','Note','Text']].astype(str).agg('|'.join, axis=1))
         .groupby(['Year', 'Letter'])['Text']
         .agg('; '.join)
         .to_frame())
print (df1)
                           Text
Year Letter                    
2022 a          1|8|hi; 1|6|bye
     b       1|7|hello; 3|5|joe

Or create custom lambda function in GroupBy.apply:

f = lambda x:  '; '.join('|'.join(y) for y in x.astype(str).to_numpy())
df1 = (df.groupby(['Year', 'Letter'])[['Number','Note','Text']].apply(f)
         .to_frame(name='Text')
         )
print (df1)
                           Text
Year Letter                    
2022 a          1|8|hi; 1|6|bye
     b       1|7|hello; 3|5|joe

If need join all columns without grouping columns:

grouped = ['Year','Letter']

df1 = (df.assign(Text= df[df.columns.difference(grouped, sort=False)]
                .astype(str).agg('|'.join, axis=1))
         .groupby(['Year', 'Letter'])['Text']
         .agg('; '.join)
         .to_frame())

grouped = ['Year','Letter']

f = lambda x:  '; '.join('|'.join(y) for y in x.astype(str).to_numpy())
df1 = (df.groupby(grouped)[df.columns.difference(grouped, sort=False)].apply(f)
        .to_frame(name='Text')
         )

CodePudding user response:

Using groupby.apply:

cols = ['Year', 'Letter']
(df.groupby(cols)
   .apply(lambda d: '; '.join(d.drop(columns=cols) # or slice the columns here
                               .astype(str)
                               .agg('|'.join, axis=1)))
   .to_frame(name='Column')
)

Output:

                         Column
Year Letter                     
2022 a          1|8|hi; 1|6|bye
     b       1|7|hello; 3|5|joe
  • Related