How to Concatenate Strings Using GroupBy in Big Data Frames

I have a data frame like this:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'store': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'quarter': [1, 1, 2, 2, 1, 1, 2, 2],
                   'employee': ['Andy', 'Bob', 'Chad', 'Diane',
                                'Elana', 'Frank', 'George', 'Hank']})

I want to collapse the repeated rows by concatenating the values in the employee column. The only way I can think of is this:

#group by store and quarter, then concatenate employee strings
df.groupby(['store', 'quarter'], as_index=False).agg({'employee': ' '.join})

  store  quarter     employee
0     A        1     Andy Bob
1     A        2   Chad Diane
2     B        1  Elana Frank
3     B        2  George Hank

This is a minimal reproducible example, but my real data frame has many columns. Do I need to list every column name in the aggregation after groupby, or is there another way to do this?

CodePudding user response:

You can do this without listing the column names.

Take the df below as an example:

In [1011]: df
Out[1011]: 
  store  quarter employee col1
0     A        1     Andy  abc
1     A        1      Bob  abc
2     A        2     Chad  abc
3     A        2    Diane  abc
4     B        1    Elana  abc
5     B        1    Frank  abc
6     B        2   George  abc
7     B        2     Hank  abc

Use:

In [1012]: df = df.groupby(['store', 'quarter'], as_index=False).agg(' '.join)

In [1013]: df
Out[1013]: 
  store  quarter     employee     col1
0     A        1     Andy Bob  abc abc
1     A        2   Chad Diane  abc abc
2     B        1  Elana Frank  abc abc
3     B        2  George Hank  abc abc

This runs agg on every column except the ones passed to groupby. Note that ' '.join only accepts strings, so this assumes all the remaining columns hold string values.
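If the real frame also has numeric columns, or you don't want repeated values such as "abc abc", one option is to build the aggregation mapping from the column dtypes instead of typing the names by hand. This is only a sketch under assumptions: the `sales` column, the `join_unique` helper, and the choice of `'sum'` for numeric columns are all made up for illustration and are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'quarter': [1, 1, 1, 1],
    'employee': ['Andy', 'Bob', 'Elana', 'Frank'],
    'col1': ['abc', 'abc', 'abc', 'abc'],
    'sales': [10, 20, 30, 40],  # hypothetical numeric column
})

keys = ['store', 'quarter']

def join_unique(values):
    # Hypothetical helper: join strings in order of first appearance,
    # dropping repeats so 'abc', 'abc' becomes 'abc' rather than 'abc abc'.
    return ' '.join(dict.fromkeys(values))

# Choose an aggregation per column from its dtype:
# numeric columns are summed (an assumption), everything else is joined.
agg_map = {
    col: 'sum' if pd.api.types.is_numeric_dtype(df[col]) else join_unique
    for col in df.columns
    if col not in keys
}

result = df.groupby(keys, as_index=False).agg(agg_map)
print(result)
```

Swap `'sum'` for `'first'`, `'max'`, or whatever fits your data. The point is only that the mapping is generated from `df.columns`, so it scales to any number of columns.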

CodePudding user response:

This will give you the result you want for the employee column:

df = pd.DataFrame({'store': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'quarter': [1, 1, 2, 2, 1, 1, 2, 2],
                   'employee': ['Andy', 'Bob', 'Chad', 'Diane',
                                'Elana', 'Frank', 'George', 'Hank']})
# join the employee strings within each (store, quarter) group and name the result 'new'
df = df.groupby(['store', 'quarter'])['employee'].agg(' '.join).reset_index(name='new')
df
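The same single-column result can also be written with pandas named aggregation, which sets the output column name inside agg itself. A minimal sketch on the question's data, where the variable name `out` is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'store': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'quarter': [1, 1, 2, 2, 1, 1, 2, 2],
                   'employee': ['Andy', 'Bob', 'Chad', 'Diane',
                                'Elana', 'Frank', 'George', 'Hank']})

# Named aggregation: output column 'new' is built from 'employee' using ' '.join
out = df.groupby(['store', 'quarter'], as_index=False).agg(new=('employee', ' '.join))
print(out)
```

This keeps store and quarter as ordinary columns, so no reset_index call is needed.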