How to create a dataframe with aggregated categories?

I have a pandas dataframe (df) with the following fields:

id	name	category
01	Eddie	magician
01	Eddie	plumber
02	Martha	actress
03	Jeremy	dancer
03	Jeremy	actor

I want to create a dataframe (df2) like the following:

id	name	categories
01	Eddie	magician, plumber
02	Martha	actress
03	Jeremy	dancer, actor

First, I create df2 and add an extra column with the following commands:

df2 = df.groupby("id", as_index=False).count()
df2["categories"] = ""

(This also counts how many category rows each id has, which is useful for what I intend to do.)

Then, I use this loop:

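for i in df2.itertuples():
   for entries in df.itertuples():
      if i.id == entries.id:
         # Append to the string built so far instead of overwriting it
         current = df2.loc[i.Index, "categories"]
         df2.loc[i.Index, "categories"] = (current + ", " if current else "") + entries.category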

Using this code, I get the dataframe that I wanted. However, this implementation has several problems:

It doesn't look optimal: it loops over every row of df for every row of df2.
If there are repeated entries (such as another row with "Eddie" and "magician"), the entry for Eddie in df2 would have "magician, plumber, magician" in categories.

Therefore I would like to ask the community: is there a better way to do this?

Also keep in mind that this is my first question on this website!

Thanks in advance!

CodePudding user response:

You can group by the id and name columns and apply a function to the category column, like this:

import pandas as pd

data = {
  'id': ['01', '01', '02', '03', '03'],
  'name': ['Eddie', 'Eddie', 'Martha', 'Jeremy', 'Jeremy'],
  'category': ['magician', 'plumber', 'actress', 'dancer', 'actor']
}

df = pd.DataFrame(data)
df2 = df.groupby(['id', 'name'])['category'].apply(lambda x: ', '.join(x)).reset_index()
df2

Output:

   id    name           category
0  01   Eddie  magician, plumber
1  02  Martha            actress
2  03  Jeremy      dancer, actor
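
The question also mentions repeated entries and wanting a count per id, which the snippet above does not cover. One possible way to handle both is sketched below. It is only a sketch: the extra (Eddie, magician) row is added here to simulate the duplicate case, and the column names `categories` and `n_categories` are illustrative choices, not something from the original post.

```python
import pandas as pd

# Sample data from the question, plus one repeated (Eddie, magician) row
# added here to simulate the duplicate case.
df = pd.DataFrame({
    "id": ["01", "01", "01", "02", "03", "03"],
    "name": ["Eddie", "Eddie", "Eddie", "Martha", "Jeremy", "Jeremy"],
    "category": ["magician", "plumber", "magician", "actress", "dancer", "actor"],
})

df2 = (
    # Keep only the first occurrence of each (id, name, category) row
    df.drop_duplicates(["id", "name", "category"])
      .groupby(["id", "name"])
      .agg(
          categories=("category", ", ".join),   # joined in order of appearance
          n_categories=("category", "count"),   # distinct categories per person
      )
      .reset_index()
)
print(df2)
```

With this data, Eddie ends up with "magician, plumber" and a count of 2 even though "magician" appears twice in the input. If the count should include repeats, it would have to be computed before the `drop_duplicates` step.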