Home > Software engineering >  How to create a dataframe with aggregated categories?
How to create a dataframe with aggregated categories?

Time:03-14

I have a pandas dataframe (df) with the following fields:

id name category
01 Eddie magician
01 Eddie plumber
02 Martha actress
03 Jeremy dancer
03 Jeremy actor

I want to create a dataframe (df2) like the following:

id name categories
01 Eddie magician, plumber
02 Martha actress
03 Jeremy dancer, actor

So, first of all, i create df2 and add an additional column by the following commands:

df2 = df.groupby("id", as_index= False).count()
df2["categories"] = str()

(this also counts the occurrences of various categories, which is something useful for what I intend to do)

Then, I use this loop:

for i in df2.itertuples():
   for entries in df.itertuples():
      if i.id == entries.id:
         df2["categories"].iloc[i.Index]  = entries.category
      else:
         pass

Using this code, I get the dataframe that I wanted. However, this implementation has several problems:

  1. Doesn't look optimal.
  2. If there are repeated entries (such as another column with "Eddie" and "magician"), the entry for Eddie in df2 would have "magician, plumber, magician" in categories.

Therefore I would like to ask the community: is there a better way to do this?

Also keep in mind that this is my first question on this website!

Thanks in advance!

CodePudding user response:

You can groupby your id and name columns and apply a function to the category one like this:

import pandas as pd

data = {
  'id': ['01', '01', '02', '03', '03'],
  'name': ['Eddie', 'Eddie', 'Martha', 'Jeremy', 'Jeremy'],
  'category': ['magician', 'plumber', 'actress', 'dancer', 'actor']
}

df = pd.DataFrame(data)
df2 = df.groupby(['id', 'name'])['category'].apply(lambda x: ', '.join(x)).reset_index()
df2

Output:

   id    name          category
0  01   Eddie  magician, plumber
1  02  Martha           actress
2  03  Jeremy      dancer, actor
  • Related