I have a pandas dataframe (df) with the following fields:
id | name | category |
---|---|---|
01 | Eddie | magician |
01 | Eddie | plumber |
02 | Martha | actress |
03 | Jeremy | dancer |
03 | Jeremy | actor |
I want to create a dataframe (df2) like the following:
id | name | categories |
---|---|---|
01 | Eddie | magician, plumber |
02 | Martha | actress |
03 | Jeremy | dancer, actor |
So, first of all, i create df2 and add an additional column by the following commands:
df2 = df.groupby("id", as_index= False).count()
df2["categories"] = str()
(this also counts the occurrences of various categories, which is something useful for what I intend to do)
Then, I use this loop:
for i in df2.itertuples():
for entries in df.itertuples():
if i.id == entries.id:
df2["categories"].iloc[i.Index] = entries.category
else:
pass
Using this code, I get the dataframe that I wanted. However, this implementation has several problems:
- Doesn't look optimal.
- If there are repeated entries (such as another column with "Eddie" and "magician"), the entry for Eddie in df2 would have "magician, plumber, magician" in categories.
Therefore I would like to ask the community: is there a better way to do this?
Also keep in mind that this is my first question on this website!
Thanks in advance!
CodePudding user response:
You can groupby your id
and name
columns and apply a function to the category
one like this:
import pandas as pd
data = {
'id': ['01', '01', '02', '03', '03'],
'name': ['Eddie', 'Eddie', 'Martha', 'Jeremy', 'Jeremy'],
'category': ['magician', 'plumber', 'actress', 'dancer', 'actor']
}
df = pd.DataFrame(data)
df2 = df.groupby(['id', 'name'])['category'].apply(lambda x: ', '.join(x)).reset_index()
df2
Output:
id name category
0 01 Eddie magician, plumber
1 02 Martha actress
2 03 Jeremy dancer, actor