Pandas create column categories from rows of data-CodePudding

I have a dataframe like this,

id   Name
100  Apple
200  Orange
100  Banana

I want to create new columns based on the name, and mark them with 0 or 1.

Like this,

id  Name_Apple  Name_Orange  Name_Banana
100  1           0              1
200  0           1              0

Id 100 is marked 1 for apple and banana in the name and 0 for orange. Id 200 marked only 1 for orange. I am not actually sure how to do this. Appreciate any help.

Thanks.

CodePudding user response：

Version 1: If you want to mark the entries with only either 0 or 1:

Use pd.get_dummies() GroupBy.max(), as follows:

df_out = (pd.get_dummies(df)
            .groupby('id').max()
            .reset_index()
         )

Result:

print(df_out)

    id  Name_Apple  Name_Banana  Name_Orange
0  100           1            1            0
1  200           0            0            1

Version 2: If you want to mark the entries as count of occurrences (can be > 1)

If you want the entries to reflect the count of occurrences (can be multiple occurences and thus > 1), you can use pd.crosstab(), as follows:

pd.crosstab(df['id'], df['Name'])

Data Input

    id    Name
0  100   Apple
1  200  Orange
2  100  Banana         <=== multiple occurrences 
3  100  Banana         <=== multiple occurrences

Result

Name  Apple  Banana  Orange
id                         
100       1       2       0
200       0       0       1

Note that Banana for id==100 has a count of 2 instead of either 0 or 1.

You can also use pd.get_dummies(), as follows:

df_out = (pd.get_dummies(df)
            .groupby('id').sum()
            .reset_index()
         )

Here, we use GroupBy.sum() instead of GroupBy.max()

Result:

print(df_out)

    id  Name_Apple  Name_Banana  Name_Orange
0  100           1            2            0
1  200           0            0            1