Using pandas to create new variables from a comma separated column-CodePudding

I have data as follows:

ID   tag_full
001  apple, banana, apple, banana, grape
002  berry, blue, banana, grape
003  melon, apple, grape

So all I want to do is, using Pandas, create the following data frame. This is basically taking every element of the tag_full variable and making them their own variables.

ID   apple  banana  grape  berry  blue  melon  
001      2       2      1      0     0      0
002      0       1      1      1     1      0
003      1       0      1      0     0      1

So all this is is a count of how many times that value (separated by comma) appeared in the string.

CodePudding user response：

Use Counter from collections module:

from collections import Counter

out = df.join(pd.DataFrame(df['tag_full'].str.split(', ').apply(Counter).to_list())
                .fillna(0).astype(int)).drop(columns=['tag_full'])
print(out)

# Output
    ID  apple  banana  grape  berry  blue  melon
0  001      2       2      1      0     0      0
1  002      0       1      1      1     1      0
2  003      1       0      1      0     0      1

Update

With vectorization without apply:

out = (
  df.assign(tag_full=df['tag_full'].str.split(', '), dummy=1).explode('tag_full')
    .pivot_table('dummy', 'ID', 'tag_full', aggfunc='sum', fill_value=0)
    .rename_axis(columns=None).reset_index()
)
print(out)

# Output
    ID  apple  banana  berry  blue  grape  melon
0  001      2       2      0     0      1      0
1  002      0       1      1     1      1      0
2  003      1       0      0     0      1      1

CodePudding user response：

import pandas as pd

newdf=pd.DataFrame({})
for ii in range(len(df)):
    list_=df.loc[ii,'tag_full'].split(',')
    tags=[]
    [tags.append(l1) for l1 in list_ if l1 not in tags]
    for tg in tags:
        newdf.loc[ii,f'{tg}']=list_.count(tg)
newdf=newdf.fillna(0)

I suppose you know how to read the table into pandas dataframedf.