I have data as follows:
ID tag_full
001 apple, banana, apple, banana, grape
002 berry, blue, banana, grape
003 melon, apple, grape
So all I want to do is, using Pandas, create the following data frame. This is basically taking every element of the tag_full
variable and making them their own variables.
ID apple banana grape berry blue melon
001 2 2 1 0 0 0
002 0 1 1 1 1 0
003 1 0 1 0 0 1
So all this is is a count of how many times that value (separated by comma) appeared in the string.
CodePudding user response:
Use Counter
from collections
module:
from collections import Counter
out = df.join(pd.DataFrame(df['tag_full'].str.split(', ').apply(Counter).to_list())
.fillna(0).astype(int)).drop(columns=['tag_full'])
print(out)
# Output
ID apple banana grape berry blue melon
0 001 2 2 1 0 0 0
1 002 0 1 1 1 1 0
2 003 1 0 1 0 0 1
Update
With vectorization without apply
:
out = (
df.assign(tag_full=df['tag_full'].str.split(', '), dummy=1).explode('tag_full')
.pivot_table('dummy', 'ID', 'tag_full', aggfunc='sum', fill_value=0)
.rename_axis(columns=None).reset_index()
)
print(out)
# Output
ID apple banana berry blue grape melon
0 001 2 2 0 0 1 0
1 002 0 1 1 1 1 0
2 003 1 0 0 0 1 1
CodePudding user response:
import pandas as pd
newdf=pd.DataFrame({})
for ii in range(len(df)):
list_=df.loc[ii,'tag_full'].split(',')
tags=[]
[tags.append(l1) for l1 in list_ if l1 not in tags]
for tg in tags:
newdf.loc[ii,f'{tg}']=list_.count(tg)
newdf=newdf.fillna(0)
I suppose you know how to read the table into pandas dataframedf
.