I have a df as
name category dummy
USA fx,ft,fe 1
INDIA fx 13
I need to convert this as
name category_fx categoty_ft category_fe dummy
USA True True True 1
INDIA True False False 13
tried with series.explode() function but not getting this output.
CodePudding user response:
Use Series.str.get_dummies
by column category
with converting 0,1
to boolean by DataFrame.astype
and DataFrame.add_prefix
:
c = df.columns.difference(['category'], sort=False).tolist()
df = (df.set_index(c)['category']
.str.get_dummies(',')
.astype(bool)
.add_prefix('category_')
.reset_index())
print (df)
name category_fe category_ft category_fx
0 USA True True True
1 INDIA False False True
EDIT: If need replace one column by multiple columns you can use:
df1 = (df['category']
.str.get_dummies(',')
.astype(bool)
.add_prefix('category_'))
pos = df.columns.get_loc('category')
df = pd.concat([df.iloc[:, :pos], df1, df.iloc[:, pos 1:]], axis=1)
print (df)
name category_fe category_ft category_fx dummy
0 USA True True True 1
1 INDIA False False True 13
This solution is modifid for multiple columns:
print (df)
name category dummy category1
0 USA fx,ft,fe 1 a,f
1 INDIA fx 13 s,a
cols = ['category','category1']
dfs = [(df[c].str.get_dummies(',').astype(bool).add_prefix(f'{c}_')) for c in cols]
df = pd.concat([df, *dfs], axis=1).drop(cols, axis=1)
print (df)
name dummy category_fe category_ft category_fx category1_a \
0 USA 1 True True True True
1 INDIA 13 False False True True
category1_f category1_s
0 True False
1 False True
CodePudding user response:
You can use str.get_dummies
and astype(bool)
to convert your strings to new columns of booleans, then add_prefix
to change the column names, and finally join
:
df2 = (df.drop(columns='category)
.join(df['category']
.str.get_dummies(sep=',')
.astype(bool)
.add_prefix('category_')
)
)
or, for modification of the original dataframe:
df = df.join(df.pop('category')
.str.get_dummies(sep=',')
.astype(bool)
.add_prefix('category_'))
output:
name category_fe category_ft category_fx
0 USA True True True
1 INDIA False False True
generalization to more columns
assuming this input:
name category1 category2 dummy
0 USA fx,ft,fe a,b,c 1
1 INDIA fx d 13
cats = df.filter(like='category').columns
cols = list(df.columns.difference(cats))
(df
.set_index(cols)
.stack()
.str.get_dummies(sep=',')
.groupby(level=cols).max().astype(bool)
.reset_index()
)
output:
dummy name a b c d fe ft fx
0 1 USA True True True False True True True
1 13 INDIA False False False True False False True