Splitting a column in a data frame over several columns

Time:07-07

I'm loading a CSV file that has two columns, date and tags. The tags column contains a comma-separated list of tags, like so:

date,tags
2021-09-08,"#foo, #bar"
2021-09-10,"#bar"
2021-09-15,"#bar, #baz"
2021-09-22,"#bar"

Loading it with pandas produces a data frame where all the tags end up in a single column:

        date            tags
0 2021-09-08      #foo, #bar
1 2021-09-10            #bar
2 2021-09-15      #bar, #baz
3 2021-09-22            #bar

So how do I turn this into a data frame in which each tag gets its own boolean column, like this:

        date    foo   bar    baz
0 2021-09-08  True   True  False
1 2021-09-10  False  True  False
2 2021-09-15  False  True   True
3 2021-09-22  False  True  False

CodePudding user response:

Use Series.str.get_dummies, convert the resulting 0/1 values to booleans with astype(bool), and attach the result to the date column with DataFrame.join:

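# One boolean column per tag, joined back onto the date column.
# Assigned to a new name so the original df (with its tags column) stays available.
result = df[['date']].join(df['tags'].str.get_dummies(', ').astype(bool))
print(result)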
         date  #bar   #baz   #foo
0  2021-09-08  True  False   True
1  2021-09-10  True  False  False
2  2021-09-15  True   True  False
3  2021-09-22  True  False  False
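
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It assumes the CSV content is exactly the sample from the question and reads it from an in-memory string (the names csv_text and result are only illustrative), so no file on disk is needed:

```python
import io

import pandas as pd

# Sample data copied from the question; read from a string instead of a file.
csv_text = '''date,tags
2021-09-08,"#foo, #bar"
2021-09-10,"#bar"
2021-09-15,"#bar, #baz"
2021-09-22,"#bar"
'''
df = pd.read_csv(io.StringIO(csv_text))

# Split each tags string on ', ' into 0/1 indicator columns, then cast to bool.
flags = df['tags'].str.get_dummies(', ').astype(bool)

# Attach the indicator columns to the date column (aligned on the index).
result = df[['date']].join(flags)
print(result)
```

Note that the separator passed to get_dummies is ', ' (comma plus space), which matches the sample data. If the real file is inconsistent about the space after the comma, the tags would probably need to be normalized first.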

If you need to remove the #, add rename with a custom function:

# Strip the leading '#' from each tag column name.
f = lambda x: x.lstrip('#')
result = df[['date']].join(df['tags'].str.get_dummies(', ').astype(bool).rename(columns=f))
print(result)
         date   bar    baz    foo
0  2021-09-08  True  False   True
1  2021-09-10  True  False  False
2  2021-09-15  True   True  False
3  2021-09-22  True  False  False
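
The output above lists the tag columns alphabetically (bar, baz, foo), whereas the table in the question has them as foo, bar, baz. If that order matters, one possible approach, sketched here under the assumption that "order of first appearance in the data" is what is wanted, is to build the column order from the tags themselves and reindex. The names csv_text, flags, order and result are illustrative only:

```python
import io

import pandas as pd

# Sample data copied from the question.
csv_text = '''date,tags
2021-09-08,"#foo, #bar"
2021-09-10,"#bar"
2021-09-15,"#bar, #baz"
2021-09-22,"#bar"
'''
df = pd.read_csv(io.StringIO(csv_text))

# Boolean indicator columns with the leading '#' stripped from the names.
flags = (
    df['tags']
    .str.get_dummies(', ')
    .astype(bool)
    .rename(columns=lambda name: name.lstrip('#'))
)

# Tags in order of first appearance: split each row, flatten, drop repeats.
order = (
    df['tags']
    .str.split(', ')
    .explode()
    .str.lstrip('#')
    .drop_duplicates()
    .tolist()
)

# Select the indicator columns in that order and join them to the dates.
result = df[['date']].join(flags[order])
print(result)
```

With the sample data this should give the columns date, foo, bar, baz, matching the layout asked for in the question.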