Home > database >  Converting a dataframe stringcolumn into multiple columns and rearrange each column based on the lab
Converting a dataframe stringcolumn into multiple columns and rearrange each column based on the lab

Time:04-16

I want to convert a stringcolumn with multiple labels into separate columns for each label and rearrange the dataframe that identical labels are in the same column. For e.g.:

ID Label
0 apple, tom, car
1 apple, car
2 tom, apple

to

ID Label 0 1 2
0 apple, tom, car apple car tom
1 apple, car apple car None
2 tom, apple apple None tom
df["Label"].str.split(',',3, expand=True)
0 1 2
apple tom car
apple car None
tom apple None

I know how to split the stringcolumn, but I can't really figure out how to sort the label columns, especially since the number of labels per sample is different.

CodePudding user response:

The goal of your program is not clear. If you are curious which elements are present in the different rows, then we can just get them all and stack the dataframe like such:

df = pd.DataFrame({'label': ['apple, banana, grape', 'apple, banana', 'banana, grape']})
final_df = df['label'].str.split(', ', expand=True).stack()
final_df.reset_index(drop=True, inplace=True)
>>> final_df
0     apple
1    banana
2     grape
3     apple
4    banana
5    banana
6     grape

At this point we can drop the duplicates or count the occurrence of each, depending on your use case.

CodePudding user response:

Try:

df = df.assign(xxx=df.Label.str.split(r"\s*,\s*")).explode("xxx")
df["Col"] = df.groupby("xxx").ngroup()
df = (
    df.set_index(["ID", "Label", "Col"])
    .unstack(2)
    .droplevel(0, axis=1)
    .reset_index()
)
df.columns.name = None
print(df)

Prints:

   ID            Label      0    1    2
0   0  apple, tom, car  apple  car  tom
1   1       apple, car  apple  car  NaN
2   2       tom, apple  apple  NaN  tom
  • Related