Converting a dataframe stringcolumn into multiple columns and rearrange each column based on the lab-CodePudding

I want to convert a stringcolumn with multiple labels into separate columns for each label and rearrange the dataframe that identical labels are in the same column. For e.g.:

ID	Label
0	apple, tom, car
1	apple, car
2	tom, apple

ID	Label	0	1	2
0	apple, tom, car	apple	car	tom
1	apple, car	apple	car	None
2	tom, apple	apple	None	tom

df["Label"].str.split(',',3, expand=True)

0	1	2
apple	tom	car
apple	car	None
tom	apple	None

I know how to split the stringcolumn, but I can't really figure out how to sort the label columns, especially since the number of labels per sample is different.

CodePudding user response：

The goal of your program is not clear. If you are curious which elements are present in the different rows, then we can just get them all and stack the dataframe like such:

df = pd.DataFrame({'label': ['apple, banana, grape', 'apple, banana', 'banana, grape']})
final_df = df['label'].str.split(', ', expand=True).stack()
final_df.reset_index(drop=True, inplace=True)

>>> final_df
0     apple
1    banana
2     grape
3     apple
4    banana
5    banana
6     grape

At this point we can drop the duplicates or count the occurrence of each, depending on your use case.

CodePudding user response：

Try:

df = df.assign(xxx=df.Label.str.split(r"\s*,\s*")).explode("xxx")
df["Col"] = df.groupby("xxx").ngroup()
df = (
    df.set_index(["ID", "Label", "Col"])
    .unstack(2)
    .droplevel(0, axis=1)
    .reset_index()
)
df.columns.name = None
print(df)

Prints:

   ID            Label      0    1    2
0   0  apple, tom, car  apple  car  tom
1   1       apple, car  apple  car  NaN
2   2       tom, apple  apple  NaN  tom