I want to convert a stringcolumn with multiple labels into separate columns for each label and rearrange the dataframe that identical labels are in the same column. For e.g.:
ID | Label |
---|---|
0 | apple, tom, car |
1 | apple, car |
2 | tom, apple |
to
ID | Label | 0 | 1 | 2 |
---|---|---|---|---|
0 | apple, tom, car | apple | car | tom |
1 | apple, car | apple | car | None |
2 | tom, apple | apple | None | tom |
df["Label"].str.split(',',3, expand=True)
0 | 1 | 2 |
---|---|---|
apple | tom | car |
apple | car | None |
tom | apple | None |
I know how to split the stringcolumn, but I can't really figure out how to sort the label columns, especially since the number of labels per sample is different.
CodePudding user response:
The goal of your program is not clear. If you are curious which elements are present in the different rows, then we can just get them all and stack the dataframe like such:
df = pd.DataFrame({'label': ['apple, banana, grape', 'apple, banana', 'banana, grape']})
final_df = df['label'].str.split(', ', expand=True).stack()
final_df.reset_index(drop=True, inplace=True)
>>> final_df
0 apple
1 banana
2 grape
3 apple
4 banana
5 banana
6 grape
At this point we can drop the duplicates or count the occurrence of each, depending on your use case.
CodePudding user response:
Try:
df = df.assign(xxx=df.Label.str.split(r"\s*,\s*")).explode("xxx")
df["Col"] = df.groupby("xxx").ngroup()
df = (
df.set_index(["ID", "Label", "Col"])
.unstack(2)
.droplevel(0, axis=1)
.reset_index()
)
df.columns.name = None
print(df)
Prints:
ID Label 0 1 2
0 0 apple, tom, car apple car tom
1 1 apple, car apple car NaN
2 2 tom, apple apple NaN tom