I have multiple dataframes and the format is this:
a|b|c label 1003252452 20210929
There may be one column in a df that is delimited by a vertical bar |. I want to split this row into 3 rows, with all other columns remaining the same:
id tag uid date
a label 1003252452 20210929
b label 1003252452 20210929
c label 1003252452 20210929
Also after they are split, there might be duplicate rows due to the split, duplicate rows should be removed. Is there a good way to do this? I am thinking to approach this way:
for df in all_dfs:
dict_rows = []
for index, row in df.iterrows():
new_rows = []
if 'id' in row:
row_dict = dict(row)
tags = row_dict['id'].split('|')
for tag in tags:
new_row = {'id': tag}
del row_dict['id']
...
This may be possible, but very complicated. Is there a better way to do this?
CodePudding user response:
Maybe this is what you want
>>> df
id tag uid date
0 a|b|c label 1003252452 20210929
>>> df.drop('id', axis=1).join(df['id'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('id'))
tag uid date id
0 label 1003252452 20210929 a
0 label 1003252452 20210929 b
0 label 1003252452 20210929 c
CodePudding user response:
Use str.split
explode
drop_duplicates
:
df.assign(id = df.id.str.split('|')).explode('id').drop_duplicates().reset_index(drop=True)
id tag uid date
0 a label 1003252452 20210929
1 b label 1003252452 20210929
2 c label 1003252452 20210929