Home > Net >  How to parse column values into multiple rows in dataframe?
How to parse column values into multiple rows in dataframe?

Time:09-30

I have multiple dataframes and the format is this:

a|b|c   label       1003252452  20210929

There may be one column in a df that is delimited by a vertical bar |. I want to split this row into 3 rows, with all other columns remaining the same:

id  tag         uid         date
a   label       1003252452  20210929
b   label       1003252452  20210929
c   label       1003252452  20210929

Also after they are split, there might be duplicate rows due to the split, duplicate rows should be removed. Is there a good way to do this? I am thinking to approach this way:

for df in all_dfs:
      dict_rows = []
      for index, row in df.iterrows():
        new_rows = []
        if 'id' in row:
           row_dict = dict(row)
           tags = row_dict['id'].split('|')
            for tag in tags:
                 new_row = {'id': tag}
                 del row_dict['id']
                 
                        ...

This may be possible, but very complicated. Is there a better way to do this?

CodePudding user response:

Maybe this is what you want

>>> df
      id    tag         uid      date
0  a|b|c  label  1003252452  20210929
>>> df.drop('id', axis=1).join(df['id'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('id'))
     tag         uid      date id
0  label  1003252452  20210929  a
0  label  1003252452  20210929  b
0  label  1003252452  20210929  c

CodePudding user response:

Use str.split explode drop_duplicates:

df.assign(id = df.id.str.split('|')).explode('id').drop_duplicates().reset_index(drop=True)

  id    tag         uid      date
0  a  label  1003252452  20210929
1  b  label  1003252452  20210929
2  c  label  1003252452  20210929
  • Related