Home > Enterprise >  python pandas normalization
python pandas normalization

Time:06-11

I have a csv file where one field should be normalized over two records:

     ----- --------- 
    | id  | field   |
     ----- --------- 
    | 1   | A-a,B-b |
    | 2   | C-c     |
     ----- --------- 

so some records are comma separated with two tuples to become different records

     ----- --------- 
    | id  | field   |
     ----- --------- 
    | 1   | A-a     |
    | 1   | B-b     |
    | 2   | C-c     |
     ----- --------- 

and then split over two fields

     ----- --------- --------- 
    | id  | field_1 | field_2 |
     ----- --------- --------- 
    | 1   | A       | a       |
    | 1   | B       | b       |
    | 2   | C       | c       |
     ----- --------- --------- 

I have this solution for the last step

df[['field_1', 'field_2']] = df['field'].str.split('-', expand = True)

but I'm missing the first step. Can you help?

CodePudding user response:

df.field = df.field.str.split(',')
df1 = df.explode('field')
df1[['field_1', 'field_2']] = df1.field.str.split('-', expand = True)
df1
 
   id field field_1 field_2
0   1   A-a       A       a
0   1   B-b       B       b
1   2   C-c       C       c

CodePudding user response:

You can achieve this with a single regex and str.extractall:

out = df.join(df['field']
              .str.extractall(r'(?P<field_1>[^-,] )-(?P<field_2>[^-,] )')
              .droplevel(1)
             )

output:

   id    field field_1 field_2
0   1  A-a,B-b       A       a
0   1  A-a,B-b       B       b
1   2      C-c       C       c
  • Related