I have a csv file where one field should be normalized over two records:
----- ---------
| id | field |
----- ---------
| 1 | A-a,B-b |
| 2 | C-c |
----- ---------
so some records are comma separated with two tuples to become different records
----- ---------
| id | field |
----- ---------
| 1 | A-a |
| 1 | B-b |
| 2 | C-c |
----- ---------
and then split over two fields
----- --------- ---------
| id | field_1 | field_2 |
----- --------- ---------
| 1 | A | a |
| 1 | B | b |
| 2 | C | c |
----- --------- ---------
I have this solution for the last step
df[['field_1', 'field_2']] = df['field'].str.split('-', expand = True)
but I'm missing the first step. Can you help?
CodePudding user response:
df.field = df.field.str.split(',')
df1 = df.explode('field')
df1[['field_1', 'field_2']] = df1.field.str.split('-', expand = True)
df1
id field field_1 field_2
0 1 A-a A a
0 1 B-b B b
1 2 C-c C c
CodePudding user response:
You can achieve this with a single regex and str.extractall
:
out = df.join(df['field']
.str.extractall(r'(?P<field_1>[^-,] )-(?P<field_2>[^-,] )')
.droplevel(1)
)
output:
id field field_1 field_2
0 1 A-a,B-b A a
0 1 A-a,B-b B b
1 2 C-c C c