I'm trying to extract some values from 1 column. Some rows have 4 values I want to extract and different notations for that values.
My dataset:
wow
0 wow-555555 1000110001 DT000333-6403, Hellostreet 45 Town1
1 (wow-666666) - 1000110002; DT000333-6404 Hellostreet 55 Town2
2 wow 777777 DT000333-6405 Hellostreet 56
3 WOW 888888 1000110004 DT000444-6403, Hellostreet 120 Town2
4 (WoW 999999) 1000110005 DT000555-6403, Hellostreet 121 Town3
I need:
I need 4 new columns with the different values. First value is the 6 numbers (always unique) behind 'wow'. The second is the 10 digit number (always unique). The third is the 'DT' with 10 numbers behind it (always unique). The forth is the address.
I tried this:
df2 = df['wow'].str.extract(r"(?P<number1>\w{3}-\d{6}).*?(?P<number2>\d{10}).*?
(?P<number3>DT\d{6}-\d{4}).*?(?P<address>\w.*|$)")
This gets me the wow-number but only with '-'and not only the 6 digits. I'm also missing the 10 digits number. The DT and address works.
Expected result
number1 number2 number3 address
0 555555 1000110001 DT000333-6403 Hellostreet 45 Town1
1 666666 1000110002 DT000333-6404 Hellostreet 55 Town2
2 777777 NaN DT000333-6405 Hellostreet 56
3 888888 1000110004 DT000444-6403 Hellostreet 120 Town2
4 999999 1000110005 DT000555-6403 Hellostreet 121 Town3
Appreciate the help and effort. Thank you!
CodePudding user response:
Seems to work with:
regex = r'\w{3}[-\s](?P<number1>\d{6}).*?(?P<number2>\d{10})?\D*?(?P<number3>DT\d{6}-\d{4}).*?(?P<address>\w.*)'
df2 = df['wow'].str.extract(regex)
output:
number1 number2 number3 address
0 555555 1000110001 DT000333-6403 Hellostreet 45 Town1
1 666666 1000110002 DT000333-6404 Hellostreet 55 Town2
2 777777 NaN DT000333-6405 Hellostreet 56
3 888888 1000110004 DT000444-6403 Hellostreet 120 Town2
4 999999 1000110005 DT000555-6403 Hellostreet 121 Town3