Home > other >  Pandas: extract different values via regex in a single column
Pandas: extract different values via regex in a single column

Time:10-27

I'm trying to extract some values from 1 column. Some rows have 4 values I want to extract and different notations for that values.

My dataset:

     wow
0    wow-555555 1000110001 DT000333-6403, Hellostreet 45  Town1
1    (wow-666666) - 1000110002; DT000333-6404 Hellostreet 55 Town2
2    wow 777777 DT000333-6405 Hellostreet 56 
3    WOW 888888 1000110004 DT000444-6403, Hellostreet 120  Town2
4    (WoW 999999) 1000110005 DT000555-6403, Hellostreet 121 Town3

I need:

I need 4 new columns with the different values. First value is the 6 numbers (always unique) behind 'wow'. The second is the 10 digit number (always unique). The third is the 'DT' with 10 numbers behind it (always unique). The forth is the address.

I tried this:

df2 = df['wow'].str.extract(r"(?P<number1>\w{3}-\d{6}).*?(?P<number2>\d{10}).*?
(?P<number3>DT\d{6}-\d{4}).*?(?P<address>\w.*|$)")

This gets me the wow-number but only with '-'and not only the 6 digits. I'm also missing the 10 digits number. The DT and address works.

Expected result

     number1    number2       number3        address 
0    555555     1000110001    DT000333-6403  Hellostreet 45  Town1
1    666666     1000110002    DT000333-6404  Hellostreet 55 Town2
2    777777     NaN           DT000333-6405  Hellostreet 56 
3    888888     1000110004    DT000444-6403  Hellostreet 120  Town2
4    999999     1000110005    DT000555-6403  Hellostreet 121 Town3

Appreciate the help and effort. Thank you!

CodePudding user response:

Seems to work with:

regex = r'\w{3}[-\s](?P<number1>\d{6}).*?(?P<number2>\d{10})?\D*?(?P<number3>DT\d{6}-\d{4}).*?(?P<address>\w.*)'

df2 = df['wow'].str.extract(regex)

output:

  number1     number2        number3                 address
0  555555  1000110001  DT000333-6403   Hellostreet 45  Town1
1  666666  1000110002  DT000333-6404    Hellostreet 55 Town2
2  777777         NaN  DT000333-6405          Hellostreet 56
3  888888  1000110004  DT000444-6403  Hellostreet 120  Town2
4  999999  1000110005  DT000555-6403   Hellostreet 121 Town3
  • Related