Home > Back-end >  Pandas: extracting values via regex in 1 column
Pandas: extracting values via regex in 1 column

Time:10-29

I'm trying to extract some values from 1 column. Some rows have 3 values I want to with different setups.

My dataset:

     col1
0    1001100100 / hello street 2 a town1
1    1001100102 ;hello 3 towns2
2    STRZ19-0072 DT001001-0100 location1 town4
3    1001100103_hello street 3, town5
4    DT001002-0100 street  78
5    1001100107 DT001002-0102 street 6a town7

I need:

I need 3 new columns with different values focused on the 10 digit number starting with "100", the 'DT'number with 6 digits-4 digits and the street town combined. Other values like 'STRZ19-0072' are not relevant.

I tried this:

df2 = df['col1'].str.extract(r"(?<col2>\d{10})|(?<col3>PR\d{6}-\d{4})|(?<col4>\w.*)")

This does not get me the DT-number from row 3.

Expected result:

     col2        col3           address 
0    1001100100  NaN            hello street 2 a town1
1    1001100102  NaN            hello 3 towns2
2    NaN         DT001001-0100  location1 town4
3    1001100103  NaN            hello street 3, town5
4    NaN         DT001002-0100  street  78
5    1001100107  DT001002-0102  street 6a town7

Appreciate the help and effort. Thank you!

CodePudding user response:

I'd do it in 3 separate steps:

df["col2"] = df["col1"].str.extract(r"(100\d{7})")
df["col3"] = df["col1"].str.extract(r"(DT\d{6}-\d{4})")
df["col4"] = df["col1"].str.extract(r"([A-Za-z][A-Za-z\d\s,] )$")
print(df)

Prints:

                                        col1        col2           col3                    col4
0        1001100100 / hello street 2 a town1  1001100100            NaN  hello street 2 a town1
1                 1001100102 ;hello 3 towns2  1001100102            NaN          hello 3 towns2
2  STRZ19-0072 DT001001-0100 location1 town4         NaN  DT001001-0100         location1 town4
3           1001100103_hello street 3, town5  1001100103            NaN   hello street 3, town5
4                    DT001002-0100 street 78         NaN  DT001002-0100               street 78
5   1001100107 DT001002-0102 street 6a town7  1001100107  DT001002-0102         street 6a town7

CodePudding user response:

Instead of using an alternation for all 3 groups, you can using single pattern with the 3 named capture groups after each other:

^(?:(?P<col2>\d{10})|(?![A-Z]{2}\d)\S )?\s?(?P<col3>[A-Z]{2}\d{6}-\d{4})?[\W_]*(?P<col4>\w.*)

The pattern matches:

  • ^ Start of string
  • (?: Non capture group for the alternation
    • (?P<col2>\d{10})|(?![A-Z]{2}\d)\S Capture 10 digits in group col2 or match 1 non whitspace chars if string does not start with 2 digits and a char A-Z
  • )? Close the non capture group
  • \s* Match optional whitespace chars
  • (?P<col3>[A-Z]{2}\d{6}-\d{4})? Optionally capture 2 chars A-Z, 6 digits - and 4 digits in group col3
  • [\W_]* Match optional non word chars and _
  • (?P<col4>\w.*) Capture a word character and match the rest of the line in group col4

Regex demo

import pandas as pd

data = [
    "1001100100 / hello street 2 a town1",
    "1001100102 ;hello 3 towns2",
    "STRZ19-0072 DT001001-0100 location1 town4",
    "1001100103_hello street 3, town5",
    "DT001002-0100 street  78",
    "1001100107 DT001002-0102 street 6a town7"
]

df = pd.DataFrame(data, columns=["col1"])
df2 = pd.concat(
    [
        df['col1'],
        df['col1'].str.extract(r"^(?:(?P<col2>\d{10})|(?![A-Z]{2}\d)\S )?\s*(?P<col3>[A-Z]{2}\d{6}-\d{4})?[\W_]*(?P<col4>\w.*)")
    ], axis=1
)

print(df2)

Output
                                        col1        col2           col3                    col4
0        1001100100 / hello street 2 a town1  1001100100            NaN  hello street 2 a town1
1                 1001100102 ;hello 3 towns2  1001100102            NaN          hello 3 towns2
2  STRZ19-0072 DT001001-0100 location1 town4         NaN  DT001001-0100         location1 town4
3           1001100103_hello street 3, town5  1001100103            NaN   hello street 3, town5
4                   DT001002-0100 street  78         NaN  DT001002-0100              street  78
5   1001100107 DT001002-0102 street 6a town7  1001100107  DT001002-0102         street 6a town7
  • Related