I'm trying to extract some values from 1 column. Some rows have 3 values I want to with different setups.
My dataset:
col1
0 1001100100 / hello street 2 a town1
1 1001100102 ;hello 3 towns2
2 STRZ19-0072 DT001001-0100 location1 town4
3 1001100103_hello street 3, town5
4 DT001002-0100 street 78
5 1001100107 DT001002-0102 street 6a town7
I need:
I need 3 new columns with different values focused on the 10 digit number starting with "100", the 'DT'number with 6 digits-4 digits and the street town combined. Other values like 'STRZ19-0072' are not relevant.
I tried this:
df2 = df['col1'].str.extract(r"(?<col2>\d{10})|(?<col3>PR\d{6}-\d{4})|(?<col4>\w.*)")
This does not get me the DT-number from row 3.
Expected result:
col2 col3 address
0 1001100100 NaN hello street 2 a town1
1 1001100102 NaN hello 3 towns2
2 NaN DT001001-0100 location1 town4
3 1001100103 NaN hello street 3, town5
4 NaN DT001002-0100 street 78
5 1001100107 DT001002-0102 street 6a town7
Appreciate the help and effort. Thank you!
CodePudding user response:
I'd do it in 3 separate steps:
df["col2"] = df["col1"].str.extract(r"(100\d{7})")
df["col3"] = df["col1"].str.extract(r"(DT\d{6}-\d{4})")
df["col4"] = df["col1"].str.extract(r"([A-Za-z][A-Za-z\d\s,] )$")
print(df)
Prints:
col1 col2 col3 col4
0 1001100100 / hello street 2 a town1 1001100100 NaN hello street 2 a town1
1 1001100102 ;hello 3 towns2 1001100102 NaN hello 3 towns2
2 STRZ19-0072 DT001001-0100 location1 town4 NaN DT001001-0100 location1 town4
3 1001100103_hello street 3, town5 1001100103 NaN hello street 3, town5
4 DT001002-0100 street 78 NaN DT001002-0100 street 78
5 1001100107 DT001002-0102 street 6a town7 1001100107 DT001002-0102 street 6a town7
CodePudding user response:
Instead of using an alternation for all 3 groups, you can using single pattern with the 3 named capture groups after each other:
^(?:(?P<col2>\d{10})|(?![A-Z]{2}\d)\S )?\s?(?P<col3>[A-Z]{2}\d{6}-\d{4})?[\W_]*(?P<col4>\w.*)
The pattern matches:
^
Start of string(?:
Non capture group for the alternation(?P<col2>\d{10})|(?![A-Z]{2}\d)\S
Capture 10 digits in group col2 or match 1 non whitspace chars if string does not start with 2 digits and a char A-Z
)?
Close the non capture group\s*
Match optional whitespace chars(?P<col3>[A-Z]{2}\d{6}-\d{4})?
Optionally capture 2 chars A-Z, 6 digits-
and 4 digits in group col3[\W_]*
Match optional non word chars and_
(?P<col4>\w.*)
Capture a word character and match the rest of the line in group col4
import pandas as pd
data = [
"1001100100 / hello street 2 a town1",
"1001100102 ;hello 3 towns2",
"STRZ19-0072 DT001001-0100 location1 town4",
"1001100103_hello street 3, town5",
"DT001002-0100 street 78",
"1001100107 DT001002-0102 street 6a town7"
]
df = pd.DataFrame(data, columns=["col1"])
df2 = pd.concat(
[
df['col1'],
df['col1'].str.extract(r"^(?:(?P<col2>\d{10})|(?![A-Z]{2}\d)\S )?\s*(?P<col3>[A-Z]{2}\d{6}-\d{4})?[\W_]*(?P<col4>\w.*)")
], axis=1
)
print(df2)
Output
col1 col2 col3 col4
0 1001100100 / hello street 2 a town1 1001100100 NaN hello street 2 a town1
1 1001100102 ;hello 3 towns2 1001100102 NaN hello 3 towns2
2 STRZ19-0072 DT001001-0100 location1 town4 NaN DT001001-0100 location1 town4
3 1001100103_hello street 3, town5 1001100103 NaN hello street 3, town5
4 DT001002-0100 street 78 NaN DT001002-0100 street 78
5 1001100107 DT001002-0102 street 6a town7 1001100107 DT001002-0102 street 6a town7