I am trying to extract the state from an address string. Some of the addresses are Canadian and some are American. I think the regex is correct, but the result is an array of shape (29999, 29999) and I don't understand why.
Here is a sample of `data['Address']`:
19 6349 IN-45, Bloomington, IN 47403
20 ~
21 370 Canyon Meadows Dr SE, Calgary, AB T2J 7C6,...
22 3600 Genesee St, Buffalo, NY 14225
Here is my code:
data['state'] = np.select(
    [
        data['Address'].str.contains(r',(\s.*\s[0-9])'),
        data['Address'].str.contains(r',(\s.*\s[A-Za-z][0-9])'),
    ],
    [
        data['Address'].str.extract(r',(\s.*\s[0-9])'),
        data['Address'].str.extract(r',(\s.*\s[A-Za-z][0-9])'),
    ],
)
Any help appreciated.
CodePudding user response:
Update
Try:
data['State'] = data['Address'].str.extract(r',\s([^\s,]+)\s')
print(data)
# Output
                                              Address  State
19                  6349 IN-45, Bloomington, IN 47403     IN
20                                                  ~    NaN
21  370 Canyon Meadows Dr SE, Calgary, AB T2J 7C6,...     AB
22                 3600 Genesee St, Buffalo, NY 14225     NY
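For anyone who wants to try the pattern without the original data, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample rows in the question (the Calgary address is cut off where the printed output truncates it), and `expand=False` is my own addition so that `str.extract` returns a Series rather than a one-column DataFrame. The last two lines use a made-up address and an alternative pattern, both assumptions on my part, to show one case where the pattern may need tightening.

```python
import pandas as pd

# Sample rows copied from the question; the Calgary address stops where
# the printed output was truncated.
data = pd.DataFrame(
    {
        "Address": [
            "6349 IN-45, Bloomington, IN 47403",
            "~",
            "370 Canyon Meadows Dr SE, Calgary, AB T2J 7C6",
            "3600 Genesee St, Buffalo, NY 14225",
        ]
    },
    index=[19, 20, 21, 22],
)

# A comma, one whitespace character, then a run of characters that are
# neither whitespace nor commas, followed by whitespace.
pattern = r",\s([^\s,]+)\s"

# expand=False (an addition here) yields a Series for a single group.
data["State"] = data["Address"].str.extract(pattern, expand=False)
states = data["State"].tolist()

# Hypothetical address, not from the question: a two-word city name.
tricky = pd.Series(["1 Main St, New York, NY 10001"])
first_try = tricky.str.extract(pattern, expand=False).iloc[0]

# Assumed alternative: require exactly two uppercase letters.
alt_state = tricky.str.extract(r",\s([A-Z]{2})\s", expand=False).iloc[0]
```

With a city such as "New York", the first pattern stops at "New" because that word is followed by a space. If your data contains multi-word city names, the stricter two-letter pattern may be a better fit, assuming every state or province is written as a two-letter uppercase code.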
Old answer
Is this what you expect?
data['State'] = data['Address'].str.extract(r',(\s.*\s(?:[A-Za-z])?[0-9])')
print(data)
# Output
                                              Address              State
19                  6349 IN-45, Bloomington, IN 47403  Bloomington, IN 4
20                                                  ~                NaN
21  370 Canyon Meadows Dr SE, Calgary, AB T2J 7C6,...  Calgary, AB T2J 7
22                 3600 Genesee St, Buffalo, NY 14225      Buffalo, NY 1
I combined the two patterns from your choice list:
r',(\s.*\s[0-9])'
r',(\s.*\s[A-Za-z][0-9])'
into a single expression:
r',(\s.*\s(?:[A-Za-z])?[0-9])'
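As for why the original `np.select` call produced a (29999, 29999) array: the most likely cause is broadcasting. By default `str.extract` returns a DataFrame, so each choice has shape (n, 1), while each `str.contains` condition is a Series of shape (n,). NumPy broadcasts (n,) against (n, 1) to (n, n). The sketch below reproduces this on three rows. The sample Series, the variable names, and the capture-free pattern passed to `str.contains` are my own choices for illustration.

```python
import numpy as np
import pandas as pd

addr = pd.Series(
    [
        "6349 IN-45, Bloomington, IN 47403",
        "~",
        "3600 Genesee St, Buffalo, NY 14225",
    ]
)

pattern = r",(\s.*\s(?:[A-Za-z])?[0-9])"

# Condition: 1-D boolean array, shape (3,). The capture group is dropped
# here only to avoid the pandas warning about match groups.
cond = addr.str.contains(r",\s.*\s(?:[A-Za-z])?[0-9]", na=False).to_numpy()

# Default expand=True gives a one-column DataFrame, shape (3, 1).
as_frame = addr.str.extract(pattern).to_numpy()

# expand=False gives a Series, shape (3,).
as_series = addr.str.extract(pattern, expand=False).to_numpy()

# (3,) against (3, 1) broadcasts to (3, 3): the same effect as the
# (29999, 29999) result in the question.
wrong = np.select([cond], [as_frame], default=None)

# (3,) against (3,) stays 1-D.
right = np.select([cond], [as_series], default=None)

shapes = (wrong.shape, right.shape)
```

So if you want to keep the `np.select` approach, passing `expand=False` to each `str.extract` call should keep the result one-dimensional. The single `str.extract` call shown above avoids the issue entirely.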