I have scraped addresses from a website, but their format is not consistent. For instance:
address = '139 McKinnon Road, PINELANDS, NT, 829'
address_2 = '108 East Point Road, Fannie Bay, NT, 820'
address_3 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
I tried splitting them on the space character ' ', but couldn't get the desired result.
This is what I tried:
if "," in address:
    raw_address = address.split(",")
    splitted_address = [
        adr for adr in raw_address if not adr.islower() and not adr.isupper()
    ]
    splitted_suburb = [adr for adr in raw_address if adr.isupper()]
    item["Street_Address"] = splitted_address[0].strip()
    item["Suburb"] = splitted_address[1].strip()
    item["State"] = splitted_suburb[0].strip()
    item["Postcode"] = splitted_address[2].strip()
else:
    raw_address = address.split(" ")
    splitted_address = [
        adr for adr in raw_address if not adr.islower() and not adr.isupper()
    ]
    splitted_suburb = [adr for adr in raw_address if adr.isupper()]
    item["Street_Address"] = " ".join(splitted_address[:-1])
    item["Suburb"] = splitted_suburb[0]
    item["State"] = splitted_suburb[1]
    item["Postcode"] = splitted_address[-1]
And my desired output should be like this:
Street_Address,Suburb,State,Postcode
Units 1-14, 29 Wiltshire Lane, DELACOMBE, VIC, 3356
How can I split the full address into these specific fields?
Update: I have parsed out the desired fields using regex pattern:
regex_str = "(^.*?(?:Lane|Street|Boulevard|Crescent|Place|Road|Highway|Avenue|Drive|Circuit|Parade|Telopea|Nicklin Way|Terrace|Square|Court|Close|Endeavour Way|Esplanade|East|The Centreway|Mall|Quay|Gateway|Low Way|Point|Rd|Morinda|Way|Ave|St|South Steyne|Broadway|HQ|Expressway|Strett|Castlereagh|Meadow Way|Track|Kulkyne Way|Narabang Way|Bank)),? ?(.*?),? ?([A-Z]{3}),? ?(\d{,4})$"
matches = re.search(regex_str, full_address)
street, suburb, state, postcode = matches.groups()
item["Street_Address"] = street
item["Suburb"] = suburb
item["State"] = state
item["Postcode"] = postcode
It works for some addresses, such as address_3, but for address and address_2 the pattern does not match and I get a NoneType error:
  File "colliers_sale.py", line 164, in parse_details
    street, suburb, state, postcode = matches.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
How can I fix this?
CodePudding user response:
You can use regular expressions, but you will probably need multiple patterns, something like this:
import re
match = None
if (match := re.search(r'(.*?\d+-\d+),? (.+?) ([A-Z ]+) ([A-Z]+) (\d+)$', address)):
    pass  # this matches address, address_3, address_4
elif (match := re.search(r'(\d+-\d+) (.+?), (.+?), ([A-Z]+), (\d+)$', address)):
    pass  # this matches address_2
# elif (...another pattern...)
if match:
    print(match[1], match[2], match[3], match[4], match[5], sep=' # ')
else:
    print('nothing matched')
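The same "try several patterns" idea can be wrapped in a small helper that returns None instead of raising when nothing matches. This is only a sketch: the names PATTERNS and split_address are made up for illustration, and the two patterns are assumptions based on the sample addresses in this thread (state code of two or three capitals, postcode of three or four digits), so real data will likely need more entries.

```python
import re

# Hypothetical pattern list; extend it as new address layouts turn up.
PATTERNS = [
    # Comma-separated: "3-11 Hamilton Street, Townsville City, QLD, 4810"
    re.compile(r'^(.+), *([^,]+), *([A-Z]{2,3}), *(\d{3,4})$'),
    # Space-separated with an upper-case suburb:
    # "6-10 Mount Street MOUNT DRUITT NSW 2770"
    re.compile(r'^(.+?) ([A-Z][A-Z ]*) ([A-Z]{2,3}) (\d{3,4})$'),
]


def split_address(address):
    """Return (street, suburb, state, postcode), or None if no pattern matches."""
    for pattern in PATTERNS:
        match = pattern.search(address)
        if match:
            return tuple(part.strip() for part in match.groups())
    return None


result = split_address('139 McKinnon Road, PINELANDS, NT, 829')
if result is None:
    print('nothing matched')
else:
    street, suburb, state, postcode = result
    print(street, suburb, state, postcode, sep=' # ')
```

Checking for None before unpacking avoids the AttributeError from the question, and unmatched addresses can be logged and used to write the next pattern.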
CodePudding user response:
Try the 're' package. You can do it using regular expressions like this:
import re
address = 'Units 1-14, 29 Wiltshire Lane DELACOMBE VIC 3356'
address_2 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
address_3 = '6-10 Mount Street MOUNT DRUITT NSW 2770'
address_4 = '34-36 Fairfield Street FAIRFIELD EAST NSW 2165'
addresses = [address, address_2, address_3, address_4]
for add in addresses:
    print(', '.join(re.findall(r"(.*\d+-\d+)[, ]+(\w*\s*\w+\s+\w+)[, ]+(\w*\s*\w+)[, ]+(\w+)[, ]+(\d+)", add)[0]))
The parentheses in the pattern passed to re.findall are capture groups; they capture the parts you want.
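Capture groups also help show why the pattern in the question returns None for the two NT addresses: `[A-Z]{3}` requires exactly three capital letters, and "NT" has only two. Below is a minimal sketch using a simplified tail of that pattern (state code plus postcode only); the names strict and relaxed are illustrative, and the `{2,3}` quantifier is an assumption about which state codes occur in the data.

```python
import re

# Simplified tail of the question's pattern: state code followed by postcode.
strict = re.compile(r'([A-Z]{3}),? ?(\d{,4})$')        # exactly three capitals
relaxed = re.compile(r'\b([A-Z]{2,3}),? ?(\d{,4})$')   # two or three capitals

for text in ['Fannie Bay, NT, 820', 'Townsville City, QLD, 4810']:
    strict_match = strict.search(text)
    relaxed_match = relaxed.search(text)
    print(
        text,
        strict_match.groups() if strict_match else None,
        relaxed_match.groups() if relaxed_match else None,
        sep=' # ',
    )
```

With the strict version the NT address produces no match at all, which is what leads to calling .groups() on None; the relaxed version captures both the state and the postcode for either address.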