Split Australian addresses into street_address, suburb, state and postcode

I have scraped addresses from a website, but their format is not consistent. For instance:

address = '139 McKinnon Road, PINELANDS, NT, 829'
address_2 = '108 East Point Road, Fannie Bay, NT, 820'
address_3 = '3-11 Hamilton Street, Townsville City, QLD, 4810'

I have tried to split them by space ' ' but couldn't get the desired result.

I have tried:

if "," in address:
    raw_address = address.split(",")
    splitted_address = [
        adr for adr in raw_address if not adr.islower() and not adr.isupper()
    ]
    splitted_suburb = [adr for adr in raw_address if adr.isupper()]
    item["Street_Address"] = splitted_address[0].strip()
    item["Suburb"] = splitted_address[1].strip()
    item["State"] = splitted_suburb[0].strip()
    item["Postcode"] = splitted_address[2].strip()
else:
    raw_address = address.split(" ")
    splitted_address = [
        adr for adr in raw_address if not adr.islower() and not adr.isupper()
    ]
    splitted_suburb = [adr for adr in raw_address if adr.isupper()]
    item["Street_Address"] = " ".join(splitted_address[:-1])
    item["Suburb"] = splitted_suburb[0]
    item["State"] = splitted_suburb[1]
    item["Postcode"] = splitted_address[-1]

And my desired output should be like this:

Street_Address,Suburb,State,Postcode
Units 1-14, 29 Wiltshire Lane, DELACOMBE, VIC, 3356

How can I split the full address into these specific fields?

Update: I have parsed out the desired fields using regex pattern:

regex_str = "(^.*?(?:Lane|Street|Boulevard|Crescent|Place|Road|Highway|Avenue|Drive|Circuit|Parade|Telopea|Nicklin Way|Terrace|Square|Court|Close|Endeavour Way|Esplanade|East|The Centreway|Mall|Quay|Gateway|Low Way|Point|Rd|Morinda|Way|Ave|St|South Steyne|Broadway|HQ|Expressway|Strett|Castlereagh|Meadow Way|Track|Kulkyne Way|Narabang Way|Bank)),? ?(.*?),? ?([A-Z]{3}),? ?(\d{,4})$"
matches = re.search(regex_str, full_address)
street, suburb, state, postcode = matches.groups()
item["Street_Address"] = street
item["Suburb"] = suburb
item["State"] = state
item["Postcode"] = postcode

It works for some addresses, such as address_3, but for address_1 and address_2 the pattern does not match and I get a NoneType error:

File "colliers_sale.py", line 164, in parse_details
    street, suburb, state, postcode = matches.groups()
AttributeError: 'NoneType' object has no attribute 'groups'

How can I fix this?

CodePudding user response：

You can use regular expressions, but you will probably need multiple patterns, something like this:

import re

match = None
if (match := re.search(r'(.*?\d+-\d+),? (.+?) ([A-Z ]+) ([A-Z]+) (\d+)$', address)):
    pass  # this matches address, address_3, address_4
elif (match := re.search(r'(\d+-\d+) (.+?), (.+?), ([A-Z]+), (\d+)$', address)):
    pass  # this matches address_2
# elif (...another pattern...)

if match:
    print(match[1], match[2], match[3], match[4], match[5], sep=' # ')
else:
    print('nothing matched')
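
One way to package the multi-pattern idea is a small helper that tries each pattern in turn and returns None when nothing matches, so the caller never calls .groups() on None. This is only a sketch: split_address and PATTERNS are hypothetical names, and both patterns assume a state of two or three capital letters and a postcode of three or four digits, as in the sample addresses from the question.

```python
import re

# Hypothetical pattern list; order matters because the first match wins.
PATTERNS = [
    # Comma-separated: "<street>, <suburb>, <STATE>, <postcode>".
    # The street itself may contain commas ("Units 1-14, 29 Wiltshire Lane").
    re.compile(r'^(.+), ([^,]+), ([A-Z]{2,3}), (\d{3,4})$'),
    # Space-separated with an upper-case suburb:
    # "<street> <SUBURB> <STATE> <postcode>".
    re.compile(r'^(.+?) ([A-Z][A-Z ]*) ([A-Z]{2,3}) (\d{3,4})$'),
]


def split_address(address):
    """Return (street, suburb, state, postcode), or None if no pattern matches."""
    for pattern in PATTERNS:
        match = pattern.search(address)
        if match:
            return match.groups()
    return None


if __name__ == '__main__':
    samples = [
        '139 McKinnon Road, PINELANDS, NT, 829',
        '3-11 Hamilton Street, Townsville City, QLD, 4810',
        '6-10 Mount Street MOUNT DRUITT NSW 2770',
    ]
    for sample in samples:
        print(split_address(sample))
```

Addresses that fit neither layout come back as None, so they can be logged and handled with a further pattern instead of raising an AttributeError.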

CodePudding user response：

Try the 're' package. You can do it using regular expressions like this:

import re 

address = 'Units 1-14, 29 Wiltshire Lane DELACOMBE VIC 3356'
address_2 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
address_3 = '6-10 Mount Street MOUNT DRUITT NSW 2770'
address_4 = '34-36 Fairfield Street FAIRFIELD EAST NSW 2165'

addresses = [address, address_2, address_3, address_4]

for add in addresses:
    print(', '.join(re.findall(r"(.*\d+-\d+)[, ]+(\w*\s*\w+\s+\w+)[, ]+(\w*\s*\w+)[, ]+(\w+)[, ]+(\d+)", add)[0]))

The parentheses in the pattern passed to re.findall are capture groups: each pair captures one of the wanted parts.
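
Capture groups can also be named, which makes the result easier to map onto fields. The sketch below is an illustration rather than a complete parser: parse_address and ADDRESS_RE are hypothetical names, it handles only the comma-separated layout, and it assumes that a state is two or three capital letters (so "NT" matches, which a pattern requiring exactly three letters would not). Padding a 3-digit postcode with a leading zero is a further assumption about how the scraped data lost it.

```python
import re

# Assumed layout: "<street>, <suburb>, <STATE>, <postcode>".
ADDRESS_RE = re.compile(
    r'^(?P<street>.+),\s*(?P<suburb>[^,]+),\s*'
    r'(?P<state>[A-Z]{2,3}),\s*(?P<postcode>\d{3,4})$'
)


def parse_address(full_address):
    """Return a dict of the four fields, or None when the pattern does not match."""
    match = ADDRESS_RE.search(full_address)
    if match is None:  # re.search returns None when there is no match
        return None
    fields = match.groupdict()
    # Assumption: a 3-digit postcode such as "829" was originally "0829".
    fields['postcode'] = fields['postcode'].zfill(4)
    return fields


if __name__ == '__main__':
    print(parse_address('139 McKinnon Road, PINELANDS, NT, 829'))
    print(parse_address('6-10 Mount Street MOUNT DRUITT NSW 2770'))
```

Checking for None before reading the groups avoids the "'NoneType' object has no attribute 'groups'" error when an address does not fit the pattern.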