I am working with a pandas dataframe which has 2 columns, a property ID variable ('listingid') and the corresponding address of the property('address'). I have used an address parser (street-address) to parse each address and the resulting output is a pandas series for each address and is stored under another variable called 'parsed'. Please find below an example of what I mean:-
listingid | address | parsed |
---|---|---|
1 | 570 WOODBOROUGH ROAD NG3 5FH | 'house':'570' 'street name': 'WOODBOROUGH' 'postcode': 'NG3 5FH' |
2 | LEOMINSTER BUSINESS PLC NG3 5FG | 'house': 'None' 'street name':'LEOMINSTER BUSINESS PLC' 'postcode': 'NG3 5FG' |
My question is: can I extract the components listed as 'house','street name', and 'postcode' and store them as three separate variables in the same dataframe? What I would like to end up with is the following:-
listingid | address | hno | street | postcode |
---|---|---|---|---|
1 | 570 WOODBOROUGH ROAD NG3 5FH | 570 | WOODBOROUGH | NG3 5FH |
2 | LEOMINSTER BUSINESS PLC NG3 5FG | None | LEOMINSTER BUSINESS PLC | NG3 5FG |
So essentially each row in my dataframe is a pandas series containing the different parsed elements of a street address and I wish to extract the individual components and convert them to columns in the existing dataframe.
Splitting the address into its constituent words or just using regular expressions is not possible in my case as many addresses are irregularly or uniquely formatted and the address parser I found did a fairly decent job of idenitifying the differenent components of the address.
Very new to Python so apologies if this is a basic question and happy to provide more clarification if required!
CodePudding user response:
Here is a proposition using some classic string transformation and pandas.concat
:
df_parsed = (
df['parsed'].str.replace(' ', '')
.str.replace("''", ',')
.str.replace("'", "")
.str.split(",")
.explode()
.str.split(':', expand=True)
.pivot(columns=0, values=1)
)
out = (
pd.concat([df, df_parsed], axis=1)
.rename(columns={'house': 'hno', 'streetname': 'street'})
.drop(columns='parsed')
)
# Output :
print(out)
listingid address hno postcode street
0 1 570 WOODBOROUGH ROAD NG3 5FH 570 NG35FH WOODBOROUGH
1 2 LEOMINSTER BUSINESS PLC NG3 5FG None NG35FG LEOMINSTERBUSINESSPLC
CodePudding user response:
In case you wanted to employ some regex
:
# strip off the keys before semicolon and add commas between the three values
df['parsed'] = (df['parsed'].str.replace("\'\w \s*\w ?\'\:\s*?", "", regex=True)
.str.replace(" ", "") # remove uneven whitespace
.str.replace("''", ", "))
# split into separate columns with expansion
df[['hno', 'street', 'postcode']] = (df['parsed'].str.strip("'")
.str.split(",", expand=True))
# drop the redundant column
df = df.drop(columns=['parsed'])
print(df)
# Output:
listingid address hno street postcode
0 1 570 WOODBOROUGH ROAD NG3 5FH 570 WOODBOROUGH NG35FH
1 2 LEOMINSTER BUSINESS PLC NG3 5FG None LEOMINSTERBUSINESSPLC NG35FG