I need to extract the ZIP code (only the zip code) into a new column for further analysis. I am mostly using pandas within my data cleaning phase. I trying to use this code before:
import pandas as pd
df_participant = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/dqthon-participants.csv')
df_participant['postal_code'] = df_participant['address'].str.extract(r'([0-9]\d )')
print (df_participant[['address','postal_code']].head())
but it did not work
Any help would be very much appreciated! Thank you
CodePudding user response:
You can use .str.findall
method to find all the numbers in address field and then get the last value as Zip code.
Here is an example:
Data:
customer address
0 shovon 1234 56th St, Bham, AL 35222
1 arsho 4th Ave, Dever, NY 25699
2 arshovon 1245 apt 9 69th St, Rio, FL 54444
3 rahman this address has no number
Code:
import pandas as pd
data = {
"customer": [
"shovon", "arsho", "arshovon", "rahman"
],
"address": [
"1234 56th St, Bham, AL 35222",
"4th Ave, Dever, NY 25699",
"1245 apt 9 69th St, Rio, FL 54444",
"this address has no number"
]
}
df = pd.DataFrame(data)
df['postal_code'] = df['address'].str.findall(r'([0-9]\d )').apply(
lambda x: x[-1] if len(x) >= 1 else '')
print(df)
Output:
customer address postal_code
0 shovon 1234 56th St, Bham, AL 35222 35222
1 arsho 4th Ave, Dever, NY 25699 25699
2 arshovon 1245 apt 9 69th St, Rio, FL 54444 54444
3 rahman this address has no number
Explanation:
This will search for each number group in the address field and set the last number as zip code. If there is no number in the address field, it will set an empty string as zip code.
References:
CodePudding user response:
With str.extract
df_participant['postal_code'] = df_participant['address'].str.extract(r'(\d{5})')
#OR if the length of the postal code changes, just make it \d combined with "$"
df_participant['postal_code'] = df_participant['address'].str.extract(r'(\d )$')
but you don't need it here. Just take the last 5 digits of the string, since the postal code is always at the end.
df_participant['postal_code'] = df_participant['address'].str[-5:]
CodePudding user response:
If all postal code are 5 character long then following might help.
Change
df_participant['postal_code'] = df_participant['address'].str.extract(r'([0-9]\d )')
TO
df_participant['postal_code'] = df_participant['address'].str[-5::]
Output received at my end
address postal_code
0 Gg. Monginsidi No. 08\nMedan, Aceh 80734 80734
1 Gg. Rajawali Timur No. 7\nPrabumulih, MA 09434 09434
2 Jalan Kebonjati No. 0\nAmbon, SS 57739 57739
3 Jl. Yos Sudarso No. 109\nLubuklinggau, SR 76156 76156