Home > Enterprise >  How to extract zip code from address using pandas function extract()?
How to extract zip code from address using pandas function extract()?

Time:05-21

I need to extract the ZIP code (only the zip code) into a new column for further analysis. I am mostly using pandas within my data cleaning phase. I trying to use this code before:

import pandas as pd
df_participant = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/dqthon-participants.csv')

df_participant['postal_code'] = df_participant['address'].str.extract(r'([0-9]\d )')

print (df_participant[['address','postal_code']].head())

but it did not work

this is the output: enter image description here

Any help would be very much appreciated! Thank you

CodePudding user response:

You can use .str.findall method to find all the numbers in address field and then get the last value as Zip code.

Here is an example:

Data:

   customer                            address
0    shovon       1234 56th St, Bham, AL 35222
1     arsho           4th Ave, Dever, NY 25699
2  arshovon  1245 apt 9 69th St, Rio, FL 54444
3    rahman         this address has no number

Code:

import pandas as pd

data = {
    "customer": [
        "shovon", "arsho", "arshovon", "rahman"
    ],
    "address": [
        "1234 56th St, Bham, AL 35222",
        "4th Ave, Dever, NY 25699",
        "1245 apt 9 69th St, Rio, FL 54444",
        "this address has no number"
    ]
}

df = pd.DataFrame(data)    
df['postal_code'] = df['address'].str.findall(r'([0-9]\d )').apply(
    lambda x: x[-1] if len(x) >= 1 else '')
print(df)

Output:

   customer                            address postal_code
0    shovon       1234 56th St, Bham, AL 35222       35222
1     arsho           4th Ave, Dever, NY 25699       25699
2  arshovon  1245 apt 9 69th St, Rio, FL 54444       54444
3    rahman         this address has no number            

Explanation:

This will search for each number group in the address field and set the last number as zip code. If there is no number in the address field, it will set an empty string as zip code.

References:

CodePudding user response:

With str.extract

df_participant['postal_code'] = df_participant['address'].str.extract(r'(\d{5})')

#OR if the length of the postal code changes, just make it \d  combined with "$"

df_participant['postal_code'] = df_participant['address'].str.extract(r'(\d )$')

but you don't need it here. Just take the last 5 digits of the string, since the postal code is always at the end.

df_participant['postal_code'] = df_participant['address'].str[-5:]

CodePudding user response:

If all postal code are 5 character long then following might help.

Change

df_participant['postal_code'] = df_participant['address'].str.extract(r'([0-9]\d )')

TO

df_participant['postal_code'] = df_participant['address'].str[-5::]

Output received at my end

                                             address postal_code
0           Gg. Monginsidi No. 08\nMedan, Aceh 80734       80734
1     Gg. Rajawali Timur No. 7\nPrabumulih, MA 09434       09434
2             Jalan Kebonjati No. 0\nAmbon, SS 57739       57739
3    Jl. Yos Sudarso No. 109\nLubuklinggau, SR 76156       76156
  • Related