Home > Mobile >  How do I convert each element of a pandas series into a separate column of an existing dataframe (Py
How do I convert each element of a pandas series into a separate column of an existing dataframe (Py

Time:10-05

I am working with a pandas dataframe which has 2 columns, a property ID variable ('listingid') and the corresponding address of the property('address'). I have used an address parser (street-address) to parse each address and the resulting output is a pandas series for each address and is stored under another variable called 'parsed'. Please find below an example of what I mean:-

listingid address parsed
1 570 WOODBOROUGH ROAD NG3 5FH 'house':'570' 'street name': 'WOODBOROUGH' 'postcode': 'NG3 5FH'
2 LEOMINSTER BUSINESS PLC NG3 5FG 'house': 'None' 'street name':'LEOMINSTER BUSINESS PLC' 'postcode': 'NG3 5FG'

My question is: can I extract the components listed as 'house','street name', and 'postcode' and store them as three separate variables in the same dataframe? What I would like to end up with is the following:-

listingid address hno street postcode
1 570 WOODBOROUGH ROAD NG3 5FH 570 WOODBOROUGH NG3 5FH
2 LEOMINSTER BUSINESS PLC NG3 5FG None LEOMINSTER BUSINESS PLC NG3 5FG

So essentially each row in my dataframe is a pandas series containing the different parsed elements of a street address and I wish to extract the individual components and convert them to columns in the existing dataframe.

Splitting the address into its constituent words or just using regular expressions is not possible in my case as many addresses are irregularly or uniquely formatted and the address parser I found did a fairly decent job of idenitifying the differenent components of the address.

Very new to Python so apologies if this is a basic question and happy to provide more clarification if required!

CodePudding user response:

Here is a proposition using some classic string transformation and pandas.concat :

df_parsed = (
                df['parsed'].str.replace(' ', '')
                            .str.replace("''", ',')
                            .str.replace("'", "")
                            .str.split(",")
                            .explode()
                            .str.split(':', expand=True)
                            .pivot(columns=0, values=1)
            )

out = (
        pd.concat([df, df_parsed], axis=1)
          .rename(columns={'house': 'hno', 'streetname': 'street'})
          .drop(columns='parsed')
      )

# Output :

print(out)

   listingid                          address   hno postcode                 street
0          1     570 WOODBOROUGH ROAD NG3 5FH   570   NG35FH            WOODBOROUGH
1          2  LEOMINSTER BUSINESS PLC NG3 5FG  None   NG35FG  LEOMINSTERBUSINESSPLC

CodePudding user response:

In case you wanted to employ some regex:

# strip off the keys before semicolon and add commas between the three values
df['parsed'] = (df['parsed'].str.replace("\'\w \s*\w ?\'\:\s*?", "", regex=True)
                            .str.replace(" ", "")  # remove uneven whitespace
                            .str.replace("''", ", "))

# split into separate columns with expansion
df[['hno', 'street', 'postcode']] = (df['parsed'].str.strip("'")
                                                 .str.split(",", expand=True))

# drop the redundant column
df = df.drop(columns=['parsed'])

print(df)

# Output:
    listingid  address                           hno    street               postcode
0   1          570 WOODBOROUGH ROAD NG3 5FH      570    WOODBOROUGH            NG35FH
1   2          LEOMINSTER BUSINESS PLC NG3 5FG   None   LEOMINSTERBUSINESSPLC  NG35FG
  • Related