Pandas Split 1 Column into Multiple Columns where Delimited Split size Can Vary-CodePudding

I have some address data like:

Address
Buffalo, NY, 14201
Stackoverflow Street, New York, NY, 9999

I'd like to split these into columns like:

              Street      City   State     Zip
NaN                    Buffalo      NY   14201
StackOverflow Street  New York      NY   99999

Essentially, I'd like to shift my strings over by one in each column in the result.

With Pandas I know I can split columns like:

import pandas as pd
df = pd.DataFrame(
    data={'Address': ['Buffalo, NY, 14201', 'Stackoverflow Street, New York, NY, 99999']}
)

df[['Street','City','State','Zip']] = (
    df['Address']
    .str.split(',', expand=True)
    .applymap(lambda col: col.strip() if col else col)
)

but need to figure out how to conditionally shift columns when my result is only 3 columns.

CodePudding user response：

First, create a function to reverse a split for each row. Because if you split normally, the NaN will be at the end, so you reverse the order and split the list now the NaN will be at the end but the list is reversed.

Then, apply it to all rows.

Then, rename the columns because they will be integers.

Finally, set them in the right order.

fn = lambda x: pd.Series([i for i in reversed(x.split(','))])
pad = df['Address'].apply(fn)

pad looks like this right now,

    0       1   2           3
0   14201   NY  Buffalo     NaN
1   99999   NY  New York    Stackoverflow Street

Just need to rename the columns and flip the order back.

pad.rename(columns={0:'Zip',1:'State',2:'City',3:'Street'},inplace=True)
df = pad[['Street','City','State','Zip']]

Output:

    Street                  City        State   Zip
0   NaN                     Buffalo     NY      14201
1   Stackoverflow  Street   New York    NY      99999

CodePudding user response：

Use a bit of numpy magic to reorder the columns with None on the left:

df2 = df['Address'].str.split(',', expand=True)

df[['Street','City','State','Zip']] = df2.to_numpy()[np.arange(len(df))[:,None], np.argsort(df2.notna())]

Output:

                                     Address                Street       City State     Zip
0                         Buffalo, NY, 14201                  None    Buffalo    NY   14201
1  Stackoverflow Street, New York, NY, 99999  Stackoverflow Street   New York    NY   99999

Another idea, add as many commas as needed to have n-1 (here 3) before splitting:

df[['Street','City','State','Zip']] = (
 df['Address'].str.count(',')
 .rsub(4-1).map(lambda x: ','*x)
 .add(df['Address'])
 .str.split(',', expand=True)
 )

Output:

                                     Address                Street       City State     Zip
0                         Buffalo, NY, 14201                          Buffalo    NY   14201
1  Stackoverflow Street, New York, NY, 99999  Stackoverflow Street   New York    NY   99999

CodePudding user response：

Well I found a solution but not sure if there is something more performant out there. Open to other ideas.

def split_shift(s: str) -> list[str]:
    split_str: list[str] = s.split(',')
    # If split is only 3 items, shift things over by inserting a NA in front
    if len(split_str) == 3:
        split_str.insert(0,pd.NA)
    return split_str

df[['Street','City','State','Zip']] = pd.DataFrame(df['Address'].apply(lambda x: split_shift(x)).tolist())