I have some address data like:
Address
Buffalo, NY, 14201
Stackoverflow Street, New York, NY, 9999
I'd like to split these into columns like:
Street City State Zip
NaN Buffalo NY 14201
StackOverflow Street New York NY 99999
Essentially, I'd like to shift my strings over by one in each column in the result.
With Pandas I know I can split columns like:
import pandas as pd
df = pd.DataFrame(
data={'Address': ['Buffalo, NY, 14201', 'Stackoverflow Street, New York, NY, 99999']}
)
df[['Street','City','State','Zip']] = (
df['Address']
.str.split(',', expand=True)
.applymap(lambda col: col.strip() if col else col)
)
but need to figure out how to conditionally shift columns when my result is only 3 columns.
CodePudding user response:
First, create a function to reverse a split for each row. Because if you split normally, the NaN will be at the end, so you reverse the order and split the list now the NaN will be at the end but the list is reversed.
Then, apply it to all rows.
Then, rename the columns because they will be integers.
Finally, set them in the right order.
fn = lambda x: pd.Series([i for i in reversed(x.split(','))])
pad = df['Address'].apply(fn)
pad
looks like this right now,
0 1 2 3
0 14201 NY Buffalo NaN
1 99999 NY New York Stackoverflow Street
Just need to rename the columns and flip the order back.
pad.rename(columns={0:'Zip',1:'State',2:'City',3:'Street'},inplace=True)
df = pad[['Street','City','State','Zip']]
Output:
Street City State Zip
0 NaN Buffalo NY 14201
1 Stackoverflow Street New York NY 99999
CodePudding user response:
Use a bit of numpy magic to reorder the columns with None on the left:
df2 = df['Address'].str.split(',', expand=True)
df[['Street','City','State','Zip']] = df2.to_numpy()[np.arange(len(df))[:,None], np.argsort(df2.notna())]
Output:
Address Street City State Zip
0 Buffalo, NY, 14201 None Buffalo NY 14201
1 Stackoverflow Street, New York, NY, 99999 Stackoverflow Street New York NY 99999
Another idea, add as many commas as needed to have n-1 (here 3) before splitting:
df[['Street','City','State','Zip']] = (
df['Address'].str.count(',')
.rsub(4-1).map(lambda x: ','*x)
.add(df['Address'])
.str.split(',', expand=True)
)
Output:
Address Street City State Zip
0 Buffalo, NY, 14201 Buffalo NY 14201
1 Stackoverflow Street, New York, NY, 99999 Stackoverflow Street New York NY 99999
CodePudding user response:
Well I found a solution but not sure if there is something more performant out there. Open to other ideas.
def split_shift(s: str) -> list[str]:
split_str: list[str] = s.split(',')
# If split is only 3 items, shift things over by inserting a NA in front
if len(split_str) == 3:
split_str.insert(0,pd.NA)
return split_str
df[['Street','City','State','Zip']] = pd.DataFrame(df['Address'].apply(lambda x: split_shift(x)).tolist())