I have a pandas dataset like this:
import pandas as pd
data = {'id': ['001', '002', '003','004'],
'address': ["William J. Clare\\n290 Valley Dr.\\nCasper, WY 82604\\nUSA",
"1180 Shelard Tower\\nMinneapolis, MN 55426\\nUSA",
"William N. Barnard\\n145 S. Durbin\\nCasper, WY 82601\\nUSA",
"215 S 11th ST"],
'locality': [None, None, None,'Laramie'],
'region': [None, None, None, 'WY'],
'Zipcode': [None, None, None, '87656'],
'Country': [None, None, None, 'US']
}
df = pd.DataFrame(data)
I tried to split the address column by new line but however since it has two \ followed by n. I am not able to do . Please help me in splitting the \n from address and exptrapolate into locality region zipcode and country.
sample output:
id address locality region Zipcode Country
1 290 Valley Dr. Casper WY 82604 USA
2 1180 Shelard Tower Minneapolis MN 55426 USA
3 145 S. Durbin Casper WY 82601 USA
4 215 S 11th ST Laramie WY 87656 US
I tried different methods to split \n using split command but it gives me extra \. And I am trying to keep it in pandas dataframe so that I can carry further analysis. I a new to python any help will be greatly appreciated.
Thanks
CodePudding user response:
Here is an approach using extract
instead of split and update
in place:
df.update(df['address'].str.extract(r'([^,] )(?:,\s(\w )\s*(\d )\\n(\w ))?$')
.set_axis(["address", "region", "Zipcode", "Country"], axis=1)
)
df['name'] = pd.NA
df.update(df['address'].str.extract(r'(?:(.*?)\\n)?(.*)\\n(. )')
.set_axis(['name', 'address', 'locality'], axis=1)
)
output:
id address locality region Zipcode Country name
0 001 290 Valley Dr. Casper WY 82604 USA William J. Clare
1 002 1180 Shelard Tower Minneapolis MN 55426 USA <NA>
2 003 145 S. Durbin Casper WY 82601 USA William N. Barnard
3 004 215 S 11th ST Laramie WY 87656 US <NA>
CodePudding user response:
Line two has no name so it may cause some trouble. Please check if your example is correct.
Given your df like this:
data = {'id': ['001', '002', '003','004'],
'address': ["William J. Clare\\n290 Valley Dr.\\nCasper, WY 82604\\nUSA",
"name\\n1180 Shelard Tower\\nMinneapolis, MN 55426\\nUSA",
"William N. Barnard\\n145 S. Durbin\\nCasper, WY 82601\\nUSA",
"215 S 11th ST"],
'locality': [None, None, None,'Laramie'],
'region': [None, None, None, 'WY'],
'Zipcode': [None, None, None, '87656'],
'Country': [None, None, None, 'US']
}
df = pd.DataFrame(data)
You can create a filter condition:
cond = df[["locality", "region", "Zipcode", "Country"]].isna()
and then use split
with expand=True
df.loc[cond.all(axis=1)].address.str.split(r"\\n", expand=True)
0 William J. Clare 290 Valley Dr. Casper, WY 82604 USA
1 name 1180 Shelard Tower Minneapolis, MN 55426 USA
2 William N. Barnard 145 S. Durbin Casper, WY 82601 USA