I have pandas dataframe
import pandas as pd
data = {"Column1": ["258 E SONORA ST SAN",
"57474 SAXONY WAY APT 223 WESLEY",
"62748 CALIFORNIA ST APT 2 SAN",
"3211 LONGLAKE DR FERN",
"420 S PLYMOUTH CT APT 265",
"AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG",
"224-22 141 STREET RICHMOND",
"15624 274TH ST CAMBRIA",
"778 SANTO DOMINGO AVE SW PALM",
"261 BROADMOOR DR SOUTH SIOUX"],
"Colum2" : ["BERNARDINO", "CHAPEL", "FRANCISCO", "CREEK", "CHICAGO", "VALLEY", "HILL", "HEIGHTS", "BAY", "CITY"]}
df = pd.DataFrame(data)
df
Output
Column1 Colum2
0 258 E SONORA ST SAN BERNARDINO
1 57474 SAXONY WAY APT 223 WESLEY CHAPEL
2 62748 CALIFORNIA ST APT 2 SAN FRANCISCO
3 3211 LONGLAKE DR FERN CREEK
4 420 S PLYMOUTH CT APT 265 CHICAGO
5 AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG VALLEY
6 224-22 141 STREET RICHMOND HILL
7 15624 274TH ST CAMBRIA HEIGHTS
8 778 SANTO DOMINGO AVE SW PALM BAY
9 261 BROADMOOR DR SOUTH SIOUX CITY
and I have a list
of values at where I need to split the string in column1
split_city = ["ST","DR", "STREET", "AVE SW"]
I also want to include a split after APT
and digit characters.
How can I Split a string of columns based on values in the list and add them to another column in Pandas DataFrame?
Required Output
Column1 Colum2
0 258 E SONORA ST SAN BERNARDINO
1 57474 SAXONY WAY APT 223 WESLEY CHAPEL
2 62748 CALIFORNIA ST APT 2 SAN FRANCISCO
3 3211 LONGLAKE DR FERN CREEK
4 420 S PLYMOUTH CT APT 265 CHICAGO
5 AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG VALLEY
6 224-22 141 STREET RICHMOND HILL
7 15624 274TH ST CAMBRIA HEIGHTS
8 778 SANTO DOMINGO AVE SW PALM BAY
9 261 BROADMOOR DR SOUTH SIOUX CITY
CodePudding user response:
I don't know if there's a nice way in Pandas to do this, but because the edge cases are so numerous here, it'd be easier to combine the addresses and then use regex instead of trying to split on your split choices (while also accounting for apartments):
Solution
import re
pattern = re.compile(r"([-\d\w ]*)\s(ST|WAY|DR|STREET|AVE|N|S|E|W|SW|SE|NW|NE|APT \d*)\s([\w ]*)")
column1 = ["258 E SONORA ST SAN",
"57474 SAXONY WAY APT 223 WESLEY",
"62748 CALIFORNIA ST APT 2 SAN",
"3211 LONGLAKE DR FERN",
"420 S PLYMOUTH CT APT 265",
"AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG",
"224-22 141 STREET RICHMOND",
"15624 274TH ST CAMBRIA",
"778 SANTO DOMINGO AVE SW PALM",
"261 BROADMOOR DR SOUTH SIOUX"]
column2 = ["BERNARDINO", "CHAPEL", "FRANCISCO", "CREEK", "CHICAGO", "VALLEY", "HILL", "HEIGHTS", "BAY", "CITY"]
combined = [" ".join(t) for t in zip(column1, column2)]
streets = []
cities = []
for t in (re.findall(pattern, s) for s in combined):
*street, city = t[0]
streets.append(" ".join(street))
cities.append(city)
df = pd.DataFrame({"street": streets, "city": cities})
Output:
In [10]: pd.DataFrame({"street": streets, "city": cities})
Out[10]:
street city
0 258 E SONORA ST SAN BERNARDINO
1 57474 SAXONY WAY APT 223 WESLEY CHAPEL
2 62748 CALIFORNIA ST APT 2 SAN FRANCISCO
3 3211 LONGLAKE DR FERN CREEK
4 420 S PLYMOUTH CT APT 265 CHICAGO
5 AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG VALLEY
6 224-22 141 STREET RICHMOND HILL
7 15624 274TH ST CAMBRIA HEIGHTS
8 778 SANTO DOMINGO AVE SW PALM BAY
9 261 BROADMOOR DR SOUTH SIOUX CITY