I have a pandas df of addresses like this:
df['address']
0. ALL that certain piece, parcel or tract of land situate, lying and being in the City
of Travelers Rest, County of Greenville, State of South Carolina
1. Townes Street on the West, in the City of Greenville, County of Greenville, State of
South Carolina
2. State of South Carolina, County of Greenville, City of Hampton on the southern side
I want to extract the city name from each address.
Expected results:
Travelers Rest
Greenville
Hampton
My code is below:
df['city'] = df['address'].str.extract(r'\b(?:City of?) (.+?(?=[,]))')
My results:
Travelers Rest
Greenville
City of Hampton on the...
However, when the city name isn't followed by a comma, the match picks up the rest of the string.
If I don't end my regex with a comma, I won't get the full city name in some cases (for example, "Travelers Rest").
How can I resolve this?
CodePudding user response:
One option for the example data is to match "City of" and then capture one or more consecutive words that each start with a capital letter A-Z, followed by one or more characters that are neither whitespace nor a comma:
\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)
The capture stops at the first comma or at the first word that does not start with a capital letter, so "Hampton on the southern side" yields only "Hampton".
import pandas as pd

data = [
"ALL that certain piece, parcel or tract of land situate, lying and being in the City of Travelers Rest, County of Greenville, State of South Carolina",
"Townes Street on the West, in the City of Greenville, County of Greenville, State of South Carolina",
"State of South Carolina, County of Greenville, City of Hampton on the southern side"
]
df = pd.DataFrame(data, columns=["address"])
# Capture group 1 (the capitalized words after "City of") becomes the new column
df["city"] = df["address"].str.extract(r"\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)")
print(df)
Output
address city
0 ALL that certain piece, parcel or tract of lan... Travelers Rest
1 Townes Street on the West, in the City of Gree... Greenville
2 State of South Carolina, County of Greenville,... Hampton
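To see what each part of the pattern contributes, here is a minimal sketch of the same regex written in verbose form with the standard `re` module. The names `CITY_RE` and `extract_city` are made up for this illustration and are not part of the answer above; the sample strings are shortened versions of the addresses in the question.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE so each piece
# can carry a comment. CITY_RE and extract_city are illustrative names.
CITY_RE = re.compile(
    r"""
    \bCity\s+of\s+            # the words City of, allowing any whitespace between them
    (                         # group 1: the city name
      [A-Z][^\s,]+            # first word: a capital, then non-space, non-comma chars
      (?:\s+[A-Z][^\s,]+)*    # zero or more further capitalized words
    )
    """,
    re.VERBOSE,
)


def extract_city(address):
    """Return the captured city name, or None when nothing matches."""
    match = CITY_RE.search(address)
    return match.group(1) if match else None


samples = [
    "lying and being in the City of Travelers Rest, County of Greenville",
    "in the City of Greenville, County of Greenville",
    "County of Greenville, City of Hampton on the southern side",
    "no city is mentioned in this one",
]
cities = [extract_city(s) for s in samples]
print(cities)
```

One limitation worth keeping in mind: because the pattern relies on capitalization, a capitalized word directly after the city name (say, "City of Hampton On the southern side") would be captured as part of the name, and a city whose first word has only one character would not match at all.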