I have a lot of address data, such as:
123 1st Ave Apt501, Flushing, New York, 00000, USA
234 West 20th Street 1A, New York, New York, 11111, USA
345 North 100st Street Apt. 110, New York, New York, 22222, USA
I would like to extract the street information. How can I delete the apartment information that follows "Ave" and "Street"?
The cleaned addresses should look like this:
123 1st Ave, Flushing, New York, 00000, USA
234 West 20th Street, New York, New York, 11111, USA
345 North 100st Street, New York, New York, 22222, USA
Or the data can be cleaned as:
123 1st Ave
234 West 20th Street
345 North 100st Street
This is the code I tried. However, it cannot remove apartment information that does not contain "Apt" (for example, "1A").
import numpy as np

# Rows whose address contains 'Apt'
conditions = [df.address.str.contains('Apt')]
# Everything from 'Apt' to the end of the string
choices = [df.address.apply(lambda x: x[x.find('Apt'):])]
df['apt'] = np.select(conditions, choices, default = '')
# Everything before 'Apt'
choices2 = [df.address.apply(lambda x: x[:x.find('Apt')])]
df['address'] = np.select(conditions, choices2, default = df.address)
CodePudding user response:
I think you should put all the addresses in a list and split each one at the commas, so that the street information is the element at index 0.
addresses = ['123 1st Ave, Flushing, New York, 00000, USA', '234 West 20th Street, New York, New York, 11111, USA',
'345 North 100st Street, New York, New York, 22222, USA']
for s in addresses:
    print(s.split(',')[0])
Output
123 1st Ave
234 West 20th Street
345 North 100st Street
CodePudding user response:
To get the second option, I'd split at comma first and then process the first item with a regular expression.
df['street'] = (df.address
                .str.split(',')  # split at ,
                .str[0]          # get the first element
                .str.replace(r'(\s*Apt[.\s]*|(?<=Street)\s+)\d+\w?$',
                             '', regex=True)
               )
The regular expression matches
- either optional whitespace and Apt followed by zero or more dots or whitespace, OR whitespace that is preceded by Street (the lookbehind keeps Street itself in the result)
- one or more digits
- an optional letter
and all that at the end of the string ($).
The pattern might need some tweaking but gives the right result for the example.
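For the first option (keeping the rest of the address), one possible sketch is to clean only the part before the first comma and then put the remainder back. It uses plain `re` rather than pandas; the helper name `strip_unit` and the pattern `UNIT_RE` are made up for illustration, and the pattern is only an assumption that fits the three sample addresses.

```python
import re

# Assumed pattern: an "Apt" marker with its number, or a bare unit such as
# "1A" that directly follows "Street", at the end of the street segment.
UNIT_RE = re.compile(r'(\s*Apt[.\s]*|(?<=Street)\s+)\d+\w?$')

def strip_unit(address):
    """Remove the apartment/unit from the first comma-separated segment."""
    # partition keeps everything after the first comma untouched
    street, sep, rest = address.partition(',')
    return UNIT_RE.sub('', street) + sep + rest

addresses = [
    '123 1st Ave Apt501, Flushing, New York, 00000, USA',
    '234 West 20th Street 1A, New York, New York, 11111, USA',
    '345 North 100st Street Apt. 110, New York, New York, 22222, USA',
]
for a in addresses:
    print(strip_unit(a))
```

On a DataFrame this could presumably be applied with df.address.apply(strip_unit). Street types other than "Street" (for example a bare unit after "Ave") would need the pattern extended.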