How to delete part of a string by matching conditions?

I have many addresses, such as:

123 1st Ave Apt501, Flushing, New York, 00000, USA
234 West 20th Street 1A, New York, New York, 11111, USA
345 North 100st Street Apt. 110, New York, New York, 22222, USA

I would like to extract the street information. How can I delete the apartment information that follows "Ave" or "Street"?

The cleaned addresses would look like this:

123 1st Ave, Flushing, New York, 00000, USA
234 West 20th Street, New York, New York, 11111, USA
345 North 100st Street, New York, New York, 22222, USA

Alternatively, the data could be reduced to the street only:

123 1st Ave
234 West 20th Street
345 North 100st Street

This is the code I tried. However, it cannot remove apartment information that does not contain "Apt" (such as "1A").

conditions = [df.address.str.contains('Apt')]
choices = [df.address.apply(lambda x: x[x.find('Apt'):])]
df['apt'] = np.select(conditions, choices, default = '')
choices2 = [df.address.apply(lambda x: x[:x.find('Apt')])]
df['address'] = np.select(conditions, choices2, default = df.address)

CodePudding user response：

I think you could put all the addresses in a list and split each one at the commas, so that the street field can be accessed at index 0. Note that the split only isolates the first field; any apartment information inside that field is left in place.

addresses  = ['123 1st Ave, Flushing, New York, 00000, USA', '234 West 20th Street, New York, New York, 11111, USA',
        '345 North 100st Street, New York, New York, 22222, USA']

for s in addresses:
    print(s.split(',')[0])

Output

123 1st Ave
234 West 20th Street
345 North 100st Street
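
To see what the split returns for unmodified input, here is a small check. It assumes the original addresses from the question rather than the cleaned list above; `first_fields` is just an illustrative name.

```python
# Original addresses from the question, apartment information still present.
addresses = [
    '123 1st Ave Apt501, Flushing, New York, 00000, USA',
    '234 West 20th Street 1A, New York, New York, 11111, USA',
    '345 North 100st Street Apt. 110, New York, New York, 22222, USA',
]

# Keep only the text before the first comma of each address.
first_fields = [s.split(',')[0] for s in addresses]

for field in first_fields:
    print(field)

# Prints:
# 123 1st Ave Apt501
# 234 West 20th Street 1A
# 345 North 100st Street Apt. 110
```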

CodePudding user response：

To get the second option, I'd split at comma first and then process the first item with a regular expression.

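df['street'] = (df.address
       .str.split(',') # split at ,
       .str[0] # get the first element
       # drop a trailing unit; the lookbehind keeps "Street" itself
       .str.replace(r'(?:\s+Apt[.\s]*|(?<=Street)\s+)\d+\w?$',
       '', regex=True)
       )

The regular expression matches

either whitespace and Apt followed by zero or more dots or whitespace, OR whitespace directly after Street (a lookbehind, so Street itself is not removed)
one or more digits
an optional word character, such as a letter

and all that at the end of the string ($).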

The pattern might need some tweaking but gives the right result for the example.
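
For the first output format, where the city, state, and zip code are kept, the same split-then-regex idea can be sketched without pandas. This is a minimal sketch using only the standard `re` module. `UNIT_PATTERN`, `street_only`, and `clean_address` are hypothetical names chosen for illustration, and the pattern has only been checked against the three example addresses.

```python
import re

# Hypothetical pattern: a trailing unit such as " Apt501" or " Apt. 110",
# or a bare " 1A" directly after "Street". Tuned to the three examples only.
UNIT_PATTERN = re.compile(r'(?:\s+Apt[.\s]*|(?<=Street)\s+)\d+\w?$')

def street_only(address):
    """Return the first comma-separated field without the trailing unit."""
    first_field = address.split(',')[0]
    return UNIT_PATTERN.sub('', first_field)

def clean_address(address):
    """Return the full address with the unit removed from the street field."""
    # partition splits at the first comma only and keeps the remainder intact.
    first_field, sep, rest = address.partition(',')
    return UNIT_PATTERN.sub('', first_field) + sep + rest

addresses = [
    '123 1st Ave Apt501, Flushing, New York, 00000, USA',
    '234 West 20th Street 1A, New York, New York, 11111, USA',
    '345 North 100st Street Apt. 110, New York, New York, 22222, USA',
]

for a in addresses:
    print(clean_address(a))
```

With a DataFrame like the one in the question, either function could be applied per row, for example df.address.apply(clean_address).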