I'd like to insert a character (a comma in this case) before the last set of numbers, if one is present, in the values of a Pandas column. A sample of the original dataframe is below:
import pandas as pd
data = {'ID': ['1', '2', '3'], 'Address': ['123 Nelson Avenue, Redmont Central, Redmont 0987', '123 Nelson Avenue, Redmont Central, Redmont', '123 Nelson Avenue, Redmont Central, Redmont 87']}
df_addresses = pd.DataFrame(data)
The expected output is below:
data_expected = {'ID': ['1', '2', '3'], 'Address': ['123 Nelson Avenue, Redmont Central, Redmont, 0987', '123 Nelson Avenue, Redmont Central, Redmont', '123 Nelson Avenue, Redmont Central, Redmont, 87']}
df_addresses_expected = pd.DataFrame(data_expected)
Ideally, a comma is inserted before the last set of numbers in each column value. If the last set of characters is not a number-like value, the value is left as it is. Any thoughts on this?
CodePudding user response:
You could define a function to insert a comma into a string as you describe, and apply that function to the address column:
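def insert_comma(string):
    # Leave empty strings and strings that do not end in a digit unchanged
    if string == '':
        return string
    if string[-1] not in '0123456789':
        return string
    words = string.split(' ')
    # A single word has nothing before it to attach a comma to
    if len(words) <= 1:
        return string
    # Append a comma to the word preceding the trailing number
    words[-2] += ','
    return ' '.join(words)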
df_addresses['Address'] = df_addresses['Address'].apply(insert_comma)
df_addresses
ID Address
0 1 123 Nelson Avenue, Redmont Central, Redmont, 0987
1 2 123 Nelson Avenue, Redmont Central, Redmont
2 3 123 Nelson Avenue, Redmont Central, Redmont, 87
CodePudding user response:
Something like this?
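def check_for_final_comma(input_str):
    input_str_split = input_str.split(' ')
    # Need at least two words: a trailing number and something before it
    if len(input_str_split) < 2:
        return input_str
    last_word = input_str_split[-1]
    # Only act when the last word is numeric and not already preceded by a comma
    if last_word.isnumeric() and not input_str_split[-2].endswith(","):
        input_str_split[-2] += ","
        return " ".join(input_str_split)
    return input_str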
df_addresses['Address_New'] = df_addresses['Address'].apply(check_for_final_comma)
CodePudding user response:
You can do it without using apply, like this:
df['Address'] = df['Address'].str.extract(r'(.+?)\s*(\d+)?$').fillna('').assign(t=', ')[[0, 't', 1]].sum(axis=1).str.strip(', ')
Output:
>>> df
ID Address
0 1 123 Nelson Avenue, Redmont Central, Redmont, 0987
1 2 123 Nelson Avenue, Redmont Central, Redmont
2 3 123 Nelson Avenue, Redmont Central, Redmont, 87
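As a possible simplification of the regex idea above, the same result can be sketched with a single substitution. The function name `add_comma_before_trailing_number` and the constant `TRAILING_NUMBER` are made up for illustration. The pattern assumes the trailing number is a plain run of digits separated from the preceding text by whitespace, and it also consumes an existing comma so the result is not doubled up.

```python
import re

# Hypothetical pattern: an optional comma, whitespace, then digits at the end.
TRAILING_NUMBER = re.compile(r",?\s+(\d+)$")


def add_comma_before_trailing_number(address: str) -> str:
    """Return the address with ', ' placed before a trailing run of digits.

    Strings without a trailing number are returned unchanged.
    """
    return TRAILING_NUMBER.sub(r", \1", address)


addresses = [
    "123 Nelson Avenue, Redmont Central, Redmont 0987",
    "123 Nelson Avenue, Redmont Central, Redmont",
    "123 Nelson Avenue, Redmont Central, Redmont 87",
]

for address in addresses:
    print(add_comma_before_trailing_number(address))
```

With pandas, the same pattern and replacement should work through `df['Address'].str.replace(r',?\s+(\d+)$', r', \1', regex=True)`, which also avoids apply.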