I have a pandas DataFrame:
id adress
0 Jame Homie Street. N:60 5555242424 La
1 London. 2322325234243 Stw St. N 8 St.bridge
2 32424244234 ddd st. ss Sk. N 63 Manchester
3 Mou st 147 Rochester Liv 33424245223
I want to pull the long numbers (such as 5555242424, 2322325234243, 32424244234, 33424245223) out of the adress column and put them in a new column.
Sample output:
id adress number
0 Jame Homie Street. N:60 La 5555242424
1 London. Stw St. N 8 St.bridge 2322325234243
2 ddd st. ss Sk. N 63 Manchester 32424244234
3 Mou st 147 Rochester Liv 33424245223
CodePudding user response:
Assuming you want to extract the first number that has at least 4 digits (so it ignores 60, 8, 63, 147 in your example), you can use:
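# Raw strings keep \d from being treated as a string escape.
df_payers["number"] = df_payers["adress"].str.extract(r"(\d{4,})", expand=False)
# Remove the number, then collapse the double space it leaves behind.
df_payers["adress"] = df_payers["adress"].str.replace(r"\d{4,}", "", regex=True).str.replace(r"\s+", " ", regex=True).str.strip()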
>>> df_payers
id adress number
0 0 Jame Homie Street. N:60 La 5555242424
1 1 London. Stw St. N 8 St.bridge 2322325234243
2 2 ddd st. ss Sk. N 63 Manchester 32424244234
3 3 Mou st 147 Rochester Liv 33424245223
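A self-contained sketch of the same idea, in case it helps to see what happens to a row that has no long number. The third address is made up for illustration and is not from the question; such a row should end up with a missing value in the number column.

```python
import pandas as pd

# Two addresses from the question plus one hypothetical row without a long number.
df_payers = pd.DataFrame({"adress": [
    "Jame Homie Street. N:60 5555242424 La",
    "Mou st 147 Rochester Liv 33424245223",
    "No long number here 12",
]})

pattern = r"\d{4,}"  # a run of four or more digits

# expand=False returns a Series; rows without a match become missing values.
df_payers["number"] = df_payers["adress"].str.extract(f"({pattern})", expand=False)

# Remove the number, collapse repeated whitespace, and trim the ends.
df_payers["adress"] = (
    df_payers["adress"]
    .str.replace(pattern, "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(df_payers)
```

Note that str.extract only captures the first match, while the replace removes every match, so an address with two long numbers would lose the second one.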
CodePudding user response:
You can split each address on whitespace and use a list comprehension to keep the all-digit tokens longer than 3 characters. Change that length if you need a different cutoff.
import pandas as pd

df = pd.DataFrame({
    "adress": ["Jame Homie Street. N:60 5555242424 La", "London. 2322325234243 Stw St. N 8 St.bridge",
               "32424244234 ddd st. ss Sk. N 63 Manchester", "Mou st 147 Rochester Liv 33424245223"],
})
cleanedAdress = []
numbers = []
for adress in df["adress"]:
    tempSplit = adress.split()
    # Keep the all-digit tokens longer than 3 characters.
    numericEx = [s for s in tempSplit if s.isdigit() and len(s) > 3]
    numbers.append(' '.join(numericEx))
    # Drop those tokens and rebuild the address from what is left.
    cleanedAdress.append(' '.join(s for s in tempSplit if s not in numericEx))
dfCleaned = pd.DataFrame({"adress": cleanedAdress, "numbers": numbers})
dfCleaned
adress numbers
0 Jame Homie Street. N:60 La 5555242424
1 London. Stw St. N 8 St.bridge 2322325234243
2 ddd st. ss Sk. N 63 Manchester 32424244234
3 Mou st 147 Rochester Liv 33424245223
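The token-based idea can also be wrapped in a small helper, which makes the edge cases easier to see. This is only a sketch: split_tokens and min_len are names chosen here for illustration, and the behaviour for zero or several numbers (empty string, or numbers joined by a space) is an assumption about what you would want.

```python
def split_tokens(address, min_len=4):
    """Split an address into (text_without_numbers, numbers).

    A token counts as a number when it is all digits and at least
    min_len characters long (the same cutoff as len(s) > 3).
    """
    tokens = address.split()

    def is_number(token):
        return token.isdigit() and len(token) >= min_len

    numbers = [t for t in tokens if is_number(t)]
    rest = [t for t in tokens if not is_number(t)]
    return " ".join(rest), " ".join(numbers)


print(split_tokens("32424244234 ddd st. ss Sk. N 63 Manchester"))
print(split_tokens("Mou st 147"))
```

Because it works on whole tokens, this approach will not find a number that is glued to other characters, such as one written right after "N:".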
CodePudding user response:
If you know all the address patterns, you can write regular expressions that extract the values.
In your example every line has a different layout, so one option is to rely on the length of the number: build a single regex that matches a long run of digits, then separate that match from the rest of the address.
import re
raw_addrs = """0 Jame Homie Street. N:60 5555242424 La
1 London. 2322325234243 Stw St. N 8 St.bridge
2 32424244234 ddd st. ss Sk. N 63 Manchester
3 Mou st 147 Rochester Liv 33424245223""".split('\n')
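# Split each line into its leading id and the address text.
id_addrs_regex = r'^(?P<id>\d+)\s+(?P<addr>.*)$'
data = [re.match(id_addrs_regex, line) for line in raw_addrs]
id_addrs = [(match.group('id'), match.group('addr')) for match in data]
# The number is the first run of 6 or more digits in each address.
number_re = r'\d{6,}'
numbers = [re.search(number_re, addr).group() for _, addr in id_addrs]
# Remove the number and normalise the whitespace it leaves behind.
output = [(id_addr[0], ' '.join(id_addr[1].replace(number, "").split()), number) for id_addr, number in zip(id_addrs, numbers)]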
The output is:
[('0', 'Jame Homie Street. N:60 La', '5555242424'),
('1', 'London. Stw St. N 8 St.bridge', '2322325234243'),
('2', 'ddd st. ss Sk. N 63 Manchester', '32424244234'),
('3', 'Mou st 147 Rochester Liv', '33424245223')]
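To see why the minimum length matters, here is a small sketch using only the standard re module. The helper name first_long_number is made up for this example; it returns the first run of digits that is at least min_digits long.

```python
import re


def first_long_number(address, min_digits):
    """Return the first run of at least min_digits digits, or None."""
    match = re.search(r"\d{%d,}" % min_digits, address)
    return match.group() if match else None


address = "Mou st 147 Rochester Liv 33424245223"
print(first_long_number(address, 3))  # 147 (the street number is long enough)
print(first_long_number(address, 6))  # 33424245223
```

A cutoff that is too low picks up street numbers, so choose a length that is longer than any house number in your data and no longer than the shortest number you want to extract.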
Hope it helps. It's just an idea, and the code can certainly be improved.