How to separate a specific number from text data in Python

I have a pandas DataFrame:

id     adress

0     Jame Homie Street. N:60 5555242424 La
1     London. 2322325234243 Stw St. N 8 St.bridge
2     32424244234 ddd st. ss Sk. N 63 Manchester
3     Mou st 147 Rochester Liv 33424245223

I want to separate out the long numbers (like 5555242424, 2322325234243, 32424244234, 33424245223) and put them in a new column.

Sample output:

id     adress                                           number

0     Jame Homie Street. N:60 La                      5555242424 
1     London. Stw St. N 8 St.bridge                   2322325234243 
2     ddd st. ss Sk. N 63 Manchester                  32424244234 
3     Mou st 147 Rochester Liv                        33424245223

CodePudding user response：

Assuming you want to extract the first number that has at least 4 digits (so it ignores 60, 8, 63, 147 in your example), you can use:

# Raw strings keep the regex backslash from being read as a string escape
df_payers["number"] = df_payers["adress"].str.extract(r"(\d{4,})")
df_payers["adress"] = df_payers["adress"].str.replace(r"(\d{4,})", "", regex=True)

>>> df_payers
   id                           adress         number
0   0      Jame Homie Street. N:60  La     5555242424
1   1   London.  Stw St. N 8 St.bridge  2322325234243
2   2   ddd st. ss Sk. N 63 Manchester    32424244234
3   3        Mou st 147 Rochester Liv     33424245223
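
Note that removing the number leaves a double space behind (rows 0 and 1 above). Here is a minimal sketch, assuming the same `adress` column and the same 4-digit threshold, that also collapses the leftover whitespace. The name `df`, the `\b` word boundaries and the `n=1` limit are illustrative choices, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    "adress": [
        "Jame Homie Street. N:60 5555242424 La",
        "London. 2322325234243 Stw St. N 8 St.bridge",
        "32424244234 ddd st. ss Sk. N 63 Manchester",
        "Mou st 147 Rochester Liv 33424245223",
    ]
})

pattern = r"\b(\d{4,})\b"  # a standalone run of 4 or more digits

# expand=False returns a Series instead of a one-column DataFrame
df["number"] = df["adress"].str.extract(pattern, expand=False)

# Drop the first match, then collapse repeated whitespace and trim the ends
df["adress"] = (
    df["adress"]
    .str.replace(pattern, "", n=1, regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
```

Rows without a qualifying number would get `NaN` in `number` and keep their address unchanged.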

CodePudding user response：

Split each address on whitespace and use a list comprehension to pick out the all-digit tokens that are longer than 3 characters. Change the length check if you need a different cutoff.

import pandas as pd

df = pd.DataFrame({
    "adress":["Jame Homie Street. N:60 5555242424 La","London. 2322325234243 Stw St. N 8 St.bridge",
    "32424244234 ddd st. ss Sk. N 63 Manchester","Mou st 147 Rochester Liv 33424245223"],
})

cleanedAdress = []
numbers = []
for i in df.values:
    tempSplit = i[0].split()
    numericEx = [s for s in tempSplit if s.isdigit() if len(s) > 3]
    strEx = ''.join(numericEx)
    numbers.append(strEx)

    tempSplit.remove(strEx)
    tempSplit = ' '.join(tempSplit)
    cleanedAdress.append(tempSplit)

dfCleaned = pd.DataFrame({"adress":cleanedAdress,"numbers":numbers})

dfCleaned

                           adress        numbers
0      Jame Homie Street. N:60 La     5555242424
1   London. Stw St. N 8 St.bridge  2322325234243
2  ddd st. ss Sk. N 63 Manchester    32424244234
3        Mou st 147 Rochester Liv    33424245223
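
The loop above assumes every row contains exactly one long number; otherwise `tempSplit.remove(strEx)` raises a `ValueError`. Here is a hedged sketch of the same token-based idea that tolerates rows with no such number. `split_address` and `min_len` are hypothetical names introduced for this example:

```python
def split_address(text, min_len=4):
    """Return (cleaned_address, number); number is None if no token qualifies."""
    tokens = text.split()
    # First all-digit token with at least min_len characters, if any
    number = next((t for t in tokens if t.isdigit() and len(t) >= min_len), None)
    if number is not None:
        tokens.remove(number)  # removes only the first occurrence
    return " ".join(tokens), number


print(split_address("Jame Homie Street. N:60 5555242424 La"))
# ('Jame Homie Street. N:60 La', '5555242424')
print(split_address("Mou st 147 Rochester"))
# ('Mou st 147 Rochester', None)
```

Because the function works on a single string, it could be applied row by row to the `adress` column if the data lives in a DataFrame.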

CodePudding user response：

If you know all the address patterns, you can write regular expressions that extract the values directly.

Since each line in your example is formatted differently from the others, one option is to rely on the length of the number: build a single regex that matches a long run of digits, then split that match off from the rest of the address.

import re

raw_addrs = """0     Jame Homie Street. N:60 5555242424 La
1     London. 2322325234243 Stw St. N 8 St.bridge
2     32424244234 ddd st. ss Sk. N 63 Manchester
3     Mou st 147 Rochester Liv 33424245223""".split('\n')

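# Split each raw line into its leading id and the remaining address text
id_addrs_regex = r'^(?P<id>\d+)\s+(?P<addr>.*)$'
data = [re.match(id_addrs_regex, line) for line in raw_addrs]
id_addrs = [(match.group('id'), match.group('addr')) for match in data]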

number_re = r'\d{6,}'
numbers = [re.search(number_re, addr).group() for _, addr in id_addrs]

output = [(id_addr[0], ' '.join(id_addr[1].replace(number, "").split()), number) for id_addr, number in zip(id_addrs, numbers)]

The output is:

[('0', 'Jame Homie Street. N:60 La', '5555242424'),
 ('1', 'London. Stw St. N 8 St.bridge', '2322325234243'),
 ('2', 'ddd st. ss Sk. N 63 Manchester', '32424244234'),
 ('3', 'Mou st 147 Rochester Liv', '33424245223')]
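
One caveat with the snippet above: `re.search(...).group()` fails with an `AttributeError` when an address has no number of 6 or more digits. As a rough sketch of a more defensive variant (the function name `extract_number` and the `min_digits` parameter are made up for this example):

```python
import re


def extract_number(addr, min_digits=6):
    """Return (cleaned_addr, number), or (addr, None) when nothing matches."""
    # Doubled braces are literal braces in an f-string: this builds \b\d{6,}\b
    match = re.search(rf"\b\d{{{min_digits},}}\b", addr)
    if match is None:
        return addr, None
    # Cut the matched span out, then normalise the remaining whitespace
    remainder = addr[:match.start()] + addr[match.end():]
    return " ".join(remainder.split()), match.group()


print(extract_number("London. 2322325234243 Stw St. N 8 St.bridge"))
# ('London. Stw St. N 8 St.bridge', '2322325234243')
print(extract_number("Mou st 147 Rochester"))
# ('Mou st 147 Rochester', None)
```

Cutting by match position rather than `str.replace` removes only the matched occurrence, which matters if the same digits appear twice in one address.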

Hope it helps, it's just an idea, and of course the code can be better.