Remove non-numeric characters including numbers that form a URL

I have a string made up of a set of numbers and a URL. I need all the numeric characters except the ones that belong to the URL. Below is my code to remove all non-numeric characters, but it doesn't remove the numbers from the URL.

test = '4758 11b98https://www.website11/111'
re.sub("[^0-9]","",test)

expected result: 47581198

CodePudding user response：

original answer

Change strategy: it is much easier to keep just the leading digits and ignore the rest:

import re
test = '47581198https://www.website11/111'
re.findall(r'^\d+', test)[0]

Or, using match, if you cannot be sure that the leading digits are present:

m = re.match(r'\d+', test)
if m:
    m = m.group()

Output: '47581198'

Edit after question change

If you're sure that the 'http://' or 'https://' string cannot appear in your initial number, then you need two passes: one to remove the URL, and another to clean the number.

test = '4758 11b98https://www.website11/1111'
re.sub(r'\D', '', re.sub(r'https?://.*', '', test))

Output: '47581198'
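
The two passes can also be wrapped in a small helper. This is only a minimal sketch: the function name digits_before_url is hypothetical, and it assumes that everything from the first http:// or https:// onward belongs to the URL.

```python
import re

def digits_before_url(text):
    """Return the digits of `text` that appear before the first URL."""
    # Pass 1: drop everything from the first http:// or https:// onward.
    without_url = re.sub(r'https?://.*', '', text)
    # Pass 2: drop every remaining non-digit character.
    return re.sub(r'\D', '', without_url)

print(digits_before_url('4758 11b98https://www.website11/1111'))  # 47581198
```

If the string contains no URL, the first pass leaves it unchanged and the helper simply returns all of its digits.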

CodePudding user response：

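Try the expression below. The lookahead keeps only the runs of digits that are followed, somewhere later in the string, by http:

import re
test = '4758 11b98https://www.website11/111'
y = re.compile(r'([0-9]+)(?=.*http)')
tokens = y.findall(test)
print(''.join(tokens))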

CodePudding user response：

You could match the whole URL, starting at http:// or https://, so that the digits inside it are consumed without being captured, and use an alternation | to capture the other digits in group 1.

Then in the output, join all the digits from group 1 with an empty string.

https?://\S+|(\d+)

For example

import re

pattern = r"https?://\S+|(\d+)"
s = "4758 11b98https://www.website11/111"

print(''.join(re.findall(pattern, s)))

Output

47581198
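
To see why the join works, it may help to look at what re.findall returns before joining. With a single capturing group, re.findall returns that group's text for every match, which is an empty string wherever the URL branch matched instead. A small sketch, assuming the pattern https?://\S+|(\d+) and the input string from the question:

```python
import re

pattern = r"https?://\S+|(\d+)"
s = "4758 11b98https://www.website11/111"

# One entry per match: the digits from group 1, or '' where the URL branch matched.
groups = re.findall(pattern, s)
print(groups)           # ['4758', '11', '98', '']
print(''.join(groups))  # 47581198
```

The empty string contributed by the URL match disappears in the join, so no filtering step is needed.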