I have a string made up of a set of numbers and a URL. I need all of the numeric characters except the ones that belong to the URL. The code below removes all non-numeric characters, but it does not remove the digits that come from the URL.
import re
test = '4758 11b98https://www.website11/111'
re.sub("[^0-9]","",test)
expected result: 47581198
CodePudding user response:
original answer
Change strategy. It is much easier to keep only the leading digits and ignore the rest:
import re
test = '47581198https://www.website11/111'
re.findall(r'^\d+', test)[0]
Or use match if the leading digits are not guaranteed to be present:
m = re.match(r'\d+', test)
if m:
    m = m.group()
Output: '47581198'
Edit after question change
If you are sure that the string 'http://' or 'https://' cannot appear inside your initial number,
then you need two passes: one to remove the URL and another to clean up the number.
test = '4758 11b98https://www.website11/1111'
re.sub(r'\D', '', re.sub(r'https?://.*', '', test))
Output: '47581198'
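The two passes can be wrapped in a small helper, as a minimal sketch. The function name digits_before_url is made up for illustration, and it assumes that everything from the first http:// or https:// to the end of the line should be discarded.

```python
import re

def digits_before_url(text):
    # Hypothetical helper name, not part of the original answer.
    # Pass 1: drop everything from the first http:// or https:// onward.
    head = re.sub(r'https?://.*', '', text)
    # Pass 2: strip every non-digit character from what remains.
    return re.sub(r'\D', '', head)

print(digits_before_url('4758 11b98https://www.website11/1111'))  # 47581198
```

Note that this drops any digits that appear after the URL as well, since the first pass removes the rest of the line.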
CodePudding user response:
Try the expression below, which keeps only the runs of digits that are followed somewhere later in the string by 'http':
y = re.compile(r'([0-9]+)(?=.*http)')
tokens = y.findall(test)
print(''.join(tokens))
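A short sketch of how the lookahead behaves on a few inputs may help when deciding whether it fits. The helper name digits_followed_by_http and the second and third test strings are assumptions added for illustration.

```python
import re

# Digit runs are kept only if 'http' appears somewhere after them.
pattern = re.compile(r'([0-9]+)(?=.*http)')

def digits_followed_by_http(text):
    # Hypothetical helper name, used only for this illustration.
    return ''.join(pattern.findall(text))

with_url = digits_followed_by_http('4758 11b98https://www.website11/111')
without_url = digits_followed_by_http('4758 11b98')
two_urls = digits_followed_by_http('12 http://a1/2 34 http://b')

print(with_url)          # 47581198
print(repr(without_url)) # '' because no 'http' follows the digits
print(two_urls)          # 121234, digits inside the first URL are kept
```

So the lookahead works for a single trailing URL, but it returns nothing when there is no URL at all and keeps URL digits when a second URL follows.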
CodePudding user response:
You could match a URL that starts with http:// or https:// so that the digits attached to it are consumed without being captured, and use an alternation |
to capture the other digits in group 1.
Then, in the output, join all the group 1 values with an empty string.
https?://\S+|(\d+)
For example
import re
pattern = r"https?://\S+|(\d+)"
s = "4758 11b98https://www.website11/111"
print(''.join(re.findall(pattern, s)))
Output
47581198
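As a rough sketch, the same alternation can be put in a reusable function. The names URL_OR_DIGITS and digits_outside_urls and the second input string are assumptions for illustration. It relies on re.findall returning the group 1 value for every match, which is an empty string whenever the URL branch matched.

```python
import re

# First branch consumes a whole URL; second branch captures digits in group 1.
URL_OR_DIGITS = re.compile(r"https?://\S+|(\d+)")

def digits_outside_urls(text):
    # Hypothetical helper name. URL matches contribute '' to the join.
    return ''.join(URL_OR_DIGITS.findall(text))

print(digits_outside_urls("4758 11b98https://www.website11/111"))    # 47581198
print(digits_outside_urls("12 http://a1.com/2 34 https://b5/6 78"))  # 123478
```

Unlike the two-pass approach, this keeps digits that appear after a URL, as long as whitespace separates them from it.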