Imagine I have a string like dslkf 234 dkf23 12asd 2 23 4.
I want to replace all standalone numbers with <NUM>.
I have tried re.sub('\s\d+\s', ' <NUM> ', s).
I want to end up with dslkf <NUM> dkf23 12asd <NUM> <NUM> <NUM>,
but what I get is: dslkf <NUM> dkf23 12asd <NUM> 23 4
I know why the "4" is not replaced: it is not followed by a whitespace character. But I couldn't work out why the "23" is skipped.
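Listing the individual matches shows why the "23" is skipped. This is only a small sketch; the variable name matches is my own. Each match of \s\d+\s includes the space on both sides, and re.sub only replaces non-overlapping matches, so the space after "2" is already used up when the scan reaches "23".

```python
import re

s = "dslkf 234 dkf23 12asd 2 23 4"

# re.finditer yields the same non-overlapping matches that re.sub would replace.
# Every match consumes its trailing space, so " 2 " leaves no leading space for "23".
matches = [m.group() for m in re.finditer(r"\s\d+\s", s)]
print(matches)  # [' 234 ', ' 2 ']
```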
CodePudding user response:
Do a replacement on \b\d+\b, which matches a run of digits with a word boundary on each side:
import re

inp = "dslkf 234 dkf23 12asd 2 23 4"
# \b does not consume any characters, so adjacent numbers are all matched
output = re.sub(r'\b\d+\b', '<NUM>', inp)
print(output)  # dslkf <NUM> dkf23 12asd <NUM> <NUM> <NUM>
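One thing to keep in mind with word boundaries: \b also sits between a digit and punctuation, so numbers next to punctuation count as standalone too. A minimal sketch with a made-up input (the string below is not from the question):

```python
import re

# Hypothetical input, not from the question: a decimal number.
text = "pi is 3.14"

# "." is a non-word character, so there is a word boundary on each side of it
# and the two digit runs are replaced separately.
result = re.sub(r"\b\d+\b", "<NUM>", text)
print(result)  # pi is <NUM>.<NUM>
```

If only whitespace should delimit a number, a whitespace-based pattern may be the better fit.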
CodePudding user response:
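I found the answer myself.
Lookbehind and lookahead are very helpful here: they check the surrounding whitespace without consuming it, so the space after one number is still available as the space before the next.
For the end of the string, I used the $ anchor.
The code looks like this: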
import re
s = "dslkf 234 dkf23 12asd 2 23 4"
pattern = r"(?<=\s)\d+(?=\s|$)"
new_s = re.sub(pattern, '<NUM>', s)
Although I found my answer before posting, I didn't find a similar question, so I posted it anyway for future seekers.
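Note that the lookbehind (?<=\s) needs an actual whitespace character before the digits, so a number at the very start of the string is left alone. A possible variant, offered as a sketch rather than as the poster's method, uses negative lookarounds: "not preceded and not followed by a non-whitespace character". The input string is the question's example with a number added in front, purely for illustration.

```python
import re

# Hypothetical input: the question's string with a number added at the start.
s = "42 dslkf 234 dkf23 12asd 2 23 4"

# Pattern from the answer above: requires a whitespace character before the digits.
with_lookbehind = re.sub(r"(?<=\s)\d+(?=\s|$)", "<NUM>", s)

# Variant: no non-whitespace character on either side,
# which also holds at the start and at the end of the string.
with_negative = re.sub(r"(?<!\S)\d+(?!\S)", "<NUM>", s)

print(with_lookbehind)  # 42 dslkf <NUM> dkf23 12asd <NUM> <NUM> <NUM>
print(with_negative)    # <NUM> dslkf <NUM> dkf23 12asd <NUM> <NUM> <NUM>
```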
CodePudding user response:
You don't necessarily need regex to do this. Here is a faster alternative using split() and join():
data = "dslkf 234 dkf23 12asd 2 23 4"
new_data = " ".join(word if not word.isdigit() else "<NUM>" for word in data.split())
print(new_data) # dslkf <NUM> dkf23 12asd <NUM> <NUM> <NUM>
- We split the sentence into words and check whether each word consists only of digits. If so, we replace it with <NUM>.
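One difference from the regex approaches is worth knowing: split() with no arguments splits on any run of whitespace, and " ".join() puts single spaces back, so the original spacing is normalized. The sketch below uses a made-up input with a double space and a tab to show this; the regex shown for comparison is just one whitespace-delimited pattern, not something taken from the answer above.

```python
import re

# Hypothetical input with a double space and a tab between words.
data = "dslkf  234\tdkf23 12asd 2"

# split()/join(): every run of whitespace becomes a single space.
joined = " ".join("<NUM>" if word.isdigit() else word for word in data.split())

# Regex substitution: only the digits are touched, whitespace is kept as-is.
substituted = re.sub(r"(?<!\S)\d+(?!\S)", "<NUM>", data)

print(repr(joined))       # 'dslkf <NUM> dkf23 12asd <NUM>'
print(repr(substituted))  # 'dslkf  <NUM>\tdkf23 12asd <NUM>'
```

If the exact spacing of the input matters, the regex version is the safer choice; otherwise the split()/join() version is fine.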