I am trying to clean my text, so I need to remove some numbers and some combinations of numbers and symbols.
I have a string
s = '4/13/2022 2:20:03 pm from our side a more detailed analysis4 +7 (495) 797-8700 77-8282'
And I want to get
'pm from our side a more detailed analysis4'
I tried to use
re.compile(r'\b(?:/|-|\+|\:)(\d+)\b').sub(r' ', s)
but it returns
'4   2   pm from our side a more detailed analysis4 +7 (495) 797  77 '
What am I doing wrong, and how can I drop just the numbers and the combinations of a number and a specific symbol?
CodePudding user response:
You might match at least a single non-word character surrounded by optional digits, and then trim the result:
(?<!\S)\d*(?:[^\w\s]+\d*)+\s*
Explanation
(?<!\S) Assert a whitespace boundary to the left
\d* Match optional digits
(?:[^\w\s]+\d*)+ Match 1 or more times: at least one non-word, non-whitespace character followed by optional digits
\s* Match optional whitespace chars
import re
pattern = r"(?<!\S)\d*(?:[^\w\s]+\d*)+\s*"
s = "4/13/2022 2:20:03 pm from our side a more detailed analysis4 +7 (495) 797-8700 77-8282 kl-1381033 substr1.substr2.ab-2021-44228.a"
# Remove every whitespace-delimited token that starts with digits/symbols
print(re.sub(pattern, "", s))
Output
pm from our side a more detailed analysis4 kl-1381033 substr1.substr2.ab-2021-44228.a
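To cover the "trim the result" step, here is a minimal sketch that wraps the pattern in a helper and strips the leftover whitespace. The function name `strip_numeric_tokens` is only an illustration, not part of any library.

```python
import re

# Same pattern as above, compiled once for reuse.
NUMERIC_TOKEN = re.compile(r"(?<!\S)\d*(?:[^\w\s]+\d*)+\s*")

def strip_numeric_tokens(text):
    """Drop tokens made of digits and symbols, then trim the ends.

    Hypothetical helper name; the regex is the one from this answer.
    """
    return NUMERIC_TOKEN.sub("", text).strip()

s = "4/13/2022 2:20:03 pm from our side a more detailed analysis4 +7 (495) 797-8700 77-8282"
print(strip_numeric_tokens(s))
```

Note that the pattern requires at least one non-word character, so a bare number such as the 7 in "call 7 now" is left in place.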
CodePudding user response:
Try this expression:
(?:\/|-|\+|\:|^|\(|\)| )+?(\d+)
You can test it here: https://regex101.com/r/aANxBR/1
CodePudding user response:
It appears you want to remove words that start with a digit or a symbol.
You could do:
>>> import re
>>> s = '4/13/2022 2:20:03 pm from our side a more detailed analysis4 +7 (495) 797-8700 77-8282 kl-1381033 substr1.substr2.ab-2021-44228.a'
>>> ' '.join(w for w in s.split() if not re.match(r'[\d(+]\S+', w))
'pm from our side a more detailed analysis4 kl-1381033 substr1.substr2.ab-2021-44228.a'
Or a pure Python solution, without a regex:
>>> bad_start = '0123456789+('
>>> ' '.join(w for w in s.split() if w[0] not in bad_start)
'pm from our side a more detailed analysis4 kl-1381033 substr1.substr2.ab-2021-44228.a'
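The two filters are not quite equivalent, which may matter depending on your data. Here is a rough sketch that puts both in functions (the names `regex_filter` and `prefix_filter` are made up for this example) and shows the case where they differ: a one-character token like "7".

```python
import re

def regex_filter(text):
    # Drops words that start with a digit, "(" or "+" AND have at least
    # one more character after it (because of \S+).
    return " ".join(w for w in text.split() if not re.match(r"[\d(+]\S+", w))

def prefix_filter(text, bad_start="0123456789+("):
    # Drops every word whose first character is in bad_start,
    # including one-character words.
    return " ".join(w for w in text.split() if w[0] not in bad_start)

sample = "call 7 now 77-8282"
print(regex_filter(sample))
print(prefix_filter(sample))
```

If single-digit tokens should go as well, the pure Python version (or `\S*` instead of `\S+` in the regex) is probably the one you want.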