In the following text, I want to remove everything inside the parentheses, including the number and the string, along with the parentheses themselves. I used the following syntax, but I got 22701 instead of 2270. How can I get only 2270 using re.sub? Thanks
import regex as re
import numpy as np
import pandas as pd
text = "2270 (1st xyz)"
text_new = re.sub(r"[a-zA-Z()\s]","",text)
text_new
CodePudding user response:
Does the text always follow the same pattern? Try:
import re
import numpy as np
import pandas as pd
text = "2270 (1st xyz)"
text_new = re.sub(r"\s\([^)]*\)","",text)
print(text_new)
Output:
2270
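To see why the original attempt returned 22701, here is a small side-by-side sketch. It reuses the sample text from the question; the variable names original and fixed are just for illustration. The character class in the question deletes letters, parentheses, and whitespace one character at a time, so the digit in "1st" is never matched and stays in the result.

```python
import re

text = "2270 (1st xyz)"

# Original attempt: each letter, parenthesis, and whitespace character is
# removed individually, so the "1" from "1st" is left behind.
original = re.sub(r"[a-zA-Z()\s]", "", text)

# Matching the whole parenthesized group (plus the whitespace character
# before it) removes the digit as well.
fixed = re.sub(r"\s\([^)]*\)", "", text)

print(original)  # 22701
print(fixed)     # 2270
```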
CodePudding user response:
Simply use the regex pattern \(.*?\):
import re
text = "2270 (1st xyz)"
# A raw string avoids the invalid-escape warning Python raises for "\(".
text_new = re.sub(r"\(.*?\)", "", text)
print(text_new)
Output:
2270
Explanation of the pattern \(.*?\):
- The \ before each parenthesis tells re to treat the parenthesis as a literal character, since parentheses are special characters in re by default.
- The . matches any character except the newline character.
- The * matches zero or more occurrences of the pattern immediately before the *.
- The ? tells re to match as little text as possible, making the match non-greedy.
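The effect of the ? only shows up when the text contains more than one parenthesized group. The sketch below uses a made-up input (sample is not from the question) to compare the greedy and non-greedy forms:

```python
import re

# Hypothetical input with two parenthesized groups.
sample = "2270 (1st xyz) 3310 (2nd abc)"

# Greedy: .* runs from the first "(" to the last ")", so the text
# between the two groups is removed too.
greedy = re.sub(r"\(.*\)", "", sample)

# Non-greedy: .*? stops at the first ")", so each group is removed
# separately and the text between them is kept.
lazy = re.sub(r"\(.*?\)", "", sample)

print(repr(greedy))  # '2270 '
print(repr(lazy))    # '2270  3310 '
```

repr is used so the leftover spaces are visible in the output.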
Note the trailing space in the output. To remove it, add the space that precedes the parenthesis to the pattern:
import re
text = "2270 (1st xyz)"
# Raw string again; the leading space in the pattern removes the gap before "(".
text_new = re.sub(r" \(.*?\)", "", text)
print(text_new)
Output:
2270
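If the spacing around the parentheses is not always exactly one space, a slightly more tolerant variant may help. This is only a sketch: strip_parenthesized is a hypothetical helper name, and it assumes the parentheses are never nested.

```python
import re

def strip_parenthesized(s):
    """Remove every (...) group and any whitespace directly before it.

    Hypothetical helper; assumes parentheses are not nested.
    """
    # \s* absorbs zero or more whitespace characters before the group;
    # [^)]* matches everything up to the first closing parenthesis.
    cleaned = re.sub(r"\s*\([^)]*\)", "", s)
    # strip() handles whitespace left over at either end of the string.
    return cleaned.strip()

print(strip_parenthesized("2270 (1st xyz)"))  # 2270
print(strip_parenthesized("2270(1st xyz)"))   # 2270
```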