I am trying to use a regex to remove a word (ORIGIN), whitespace, special characters and standalone numbers, but not digits that are part of a word/string. E.g.
ORIGIN
1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
//
My pattern removes every digit, including the 1 in malwmrllp1:
import re
text_file = open('mytext.txt').read()
new_txt = re.sub('[\\b\\d \\b\s*$ \sORIGIN$\W ]', '', text_file)
print(new_txt, len(new_txt))
My output is:
malwmrllplallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 109
The desired output should be:
malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
CodePudding user response:
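Your current pattern is one big character class, so everything between the brackets is treated as a set of single characters rather than as a sequence, which is why it does not do what you expect. Depending on whether underscores should appear in your result or not, try re.findall with raw-string notation and a pattern like: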
\b(?!(?:ORIGIN|[_\d]+)\b)\w+
\b - Word boundary;
(?!(?:ORIGIN|[_\d]+)\b) - Negative lookahead with a nested non-capturing group that matches either ORIGIN or 1+ underscores/digits before a trailing word boundary;
\w+ - 1+ word characters.
import re
text_file = """ORIGIN
1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
//"""
new_txt = ''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', text_file))
print(new_txt, len(new_txt))
Prints:
malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
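To see why the character class in the question also dropped the 1 in malwmrllp1, it may help to compare both approaches on a single short line. This is only a minimal sketch; the names sample, as_class and as_tokens are made up for illustration and the character class is a simplified stand-in for the one in the question:

```python
import re

sample = "1 malwmrllp1 lallalwgpd"

# Inside [...] every item is a single-character alternative, so \d removes
# each digit wherever it occurs, including digits that are part of a word.
as_class = re.sub(r'[\d\s]', '', sample)

# The lookahead pattern works on whole tokens instead: a token is skipped
# only if it is ORIGIN or consists solely of digits/underscores.
as_tokens = ''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', sample))

print(as_class)   # malwmrllplallalwgpd
print(as_tokens)  # malwmrllp1lallalwgpd
```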
CodePudding user response:
Using RE for this is an interesting academic exercise but extending the functionality is fraught with danger unless one is very familiar with the technique.
This answer may look long-winded but you should be able to see how easy it would be to extend it so that other tokens/patterns can be excluded or included. It's also readily maintainable because anyone else having to modify the code isn't going to get a migraine while trying to figure out how the RE works.
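FILENAME = 'mytext.txt'

def keep(t):
    # Reject purely numeric tokens and the ORIGIN and // markers.
    if t.isdigit() or t == 'ORIGIN' or t == '//':
        return False
    return True

with open(FILENAME) as f:
    # split() with no arguments splits on any run of whitespace.
    new_txt = ''.join(filter(keep, f.read().split()))
print(new_txt, len(new_txt))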
Output:
malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
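As a rough illustration of how this could be extended, the excluded markers might be moved into a set so that new tokens can be added in one place. The names EXCLUDED and clean are hypothetical and not part of the answer above, and the sample is a shortened version of the input:

```python
# Hypothetical extension: EXCLUDED and clean() are illustrative names only.
EXCLUDED = {'ORIGIN', '//'}

def keep(token):
    # Drop purely numeric tokens (position counters) and listed markers.
    return not (token.isdigit() or token in EXCLUDED)

def clean(text):
    # split() with no arguments splits on any run of whitespace.
    return ''.join(filter(keep, text.split()))

sample = """ORIGIN
1 malwmrllp1 lallalwgpd
//"""
result = clean(sample)
print(result, len(result))  # malwmrllp1lallalwgpd 20
```

Adding another marker to skip would then only mean adding it to EXCLUDED.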
CodePudding user response:
Another idea:
new_txt = re.sub('[\\W_]+|\\b(?:\\d+|ORIGIN)\\b', '', text_file)
This strips out all runs of non-word characters and underscores, OR runs of digits / "ORIGIN" that sit between word boundaries.
See this demo at tio.run (the regex is very basic, explanation at regex101)
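For completeness, here is a self-contained sketch of this substitution approach applied to the sample from the question. It writes the pattern as a raw string so each backslash appears only once; the variable name pattern is just for illustration:

```python
import re

text_file = """ORIGIN
1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
//"""

# First alternative: runs of non-word characters or underscores.
# Second alternative: standalone digit runs or the word ORIGIN.
pattern = re.compile(r'[\W_]+|\b(?:\d+|ORIGIN)\b')
new_txt = pattern.sub('', text_file)
print(new_txt, len(new_txt))
```

The 1 in malwmrllp1 survives because there is no word boundary between p and 1, so the second alternative cannot match there.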