Remove all numbers except those attached to a string, using Python regex

I am trying to use a regex to remove a word, whitespace, special characters, and numbers, but not digits that are part of a word/string. E.g.

ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//

The pattern below removes all numbers, including the 1 in malwmrllp1:

import re

text_file = open('mytext.txt').read()
new_txt = re.sub('[\\b\\d \\b\s*$ \sORIGIN$\W ]', '', text_file)

print(new_txt, len(new_txt))

My output is:

malwmrllplallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 109

The desired output should be: malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110

CodePudding user response：

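Depending on whether you want underscores to show up in your result, try re.findall with raw-string notation. The character class you currently use does not do what you intend, since everything inside [...] is treated as a set of single characters. A pattern along these lines should work: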

\b(?!(?:ORIGIN|[_\d]+)\b)\w+


\b - Word boundary;
(?!(?:ORIGIN|[_\d]+)\b) - Negative lookahead with a nested non-capturing group that rejects a token consisting of either ORIGIN or one or more underscores/digits, followed by a trailing word boundary;
\w+ - One or more word characters.

import re
  
text_file = """ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//"""

new_txt = ''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', text_file))
print(new_txt, len(new_txt))

Prints:

malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
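
To see why the character class in the question drops the embedded digit, it may help to compare it with a word-boundary pattern on one line of the sample. This is only a minimal sketch; the names `sample`, `by_class`, and `by_boundary` are illustrative and not from the question:

```python
import re

sample = "    1 malwmrllp1 lallalwgpd"

# Inside [...], \d is just one member of a set of single characters,
# so every digit is removed, including the one inside a word.
by_class = re.sub(r"[\d\s]", "", sample)

# \b\d+\b only matches a run of digits that forms a whole token,
# so the 1 attached to "malwmrllp" survives.
by_boundary = re.sub(r"\b\d+\b|\s+", "", sample)

print(by_class)     # malwmrllplallalwgpd
print(by_boundary)  # malwmrllp1lallalwgpd
```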

CodePudding user response：

Using RE for this is an interesting academic exercise but extending the functionality is fraught with danger unless one is very familiar with the technique.

This answer may look long-winded but you should be able to see how easy it would be to extend it so that other tokens/patterns can be excluded or included. It's also readily maintainable because anyone else having to modify the code isn't going to get a migraine while trying to figure out how the RE works.

FILENAME = 'mytext.txt'

def keep(t):
    if t.isdigit() or t == 'ORIGIN' or t == '//':
        return False
    return True

with open(FILENAME) as f:
    new_txt = ''.join(filter(keep, f.read().split()))
    print(new_txt, len(new_txt))

Output:

malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
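
The kind of extension mentioned above could look roughly like this. It is a sketch only: the `EXCLUDED` set and the `clean` helper are hypothetical names that are not part of the answer, and a shortened excerpt of the sample is inlined instead of read from a file so the block runs on its own.

```python
# Tokens to drop entirely; add more here as needed (hypothetical set).
EXCLUDED = {"ORIGIN", "//"}

def keep(token):
    # Drop pure-digit tokens and anything listed in EXCLUDED.
    return not (token.isdigit() or token in EXCLUDED)

def clean(text):
    # split() with no arguments splits on any run of whitespace.
    return "".join(filter(keep, text.split()))

# Shortened excerpt of the sample from the question.
sample = """ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl
    61 lqvgqvelgg

//"""

result = clean(sample)
print(result, len(result))
```

Excluding another marker would then be a matter of adding it to `EXCLUDED`, with no pattern to rework.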

CodePudding user response：

Another idea:

new_txt = re.sub('[\\W_]+|\\b(?:\\d+|ORIGIN)\\b', '', text_file)

This strips every run of non-word characters and underscores, OR any run of digits or the word "ORIGIN" standing alone between word boundaries.

See this demo at tio.run (the regex is very basic, explanation at regex101)
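
For readers who prefer raw strings, the same idea can be written without doubled backslashes. This is a minimal sketch, assuming `+` quantifiers after the character class and after `\d`; `PATTERN` and `strip_noise` are illustrative names, not from the answer:

```python
import re

# Raw-string form: r'\b' is the same pattern text as '\\b'.
PATTERN = re.compile(r"[\W_]+|\b(?:\d+|ORIGIN)\b")

def strip_noise(text):
    # Remove runs of non-word characters/underscores, plus standalone
    # digit runs and the standalone word ORIGIN.
    return PATTERN.sub("", text)

sample = "ORIGIN\n    1 malwmrllp1 lallalwgpd\n\n//"
print(strip_noise(sample))  # malwmrllp1lallalwgpd
```

Because re.sub evaluates word boundaries against the original text, the 1 in malwmrllp1 is never seen as a standalone token, even though the surrounding whitespace is removed in the same pass.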