Remove all numbers except for the ones combined to string using python regex

Time:06-02

I am trying to use a regex to remove a word (ORIGIN), whitespace, special characters and numbers, but not digits that are part of a word/string. E.g.

ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//

My pattern removes all numbers, including the 1 in malwmrllp1:

import re

text_file = open('mytext.txt').read()
new_txt = re.sub('[\\b\\d \\b\s*$ \sORIGIN$\W ]', '', text_file)

print(new_txt, len(new_txt))

My output is:

malwmrllplallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 109

The desired output should be: malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110

CodePudding user response:

Your current pattern is a single character class, which matches one character at a time rather than the words and numbers you mean. Depending on whether you want underscores to show up in the result, try re.findall with raw-string notation:


\b(?!(?:ORIGIN|[_\d]+)\b)\w+


  • \b - Word-boundary;
  • (?!(?:ORIGIN|[_\d]+)\b) - Negative lookahead with nested non-capture group to match either ORIGIN or 1+ underscore/digit combinations before a trailing word-boundary;
  • \w+ - 1+ word-characters.

import re
  
text_file = """ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//"""

new_txt = ''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', text_file))
print(new_txt, len(new_txt))

Prints:

malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
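
To make the difference between a character class and the lookahead more concrete, here is a small sketch. The class `[ORIGIN\d]` is a simplified stand-in for the one in the question, and the token list is made up for illustration.

```python
import re

# Inside [...] the letters of ORIGIN are just separate characters,
# so every O, R, I, G, N and every digit is removed on its own.
class_result = re.sub(r'[ORIGIN\d]', '', 'ORIGIN 61 RING p1')
print(repr(class_result))

# The lookahead rejects a token only if the WHOLE token is ORIGIN
# or consists solely of digits/underscores.
pattern = re.compile(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+')
tokens = ['ORIGIN', '61', '__', 'malwmrllp1', 'ORIGINS', '1a']
kept = [t for t in tokens if pattern.fullmatch(t)]
print(kept)
```

Tokens that merely start with ORIGIN or with a digit are kept, because the trailing \b inside the lookahead requires the excluded alternative to span the entire word.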

CodePudding user response:

Using RE for this is an interesting academic exercise but extending the functionality is fraught with danger unless one is very familiar with the technique.

This answer may look long-winded but you should be able to see how easy it would be to extend it so that other tokens/patterns can be excluded or included. It's also readily maintainable because anyone else having to modify the code isn't going to get a migraine while trying to figure out how the RE works.

FILENAME = 'mytext.txt'

def keep(t):
    if t.isdigit() or t == 'ORIGIN' or t == '//':
        return False
    return True

with open(FILENAME) as f:
    new_txt = ''.join(filter(keep, f.read().split()))
    print(new_txt, len(new_txt))

Output:

malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
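
As a rough sketch of the kind of extension mentioned above, the literal tokens to drop could be collected in a set. The names EXCLUDED and clean, the shortened sample text and the extra token 'HEADER' are hypothetical and only serve the illustration.

```python
# Hypothetical extension: literal tokens to drop live in one set.
EXCLUDED = {'ORIGIN', '//'}

def keep(token, excluded=EXCLUDED):
    """Return True for tokens that belong to the sequence."""
    return not (token.isdigit() or token in excluded)

def clean(text, excluded=EXCLUDED):
    """Join every whitespace-separated token that passes keep()."""
    return ''.join(t for t in text.split() if keep(t, excluded))

sample = """ORIGIN
    1 malwmrllp1 lallalwgpd
    61 lqvgqvelgg

//"""
print(clean(sample))

# Excluding one more token only means adding it to the set.
print(clean('HEADER 1 abc', EXCLUDED | {'HEADER'}))
```

Adding or removing an excluded token then touches one line of data rather than the filtering logic.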

CodePudding user response:

Another idea:

new_txt = re.sub('[\\W_]+|\\b(?:\\d+|ORIGIN)\\b', '', text_file)

This strips out every run of non-word characters and underscores, OR any run of digits / "ORIGIN" that stands alone between word boundaries.
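
A self-contained sketch of this substitution, run against the sample text from the question and written with raw-string notation (the extra 'ab_c1 7' input is made up to show how underscores are handled):

```python
import re

text_file = """ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//"""

# First alternative: runs of non-word characters or underscores.
# Second alternative: whole-word digit runs or the word ORIGIN.
pattern = r'[\W_]+|\b(?:\d+|ORIGIN)\b'

new_txt = re.sub(pattern, '', text_file)
print(new_txt, len(new_txt))

# Underscores are removed even inside a token, digits inside a token are not.
print(re.sub(pattern, '', 'ab_c1 7'))
```

Unlike the re.findall approach, this variant also deletes underscores that sit inside a token, which may or may not be what you want.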

See this demo at tio.run (the regex is very basic, explanation at regex101)
