Python: replacing everything from a backslash until the next whitespace

As part of preprocessing my data, I want to replace everything from a backslash up to the next space with an empty string. For example, \fs24 or \qc23424 should be replaced with an empty string. There can be multiple occurrences of such backslash tags, and I want to remove all of them. I have created a list of tags to be eradicated, which I aim to use in a regular expression to clean the extracted text.

Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.

Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.

I am using the regular-expression-based replace function in Python:

updated = re.sub(r'/\fs\d+', '')

but this does not produce the desired result. Alternatively, I have built an eradicate list and replace each entry in a loop, from the highest number to the lowest, but this is a performance killer.

CodePudding user response：

Assuming a tag can also occur at the very beginning of your string, and to avoid false positives, you could use:

\s?(?<!\S)\\[a-z\d]+

and replace each match with an empty string.

\s? - Optionally match a whitespace character (for when a tag is mid-string and therefore preceded by a space);
(?<!\S) - Assert that the position is not preceded by a non-whitespace character (this also allows a match at the start of your input);
\\ - A literal backslash;
[a-z\d]+ - One or more (greedy) characters from the given class.
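
As a minimal sketch of how that pattern could be applied to the input from the question (the names TAG and strip_tags are made up for illustration, and the pattern is assumed to carry a `+` quantifier on the character class):

```python
import re

# Optional leading whitespace, a lookbehind that rejects a preceding
# non-whitespace character, a literal backslash, then one or more
# lowercase letters or digits.
TAG = re.compile(r"\s?(?<!\S)\\[a-z\d]+")


def strip_tags(text):
    """Remove backslash tags that start the string or follow whitespace."""
    return TAG.sub("", text)


sample = (
    r"This is a string \fs24 and it contains some texts and tags "
    r"\qc23424. which I want to remove from my string."
)
print(strip_tags(sample))
```

Because the lookbehind rejects a backslash glued to a preceding word, something like a\fs24 is left untouched. Whether that is the right behaviour depends on your data.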

CodePudding user response：

First, the / doesn't belong in the regular expression at all.

Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+'.) The \ before f is meant to be used literally; the \ before d is part of the character class matching the digits.

Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.

>>> re.sub(r'\\fs\d+', '', r"This is a string \fs24 and it contains...")
'This is a string  and it contains...'
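
To make the escaping point concrete, here is a small sketch (variable names are illustrative only) showing that the raw and non-raw spellings are the same pattern, and that re.sub needs all three arguments:

```python
import re

# Both literals spell the same seven-character pattern: \\fs\d+
raw_pattern = r"\\fs\d+"
plain_pattern = "\\\\fs\\d+"
assert raw_pattern == plain_pattern

text = r"This is a string \fs24 and it contains..."

# Pattern, replacement, then the string to operate on.
result = re.sub(raw_pattern, "", text)
print(repr(result))
```

Note that only the tag itself is removed here, so the spaces on either side of it remain.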

CodePudding user response：

Does that work for you?

re.sub(
    r"\\\w+\s*",  # a backslash, word characters, then any trailing whitespace;
    '',           # replace it with an empty string;
    input_string  # in your input string
)

>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'

CodePudding user response：

'\\' matches a literal '\' and '\w+' matches one or more word characters, so the match stops at the next space or punctuation mark.

import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)

output:

'This is a string  and it contains some texts and tags . which I want to remove from my string.'
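
The output above still has a doubled space and a space before the full stop, so it does not quite match the expected output in the question. One possible follow-up, sketched here under the assumption that it is acceptable to normalise spacing across the whole string (remove_tags is a hypothetical helper name):

```python
import re


def remove_tags(text):
    """Strip backslash tags, then tidy the whitespace they leave behind."""
    # Step 1: remove a backslash and the word characters that follow it.
    without_tags = re.sub(r"\\\w+", "", text)
    # Step 2: collapse runs of two or more spaces into a single space.
    collapsed = re.sub(r" {2,}", " ", without_tags)
    # Step 3: drop spaces left directly before punctuation.
    # Assumption: the text has no intentional spaces before punctuation.
    return re.sub(r" +([.,;:!?])", r"\1", collapsed).strip()


s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
print(remove_tags(s))
```

The clean-up steps touch the whole string, not just the places where tags were removed, so they may be too aggressive for text where spacing matters.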

CodePudding user response：

I tried this and it worked fine for me:

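def remover(text, state):
    if state:
        # Take the text after the first backslash, up to the next space.
        removable = text.split("\\")[1]
        removable = removable.split(" ")[0]
        # Rebuild the full tag, including its trailing space.
        removable = "\\" + removable + " "
        print(removable)  # show which tag is being removed
        text = text.replace(removable, "")
        # Keep looping while any backslash remains.
        state = "\\" in text
    return text, state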


text = "hello \\I'm new here \\good luck"
state = True
while state:
    text, state = remover(text, state)
print(text)