As part of preprocessing my data, I want to replace anything that starts with a backslash, up to the next space, with an empty string. For example, \fs24 needs to be replaced with an empty string, and so does \qc23424. There can be multiple occurrences of such backslash tags, all of which I want to remove. I have created a "tags to be eradicated" list which I aim to use in a regular expression to clean the extracted text.
Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.
Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.
I am using the regular expression based replace function in Python:
udpated = re.sub(r'/\fs\d+ ', '')
However, this does not produce the desired result. Alternatively, I have built an eradicate list and replace each entry in a loop, working from the highest number down to the lowest, but this is a performance killer.
CodePudding user response:
Assuming a 'tag' can also occur at the very beginning of your string, and to avoid selecting false positives, maybe you could use:
\s?(?<!\S)\\[a-z\d]+
And replace with nothing.
\s? - Optionally match a whitespace character (if a tag is mid-string and therefore preceded by a space);
(?<!\S) - Assert position is not preceded by a non-whitespace character (to allow a position at the start of your input);
\\ - A literal backslash;
[a-z\d]+ - 1+ (greedy) characters from the given class.
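To make the pattern above concrete, here is a minimal Python sketch. The helper name strip_tags and the extra sample strings are assumptions added for illustration; only the pattern itself comes from the answer.

```python
import re

# Optional leading whitespace, a lookbehind that rejects a preceding
# non-whitespace character, then a backslash and lowercase letters/digits.
TAG = re.compile(r"\s?(?<!\S)\\[a-z\d]+")

def strip_tags(text):
    """Remove backslash tags that begin at the start of the string or after whitespace."""
    return TAG.sub("", text)

sample = r"This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."
print(strip_tags(sample))

# A tag at the very start is removed too; the space that followed it remains.
print(repr(strip_tags(r"\fs24 at the start")))

# A backslash glued to preceding text is treated as a false positive and kept.
print(strip_tags(r"C:\temp stays"))
```

Because the optional whitespace sits in front of the tag, a tag directly followed by punctuation (as in "tags \qc23424.") leaves no stray space behind.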
CodePudding user response:
First, the / doesn't belong in the regular expression at all.
Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+ '.) The \ before f is meant to be used literally; the \ before d is part of \d, the character class matching digits.
Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.
>>> re.sub(r'\\fs\d+ ', '', r"This is a string \fs24 and it contains...")
'This is a string and it contains...'
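The escaping rules described above can be checked directly. The following is a small sketch; the variable names are assumptions chosen for illustration.

```python
import re

raw_pattern = r"\\fs\d+ "      # raw literal: the regex engine sees \\fs\d+ and a space
plain_pattern = "\\\\fs\\d+ "  # the same regex written without the r prefix

# Both literals produce the identical pattern string.
assert raw_pattern == plain_pattern

text = r"This is a string \fs24 and it contains..."
cleaned = re.sub(raw_pattern, "", text)  # pattern, replacement, input string
print(cleaned)

# Leaving out the third argument, as in the question, raises a TypeError.
try:
    re.sub(raw_pattern, "")
except TypeError as exc:
    missing_argument_error = exc
    print(type(exc).__name__)
```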
CodePudding user response:
Does that work for you?
re.sub(
    r"\\\w+\s*",  # a backslash followed by alphanumerics and optional spacing;
    '',           # replace it with an empty string;
    input_string  # in your input string
)
>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'
CodePudding user response:
'\\' matches a literal '\' and '\w+' matches the word characters that follow it, up to the next non-word character such as a space:
import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)
output:
'This is a string  and it contains some texts and tags . which I want to remove from my string.'
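Removing only the tag leaves the surrounding spaces in place, which is why the output above still has a space before the period. One possible follow-up, sketched here as an assumption rather than part of the answer, is a second pass that tidies the leftover whitespace. The variable names and the punctuation set are illustrative choices.

```python
import re

s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""

# Step 1: drop each backslash tag, leaving its neighbouring spaces untouched.
without_tags = re.sub(r"\\\w+", "", s)

# Step 2: collapse runs of spaces into a single space.
single_spaced = re.sub(r" {2,}", " ", without_tags)

# Step 3: remove whitespace left directly in front of punctuation.
tidy = re.sub(r"\s+([.,;:!?])", r"\1", single_spaced)

print(tidy)
```

Keeping the steps separate makes it easy to drop the punctuation pass if the data legitimately contains spaces before punctuation.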
CodePudding user response:
I tried this and it worked fine for me:
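def remover(text, state):
    # Take the text after the first backslash, up to the next space.
    removable = text.split("\\")[1]
    removable = removable.split(" ")[0]
    # Rebuild the tag with its backslash and trailing space.
    # This assumes every tag is followed by a space; a tag at the very end
    # of the string would never be replaced and the loop would not stop.
    removable = "\\" + removable + " "
    text = text.replace(removable, "")
    # Keep looping while any backslash is left.
    state = "\\" in text
    return text, state

text = "hello \\I'm new here \\good luck"
state = True
while state:
    text, state = remover(text, state)
print(text)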
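As an alternative to replacing one tag per loop iteration, a list of known tags can be compiled into a single pattern and removed in one pass. This is only a sketch: the list contents, the helper name remove_listed_tags, and the sample text are assumptions, not taken from the question's actual data.

```python
import re

# Hypothetical "tags to be eradicated" list; the names are assumptions.
tags_to_remove = ["fs24", "qc23424", "par"]

# Escape each name so it is matched literally, then join into one alternation.
alternation = "|".join(re.escape(tag) for tag in tags_to_remove)

# A backslash, one of the listed names, no further word character
# (so \par does not match inside \pard), and an optional trailing space.
pattern = re.compile(r"\\(?:" + alternation + r")(?!\w) ?")

def remove_listed_tags(text):
    """Strip every listed tag from text in a single regex pass."""
    return pattern.sub("", text)

text = r"hello \fs24 new here \qc23424 luck \pard stays"
print(remove_listed_tags(text))
```

Compiling the pattern once and reusing it avoids rescanning the whole string for each tag, which is the cost the question describes for the loop-based approach.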