As part of preprocessing my data, I want to replace everything from a backslash up to the next space with an empty string. For example, \fs24 or \qc23424 should be replaced with nothing. A text can contain multiple such backslash tags, all of which I want to remove. I have created a list of tags to be eradicated, which I aim to use in a regular expression to clean the extracted text.
Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.
Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.
I am using the regular-expression-based replace function in Python:
updated = re.sub(r'/\fs\d ', '')
However, this does not produce the desired result. Alternatively, I have built an eradicate list and replace each entry in a loop, from the highest number down to the lowest, but this is a performance killer.
CodePudding user response:
Assuming a 'tag' can also occur at the very beginning of your string, and to avoid matching false positives, you could use:
\s?(?<!\S)\\[a-z\d]+
and replace each match with an empty string.
\s? - Optionally match a whitespace character (if a tag is mid-string and therefore preceded by a space);
(?<!\S) - Assert that the position is not preceded by a non-whitespace character (this still allows a position at the start of your input);
\\ - A literal backslash;
[a-z\d]+ - One or more (greedy) characters from the given class.
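As a minimal sketch of how this pattern could be applied in Python to the input from the question (the names TAG and strip_tags are hypothetical, chosen only for illustration):

```python
import re

# Optional leading whitespace, a backslash that is not preceded by a
# non-whitespace character, then one or more lowercase letters or digits.
TAG = re.compile(r"\s?(?<!\S)\\[a-z\d]+")

def strip_tags(text):
    """Remove backslash-led tags together with the single space before them."""
    return TAG.sub("", text)

sample = (r"This is a string \fs24 and it contains some texts and tags "
          r"\qc23424. which I want to remove from my string.")
print(strip_tags(sample))
# This is a string and it contains some texts and tags. which I want to remove from my string.

# The backslash here follows ':', a non-whitespace character, so nothing is removed.
print(strip_tags(r"C:\temp stays"))
```

Because the preceding space is consumed with the tag, no double spaces are left behind mid-string. A tag at the very start of the input would still leave the space that follows it.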
CodePudding user response:
First, the / doesn't belong in the regular expression at all.
Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+ '.) The \ before f is meant to be used literally; the \ before d is part of the \d shorthand class that matches a digit.
Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.
>>> re.sub(r'\\fs\d+ ', '', r"This is a string \fs24 and it contains...")
'This is a string and it contains...'
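To make the escaping point concrete, here is a small sketch comparing the raw and non-raw spellings of the same pattern (the variable names are hypothetical):

```python
import re

# Raw literal: the backslashes reach the regex engine exactly as typed.
raw_pattern = r"\\fs\d+ "
# Ordinary literal: every backslash has to be doubled once more for Python.
plain_pattern = "\\\\fs\\d+ "

# Both literals produce the same pattern string.
print(raw_pattern == plain_pattern)  # True

text = r"This is a string \fs24 and it contains..."
# sub(pattern, replacement, string): all three arguments are required.
print(re.sub(raw_pattern, "", text))  # This is a string and it contains...
```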
CodePudding user response:
Does this work for you?
re.sub(
    r"\\\w+\s*",  # a backslash followed by word characters and any trailing whitespace;
    '',  # replace it with an empty string;
    input_string  # in your input string
)
>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'
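The last two results show that a word glued directly to a tag is removed along with it. If that is not wanted, one possible variation is sketched below. It assumes that every tag is a backslash, some letters, and then optional digits; that shape is an assumption based on the two examples in the question, and the names NARROW_TAG and strip_narrow are hypothetical.

```python
import re

# Assumed tag shape: backslash, letters, optional digits, then any whitespace.
NARROW_TAG = re.compile(r"\\[A-Za-z]+\d*\s*")

def strip_narrow(text):
    """Remove tags but stop at the end of the digits, keeping glued-on words."""
    return NARROW_TAG.sub("", text)

print(strip_narrow(r"\fs24 hello there"))          # hello there
print(strip_narrow(r"\fs24hello there"))           # hello there
print(strip_narrow(r"\fs24hello \qc23424 there"))  # hello there
```

This only helps when the text after a tag starts with a letter; digits that directly follow a tag would still be swallowed.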
CodePudding user response:
In the pattern, \\ matches a literal backslash and \w+ matches the run of word characters that follows it. The spaces around a removed tag are left in place.
import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)
output:
'This is a string  and it contains some texts and tags . which I want to remove from my string.'
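Since removing only the tag leaves a stray space before the period, a follow-up cleanup pass may be useful. The sketch below is one way to do that; the function name strip_and_tidy and the choice of punctuation marks are assumptions, and the punctuation step would also change text that legitimately has a space before punctuation.

```python
import re

def strip_and_tidy(text):
    """Remove backslash tags, then tidy the whitespace they leave behind."""
    # Remove each backslash together with the word characters after it.
    text = re.sub(r"\\\w+", "", text)
    # Collapse runs of spaces left where a tag used to be.
    text = re.sub(r" {2,}", " ", text)
    # Drop spaces stranded in front of punctuation.
    text = re.sub(r" +([.,;:!?])", r"\1", text)
    return text.strip()

s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
print(strip_and_tidy(s))
# This is a string and it contains some texts and tags. which I want to remove from my string.
```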
CodePudding user response:
I tried this and it worked fine for me:
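def remover(text, state):
    if state:
        # Take the text after the first backslash, up to the next space.
        removable = text.split("\\")[1]
        removable = removable.split(" ")[0]
        # Rebuild the tag with its backslash and trailing space.
        removable = "\\" + removable + " "
        print(removable)
        text = text.replace(removable, "")
        # Keep looping while any backslash remains. This assumes every tag
        # is followed by a space; otherwise the loop never ends.
        state = "\\" in text
    return text, state

text = "hello \\I'm new here \\good luck"
state = True
while state:
    text, state = remover(text, state)
print(text)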
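For comparison, the list-of-tags idea from the question can also be handled in a single pass with one precompiled alternation instead of a loop. This is only a sketch: the tag names in TAG_NAMES, the optional digits after each name, and the helper names are all assumptions.

```python
import re

# Hypothetical list of tag names to remove, without the leading backslash.
TAG_NAMES = ["fs", "qc"]

# Longest names first, so that a short name cannot match the start of a
# longer one; re.escape protects any regex metacharacters in a name.
_alternatives = "|".join(
    re.escape(name) for name in sorted(TAG_NAMES, key=len, reverse=True)
)
# Optional preceding whitespace, backslash, a listed name, optional digits.
KNOWN_TAGS = re.compile(r"\s?\\(?:" + _alternatives + r")\d*")

def remove_known_tags(text):
    """Delete every listed tag and one preceding whitespace character."""
    return KNOWN_TAGS.sub("", text)

sample = (r"This is a string \fs24 and it contains some texts and tags "
          r"\qc23424. which I want to remove from my string.")
print(remove_known_tags(sample))
# This is a string and it contains some texts and tags. which I want to remove from my string.
```

The pattern is compiled once and the string is scanned once, regardless of how many names are in the list. Tags that are not listed are left untouched.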