I have a transcription file in which some words start with &=. My aim is to remove all of these words from the file.
For example:
&=laughs How are &=breaths you?
I want the output to be:
How are you?
Can someone please help me with this? I am new to regex and still learning.
CodePudding user response:
\S+ can match the rest of the word and \s* the trailing spaces. This gives the pattern &=\S+\s*; with re.sub, you get (demo):
import re
s = '&=laughs How are &=breaths you?'
re.sub(r'&=\S+\s*', '', s)
Output: 'How are you?'
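One edge case is worth a quick sketch: when a marker is the last word on a line, the space before it is left behind, because the pattern only consumes whitespace after the word. The helper name strip_markers and the sample lines below are made up for illustration; this is just one way to tidy that up, assuming you process the file line by line.

```python
import re

# "&=", then the rest of the word, then any trailing whitespace
MARKER = re.compile(r'&=\S+\s*')

def strip_markers(line):
    """Remove every &=word from one line and trim leftover edge whitespace."""
    return MARKER.sub('', line).strip()

# Sample lines are invented; the second one ends with a marker.
lines = ['&=laughs How are &=breaths you?', 'Fine, thanks. &=coughs']
cleaned = [strip_markers(line) for line in lines]
print(cleaned)  # ['How are you?', 'Fine, thanks.']
```

Without the .strip() call, the second line would come back as 'Fine, thanks. ' with a trailing space.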
If you really only want to delete words that start with &= but not those that contain it elsewhere, there are a few options, depending on the exact context you want to allow or disallow.
Python supports the non-word-boundary anchor, \B (see also word boundaries on Regular-Expressions.info), which can be used to match only those non-verbals that don't occur within a word (demo):
\B&=\S+\s*
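To see what \B changes, here is a small comparison of the plain pattern and the anchored one. The test string is invented; the point is only that an &= directly after a word character is left alone by the anchored version.

```python
import re

plain = re.compile(r'&=\S+\s*')
anchored = re.compile(r'\B&=\S+\s*')  # \B: no word boundary right before "&"

s = 'x&=y stays, &=laughs goes'

# The plain pattern also eats the "&=y " glued to "x".
print(plain.sub('', s))     # xstays, goes

# With \B, "x&=y" survives because "x" followed by "&" is a word boundary.
print(anchored.sub('', s))  # x&=y stays, goes
```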
For more restrictive matching, you can use positive lookbehinds (also on Regular-Expressions.info) before the pattern. Python's re module requires each lookbehind to match a fixed width, so alternatives of different lengths can't share one lookbehind. For example, you can't have:
(?<=^|[\s.?!]|--)
But you can use:
(?:^|(?<=[\s.?!])|(?<=--))
(demo)
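As a rough check of both claims, the sketch below first tries to compile the variable-width lookbehind (which the standard re module rejects) and then applies the split-up version to a few invented inputs.

```python
import re

# The single lookbehind mixes widths 0 (^), 1 ([\s.?!]) and 2 (--),
# so re refuses to compile it.
try:
    re.compile(r'(?<=^|[\s.?!]|--)&=\S+\s*')
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False

# One alternative per context, each lookbehind with its own fixed width.
pattern = re.compile(r'(?:^|(?<=[\s.?!])|(?<=--))&=\S+\s*')

cases = ['&=laughs hi', 'hi! &=laughs', 'wait--&=coughs ok', 'a&=b']
results = [pattern.sub('', c) for c in cases]
print(results)  # ['hi', 'hi! ', 'wait--ok', 'a&=b']
```

Note that ^ here only matches at the start of the string; pass re.MULTILINE if markers at the start of each line of a multi-line string should count too.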
You could also use negative lookbehinds, but with care (see "Regular expression negative lookbehind for multiple values", "Multiple negative lookbehind assertions in python regex?"). For example, (?<![A-Za-z]) is similar to using \B, except that it doesn't take non-ASCII Unicode letters into account (demo). To use multiple negative lookbehinds of differing lengths, use concatenation rather than alternation (demo):
(?<![A-Za-z])(?<![^-]-)&=\S+\s*
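The following sketch puts the concatenated negative lookbehinds next to \B on a few made-up inputs. It shows the two differences mentioned above: a single hyphen before &= blocks the lookbehind version but not \B, and a non-ASCII letter before &= blocks \B but not the ASCII-only lookbehind.

```python
import re

# Not preceded by an ASCII letter, and not preceded by a lone hyphen
# (a non-hyphen character followed by "-").
ascii_lb = re.compile(r'(?<![A-Za-z])(?<![^-]-)&=\S+\s*')
non_boundary = re.compile(r'\B&=\S+\s*')

cases = ['a&=b', 'well-&=x ok', 'well--&=x ok', 'caf\u00e9&=x']

lb_results = [ascii_lb.sub('', c) for c in cases]
nb_results = [non_boundary.sub('', c) for c in cases]

print(lb_results)  # ['a&=b', 'well-&=x ok', 'well--ok', 'café']
print(nb_results)  # ['a&=b', 'well-ok', 'well--ok', 'café&=x']
```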
CodePudding user response:
You can also do it with just split and join:
s = "&=laughs How are &=breaths you?"
' '.join([w for w in s.split(" ") if "&=" not in w])
How are you?
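Be aware that the "&=" in w test also drops words that merely contain &= somewhere in the middle. If only words that start with it should go, a str.startswith check is one option. The function name drop_markers and its prefix parameter are made up for this sketch.

```python
def drop_markers(text, prefix="&="):
    """Keep only the whitespace-separated words that do not start with prefix."""
    return " ".join(w for w in text.split() if not w.startswith(prefix))

first = drop_markers("&=laughs How are &=breaths you?")
second = drop_markers("x&=y stays &=laughs")

print(first)   # How are you?
print(second)  # x&=y stays
```

Calling split() with no argument splits on any run of whitespace, so repeated spaces and line breaks are collapsed to single spaces in the result. If the original spacing matters, the regex approaches are a better fit.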