Regex expression to remove words starting with &=

Time:08-18

I have a transcription file in which some words start with &=. My aim is to remove all of these words from the file.

For example:

&=laughs How are &=breaths you?

I want the output to be:

How are you?

Can someone please help me with this? I am new to regex and still learning.

CodePudding user response:

\S+ matches the word itself and \s* matches any trailing spaces. This gives the pattern &=\S+\s*; with re.sub, you get:

import re

s = '&=laughs How are &=breaths you?'
re.sub(r'&=\S+\s*', '', s)

Output: 'How are you?'

If you really only want to delete words that start with &=, and not those that contain it further in, there are a few options, depending on the exact context you want to allow or disallow.

Python supports the non-word-boundary anchor, \B (see also word boundaries on Regular-Expressions.info), which can be used to match only those non-verbal markers that are not immediately preceded by a word character, i.e. that don't occur within a word:

\B&=\S+\s*
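To make the difference concrete, here is a small sketch comparing the plain pattern with the \B version. The sample string and variable names are made up for illustration and are not from the question.

```python
import re

# Made-up sample: "a&=b" has "&=" inside a token, "&=laughs" starts one.
sample = 'a&=b &=laughs hi'

# Plain pattern: removes "&=..." wherever it occurs, even mid-token.
plain = re.sub(r'&=\S+\s*', '', sample)

# With \B: "&" is a non-word character, so \B only succeeds when the
# character before it is also a non-word character (or the string start).
anchored = re.sub(r'\B&=\S+\s*', '', sample)

print(plain)     # ahi
print(anchored)  # a&=b hi
```

The \B version leaves a&=b alone because there is a word boundary between a and &.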

For more restrictive matching, you can use positive lookbehinds (also on Regular-Expressions.info) before the pattern. Python's re module requires lookbehind patterns to have a fixed width, so alternatives of different lengths can't share a single lookbehind. For example, you can't have:

(?<=^|[\s.?!]|--)

But you can use:

(?:^|(?<=[\s.?!])|(?<=--))
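As a rough sketch of how this behaves with the standard re module: the split-up prefix compiles and works, while the single variable-width lookbehind is rejected when the pattern is compiled. The sample text and variable names below are assumptions for illustration only.

```python
import re

# Allowed contexts: start of string, after whitespace or . ? !, or after "--".
prefix = r'(?:^|(?<=[\s.?!])|(?<=--))'
pattern = re.compile(prefix + r'&=\S+\s*')

# Made-up sample: "&=" after "--", after a space, and inside "x&=y".
sample = 'Well--&=laughs ok. &=sighs x&=y'
cleaned = pattern.sub('', sample)
print(cleaned)  # Well--ok. x&=y

# The one-lookbehind form mixes widths 0, 1 and 2, so compiling it fails.
try:
    re.compile(r'(?<=^|[\s.?!]|--)&=\S+\s*')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print('rejected:', exc)
```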


You could also use negative lookbehinds, but with care (see "Regular expression negative lookbehind for multiple values" and "Multiple negative lookbehind assertions in python regex?"). For example, (?<![A-Za-z]) is similar to using \B, except that it ignores digits, underscores and non-ASCII Unicode letters, all of which \B treats as word characters. To use multiple negative lookbehinds of differing lengths, use concatenation rather than alternation:

(?<![A-Za-z])(?<![^-]-)&=\S+\s*
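A short sketch of what the concatenated lookbehinds accept and reject, plus the non-ASCII difference from \B. The test strings are invented for illustration.

```python
import re

neg = re.compile(r'(?<![A-Za-z])(?<![^-]-)&=\S+\s*')

# Double hyphen before "&=": neither lookbehind objects, so it is removed.
after_double = neg.sub('', 'well--&=laughs ok')
# Single hyphen before "&=": (?<![^-]-) blocks the match.
after_single = neg.sub('', 'well-&=laughs ok')
# ASCII letter before "&=": (?<![A-Za-z]) blocks the match.
after_letter = neg.sub('', 'well&=laughs ok')

# Non-ASCII letter before "&=": the ASCII class does not see it as a
# letter, while \B does (str patterns use Unicode word characters).
ascii_class = re.sub(r'(?<![A-Za-z])&=\S+\s*', '', 'caf\u00e9&=x ok')
non_boundary = re.sub(r'\B&=\S+\s*', '', 'caf\u00e9&=x ok')

print(after_double)  # well--ok
print(after_single)  # well-&=laughs ok
print(after_letter)  # well&=laughs ok
```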

CodePudding user response:

You can also do this with just split and join, no regex needed.

s = "&=laughs How are &=breaths you?"

' '.join(w for w in s.split(" ") if "&=" not in w)

How are you?
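Two caveats with this approach: a membership test on "&=" drops any token that merely contains &=, and split(" ") produces empty strings when spaces repeat. A possible variant, using a hypothetical helper name strip_markers, checks the start of each token and splits on any whitespace (which also collapses runs of spaces and newlines into single spaces):

```python
def strip_markers(text, marker='&='):
    """Drop whitespace-separated tokens that start with `marker`."""
    return ' '.join(w for w in text.split() if not w.startswith(marker))

print(strip_markers('&=laughs How are &=breaths you?'))  # How are you?
print(strip_markers('a&=b stays,  &=coughs goes'))       # a&=b stays, goes
```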