How to remove a specific pattern from re.findall() results

Time:04-07

I have a re.findall() call searching for a pattern in Python, but it returns some undesired results and I want to know how to exclude them. The text is below. I want to get the names, and my statement (re.findall(r'([A-Z]{4,} \w. \w*|[A-Z]{4,} \w*)', text)) is returning this:

['JINA L. CHOI',
 'ERIN E. SCHNEIDER',
 'MONIQUE C. WINKLER',
 'JASON M. HABERMEYER',
 'MARC D. KATZ',
 'JESSICA W. CHAN',
 'RAHUL KOLHATKAR',
 'TSPU or taken',
 'TSPU or the',
 'TSPU only',
 'TSPU was',
 'TSPU and']

I want to get rid of the "TSPU" pattern items. Does anyone know how to do it?

JINA L. CHOI (NY Bar No. 2699718)

ERIN E. SCHNEIDER (Cal. Bar No. 216114) [email protected]

MONIQUE C. WINKLER (Cal. Bar No. 213031) [email protected]

JASON M. HABERMEYER (Cal. Bar No. 226607) [email protected]

MARC D. KATZ (Cal. Bar No. 189534) [email protected]

JESSICA W. CHAN (Cal. Bar No. 247669) [email protected]

RAHUL KOLHATKAR (Cal. Bar No. 261781) [email protected]

  1. The Investor Solicitation Process Generally Included a Face-to-Face Meeting, a Technology Demonstration, and a Binder of Materials [...]

CodePudding user response:

You can do some simple filtering. If your list is called results:

removed_TSPU_results = list(filter(lambda result: not result.startswith("TSPU"), results))
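The same post-filtering idea can also be written as a list comprehension. A minimal sketch, assuming `results` holds the output of the original re.findall() call; the list here is a shortened sample copied from the question, not the full output:

```python
# Shortened sample of the list shown in the question (assumption: the real
# `results` comes from the re.findall() call on the full text).
results = [
    'ERIN E. SCHNEIDER',
    'RAHUL KOLHATKAR',
    'TSPU or taken',
    'TSPU only',
]

# Keep every match that does not begin with "TSPU".
removed_TSPU_results = [r for r in results if not r.startswith("TSPU")]

print(removed_TSPU_results)
# ['ERIN E. SCHNEIDER', 'RAHUL KOLHATKAR']
```

This leaves the original regex untouched and only cleans up its output, so it is the easier option when TSPU is the only unwanted prefix.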

CodePudding user response:

You can use

\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?

Details:

  • \b - a word boundary (else, the regex may "catch" a part of a word that contains TSPU)
  • (?!TSPU\b) - a negative lookahead that fails the match if the string TSPU, followed by a non-word char or the end of the string, appears immediately to the right of the current location
  • [A-Z]{4,} - four or more uppercase ASCII letters
  • (?:(?:\s+\w\.)?\s+\w+)? - an optional occurrence of:
    • (?:\s+\w\.)? - an optional occurrence of one or more whitespaces, a word char and a literal . char
    • \s+ - one or more whitespaces
    • \w+ - one or more word chars.

In Python, you can use

re.findall(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?', text)
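
A small self-contained check of this approach, with the pattern written using the + quantifiers that the description calls for ("one or more whitespaces", "one or more word chars"). The `text` below is an assumption for illustration: three lines taken from the question plus one made-up sentence containing TSPU, not the full document.

```python
import re

# Illustrative sample only: three lines from the question and one
# hypothetical sentence that mentions TSPU.
text = (
    "JINA L. CHOI (NY Bar No. 2699718)\n"
    "ERIN E. SCHNEIDER (Cal. Bar No. 216114)\n"
    "RAHUL KOLHATKAR (Cal. Bar No. 261781)\n"
    "The TSPU was only a prototype."
)

# \b(?!TSPU\b)   word boundary, then refuse to start on the whole word TSPU
# [A-Z]{4,}      first name: four or more uppercase ASCII letters
# (?:...)?       optional middle initial ("L.") followed by a last name
pattern = re.compile(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?')

# The pattern has no capturing groups, so findall returns whole matches.
names = pattern.findall(text)

print(names)
# ['JINA L. CHOI', 'ERIN E. SCHNEIDER', 'RAHUL KOLHATKAR']
```

Note that \s+ also matches newlines, so on other inputs a match could span two lines; replacing \s+ with a literal space or [ \t]+ would be one way to prevent that.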