I have a re.findall() call searching for a pattern in Python, but it returns some undesired results and I want to know how to exclude them. The text is below. I want to get the names, and my statement, re.findall(r'([A-Z]{4,} \w. \w*|[A-Z]{4,} \w*)', text), is returning this:
['ERIN E. SCHNEIDER',
'MONIQUE C. WINKLER',
'JASON M. HABERMEYER',
'MARC D. KATZ',
'JESSICA W. CHAN',
'RAHUL KOLHATKAR',
'TSPU or taken',
'TSPU or the',
'TSPU only',
'TSPU was',
'TSPU and']
I want to get rid of the "TSPU" pattern items. Does anyone know how to do it?
JINA L. CHOI (NY Bar No. 2699718)
ERIN E. SCHNEIDER (Cal. Bar No. 216114) [email protected]
MONIQUE C. WINKLER (Cal. Bar No. 213031) [email protected]
JASON M. HABERMEYER (Cal. Bar No. 226607) [email protected]
MARC D. KATZ (Cal. Bar No. 189534) [email protected]
JESSICA W. CHAN (Cal. Bar No. 247669) [email protected]
RAHUL KOLHATKAR (Cal. Bar No. 261781) [email protected]
- The Investor Solicitation Process Generally Included a Face-to-Face Meeting, a Technology Demonstration, and a Binder of Materials [...]
CodePudding user response:
You can do some simple filtering with filter(). If your list is called results:
removed_TSPU_results = list(filter(lambda result: not result.startswith("TSPU"), results))
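The same post-filtering idea can also be written as a list comprehension, which many people find easier to read than filter() with a lambda. This is only a sketch: the results list below is an abridged copy of the output shown in the question, and the variable names are placeholders.

```python
# Abridged copy of the findall() output from the question.
results = [
    'ERIN E. SCHNEIDER',
    'RAHUL KOLHATKAR',
    'TSPU or taken',
    'TSPU only',
]

# Keep only the entries that do not start with the unwanted prefix.
names = [r for r in results if not r.startswith("TSPU")]
print(names)
```

This leaves the original regex untouched and simply discards the unwanted matches afterwards.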
CodePudding user response:
You can use
\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?
Details:
\b - a word boundary (otherwise the regex may "catch" part of a longer word that contains TSPU)
(?!TSPU\b) - a negative lookahead that fails the match if TSPU, followed by a non-word char or the end of the string, appears immediately to the right of the current location
[A-Z]{4,} - four or more uppercase ASCII letters
(?:(?:\s+\w\.)?\s+\w+)? - an optional occurrence of:
  (?:\s+\w\.)? - an optional occurrence of one or more whitespaces, a word char and a literal . char
  \s+ - one or more whitespaces
  \w+ - one or more word chars.
In Python, you can use
re.findall(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?', text)
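Here is a small self-contained sketch of the lookahead approach. The quantifiers are written as \s+ and \w+, matching the "one or more" descriptions in the details above. The sample text is an abridged version of the question's input, and the final line containing TSPU is made up for illustration, since the question elides that part of the document.

```python
import re

# Abridged sample; the last line is a hypothetical stand-in for the
# elided body text that mentions TSPU.
text = """JINA L. CHOI (NY Bar No. 2699718)
ERIN E. SCHNEIDER (Cal. Bar No. 216114)
RAHUL KOLHATKAR (Cal. Bar No. 261781)
the TSPU or taken apart; TSPU only worked"""

# \b(?!TSPU\b) rejects a match that would start at the whole word TSPU.
# The pattern has no capturing groups, so findall() returns full matches.
pattern = re.compile(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?')

names = pattern.findall(text)
print(names)
```

Because the exclusion happens inside the pattern, no second filtering pass over the results is needed.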