python - Match Everything except the string regex-CodePudding

Data Set

Cider
631

Spruce
871

Honda
18813

Nissan
3292

Pine
10621

Walnut
10301

Code

#!/usr/bin/python
import re

text = "Cider\n631\n\nSpruce\n871\n\nHonda\n18813\n\nNissan\n3292\n\nPine\n10621\n\nWalnut\n10301\n\n"

f1 = re.findall(r"(Cider|Pine)\n(.*)",text)

print(f1)

Current Result

[('Cider', '631'), ('Pine', '10621')]

Question:

How do I change the regex so that it matches everything except several specified strings, e.g. (Honda|Nissan)?

Desired Result

[('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]

CodePudding user response：

A caret '^' only inverts a match inside a character class, for example [^0-9]. Anywhere else in a pattern it is an anchor, so this does not invert the group:

f1 = re.findall(r"(\s?^(Cider|Pine))\n(.*)",text)

Keep in mind that the caret keeps its anchor meaning wherever it appears outside a character class: "start of the string", or "start of a line" when re.MULTILINE is set. Putting an optional whitespace character (\s?) in front of it does not change that, so the pattern above still matches Cider and, without re.MULTILINE, only at the very beginning of the text.

To exclude whole words, use a negative lookahead (?!...) at the start of the line instead:

f1 = re.findall(r"^(?!Honda|Nissan)([A-Za-z].*)\n(.*)", text, re.MULTILINE)
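
As a quick sanity check on how the caret behaves, the sketch below runs the pattern from this answer against the question's data (with the blank line after 871 restored, an assumption based on the data set shown in the question) and compares it with a caret inside a character class. The variable names are illustrative only.

```python
import re

text = "Cider\n631\n\nSpruce\n871\n\nHonda\n18813\n\nNissan\n3292\n\nPine\n10621\n\nWalnut\n10301\n\n"

# Outside a character class, ^ is an anchor, not a negation.
# Without re.MULTILINE it can only match at the very start of the text.
anchored = re.findall(r"(\s?^(Cider|Pine))\n(.*)", text)

# Inside a character class, ^ negates the set of single characters:
# here "one or more characters that are neither digits nor newlines".
non_digits = re.findall(r"[^\d\n]+", "Pine\n10621")

print(anchored)    # [('Cider', 'Cider', '631')]
print(non_digits)  # ['Pine']
```

The first call returns only the Cider entry, and nothing is inverted; the second shows the one place where a caret does mean "not", and there it applies to single characters rather than whole words.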

CodePudding user response：

You can exclude lines that consist solely of one of the names, or solely of digits, and then match two consecutive lines where the first one starts with a non-whitespace char.

^(?!(?:Honda|Nissan|\d+)$)(\S.*)\n(.*)

The pattern matches:

^ Start of the line (with re.MULTILINE; otherwise start of the string only)
(?! Negative lookahead, assert that what is directly to the right is not
- (?:Honda|Nissan|\d+)$ Any of the alternatives, followed by the end of the line
) Close lookahead
(\S.*) Capture group 1, match a non-whitespace char followed by the rest of the line
\n Match a newline
(.*) Capture group 2, match the rest of the next line (any characters except a newline)

import re

text = ("Cider\n"
            "631\n\n"
            "Spruce\n"
            "871\n\n"
            "Honda\n"
            "18813\n\n"
            "Nissan\n"
            "3292\n\n"
            "Pine\n"
            "10621\n\n"
            "Walnut\n"
            "10301")
f1 = re.findall(r"^(?!(?:Honda|Nissan|\d+)$)(\S.*)\n(.*)", text, re.MULTILINE)

print(f1)

Output

[('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]
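
To see why the lookahead also excludes digit-only lines, here is a small sketch that drops the \d+ alternative and runs the otherwise identical pattern on the same text. The name without_digit_rule is only illustrative.

```python
import re

text = ("Cider\n631\n\n"
        "Spruce\n871\n\n"
        "Honda\n18813\n\n"
        "Nissan\n3292\n\n"
        "Pine\n10621\n\n"
        "Walnut\n10301")

# Same pattern, but the lookahead only rejects the two names.
without_digit_rule = re.findall(
    r"^(?!(?:Honda|Nissan)$)(\S.*)\n(.*)", text, re.MULTILINE
)

# The number lines under Honda and Nissan are no longer skipped:
# each one is captured as group 1 and paired with the empty line after it.
print(without_digit_rule)
# [('Cider', '631'), ('Spruce', '871'), ('18813', ''), ('3292', ''),
#  ('Pine', '10621'), ('Walnut', '10301')]
```

Skipping a name line leaves its number line as the next candidate for group 1, which is why the pattern has to reject digit-only lines as well.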

If the line should start with an uppercase char A-Z and the next line should consist of only digits:

^(?!Honda|Nissan)([A-Z].*)\n(\d+)$

This pattern matches:

^ Start of the line (with re.MULTILINE)
(?!Honda|Nissan) Negative lookahead, assert that the line does not start with Honda or Nissan
([A-Z].*) Capture group 1, match an uppercase char A-Z followed by the rest of the line
\n Match a newline
(\d+) Capture group 2, match 1 or more digits
$ End of the line (with re.MULTILINE)
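
A minimal sketch of this stricter pattern on the question's data, assuming re.MULTILINE is passed as in the earlier example. The pairs and totals names are hypothetical additions, included to show one way the captured groups might be used.

```python
import re

text = ("Cider\n631\n\n"
        "Spruce\n871\n\n"
        "Honda\n18813\n\n"
        "Nissan\n3292\n\n"
        "Pine\n10621\n\n"
        "Walnut\n10301")

# Name line must start with A-Z and must not start with Honda or Nissan;
# the following line must consist of digits only.
pairs = re.findall(r"^(?!Honda|Nissan)([A-Z].*)\n(\d+)$", text, re.MULTILINE)
print(pairs)
# [('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]

# Optional: turn the string pairs into a name -> number mapping.
totals = {name: int(value) for name, value in pairs}
print(totals["Pine"])  # 10621
```

Because this lookahead has no $ inside it, it rejects any line that begins with Honda or Nissan (for example "Hondas"), not just lines that equal those names exactly.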
