Data Set
Cider
631
Spruce
871
Honda
18813
Nissan
3292
Pine
10621
Walnut
10301
Code
#!/usr/bin/python
import re
text = "Cider\n631\n\nSpruce\n871\n\nHonda\n18813\n\nNissan\n3292\n\nPine\n10621\n\nWalnut\n10301\n\n"
f1 = re.findall(r"(Cider|Pine)\n(.*)",text)
print(f1)
Current Result
[('Cider', '631'), ('Pine', '10621')]
Question:
How do I change the regex so that it matches everything except several specified strings, e.g. (Honda|Nissan)?
Desired Result
[('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]
CodePudding user response:
The caret '^' inverts a match only inside a character class, for example [^0-9] for "any character that is not a digit".
Outside square brackets it is an anchor: it asserts the start of the string, or the start of each line when re.MULTILINE is set. Putting it in front of a group does not negate the group, and inserting an optional character before it does not change that, so a pattern like (\s?^(Cider|Pine))\n(.*) still looks for Cider or Pine instead of excluding them.
A character class negates single characters only, so it cannot exclude whole words such as Honda or Nissan. For that, use a negative lookahead such as (?!Honda|Nissan).
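A quick way to check what the caret does in Python's re module is to try both uses side by side. The sketch below is illustrative only; the sample strings and variable names are made up for the demonstration.

```python
import re

# Inside a character class, a leading caret negates the class:
# [^0-9]+ matches runs of characters that are not digits.
non_digits = re.findall(r"[^0-9]+", "ab12cd")
print(non_digits)  # ['ab', 'cd']

# Outside a class, the caret is an anchor, not a negation.
# With re.MULTILINE it asserts the start of each line.
sample = "Cider\nHonda\nWalnut"
anchored = re.findall(r"^(Cider|Pine)$", sample, re.MULTILINE)
print(anchored)  # ['Cider']  (still a positive match)

# Excluding a whole word needs a negative lookahead instead.
excluded = re.findall(r"^(?!Honda$)(.+)$", sample, re.MULTILINE)
print(excluded)  # ['Cider', 'Walnut']
```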
CodePudding user response:
You can use a negative lookahead to skip any line that consists only of one of the names or only of digits, and then match two consecutive lines, where the first one starts with a non-whitespace char.
^(?!(?:Honda|Nissan|\d+)$)(\S.*)\n(.*)
The pattern matches:
^
Start of the string, or of each line when re.MULTILINE is used
(?!
Negative lookahead, assert that what is directly to the right is not
(?:Honda|Nissan|\d+)$
Match any of the alternatives, followed by asserting the end of the line
)
Close lookahead
(\S.*)
Capture group 1, match a non-whitespace char followed by the rest of the line
\n
Match a newline
(.*)
Capture group 2, match the rest of the next line (any characters except a newline)
import re
text = ("Cider\n"
"631\n\n"
"Spruce\n"
"871\n\n"
"Honda\n"
"18813\n\n"
"Nissan\n"
"3292\n\n"
"Pine\n"
"10621\n\n"
"Walnut\n"
"10301")
f1 = re.findall(r"^(?!(?:Honda|Nissan|\d+)$)(\S.*)\n(.*)", text, re.MULTILINE)
print(f1)
Output
[('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]
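If the list of names to skip changes often, the same lookahead can be assembled from a Python list. This is only a sketch: the helper name pairs_excluding and the shortened sample text are hypothetical, not part of the question.

```python
import re

def pairs_excluding(text, excluded_names):
    """Return (name, value) pairs whose name is not in excluded_names.

    Hypothetical helper that builds the lookahead pattern from a list.
    """
    # re.escape guards against names that contain regex metacharacters.
    names = "|".join(re.escape(name) for name in excluded_names)
    # Skip lines that are exactly an excluded name or only digits,
    # then capture the current line and the line after it.
    pattern = rf"^(?!(?:{names}|\d+)$)(\S.*)\n(.*)"
    return re.findall(pattern, text, re.MULTILINE)

sample = "Cider\n631\n\nHonda\n18813\n\nPine\n10621"
result = pairs_excluding(sample, ["Honda", "Nissan"])
print(result)  # [('Cider', '631'), ('Pine', '10621')]
```

Because the lookahead ends in $, only lines that are exactly one of the names are skipped; a line such as "Honda Civic" would still be matched.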
If the line should start with an uppercase char A-Z and the next line should consist of only digits:
^(?!Honda|Nissan)([A-Z].*)\n(\d+)$
This pattern matches:
^
Start of the string, or of each line when re.MULTILINE is used
(?!Honda|Nissan)
Negative lookahead, assert that neither Honda nor Nissan is directly to the right
([A-Z].*)
Capture group 1, match an uppercase char A-Z followed by the rest of the line
\n
Match a newline
(\d+)
Capture group 2, match 1+ digits
$
End of the string, or of each line when re.MULTILINE is used
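For completeness, here is a small sketch of this stricter pattern run against the same data, assuming re.MULTILINE is passed as in the first example. The variable names pattern and f2 are chosen here for illustration.

```python
import re

text = ("Cider\n631\n\n"
        "Spruce\n871\n\n"
        "Honda\n18813\n\n"
        "Nissan\n3292\n\n"
        "Pine\n10621\n\n"
        "Walnut\n10301")

# The first line must start with A-Z and must not begin with Honda or Nissan;
# the second line must consist of digits only.
pattern = r"^(?!Honda|Nissan)([A-Z].*)\n(\d+)$"
f2 = re.findall(pattern, text, re.MULTILINE)
print(f2)
# [('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]
```

Note that this lookahead has no $ inside it, so it also rejects any line that merely begins with Honda or Nissan.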