I know that Beautiful Soup can match classes against a regex, so you can find classes that contain a certain string, based on a post here. Below is a code example from that post:
import re
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all("div", {"class" : regex}):
    print(EachPart.get_text())
Now, is it possible to do the opposite? Basically, find classes that do not match a certain regex. In SQL terms, it's like:
where class not like '%test%'
Thanks in advance!
CodePudding user response:
This can be done with a negative lookahead.
A negative lookahead has the syntax (?!«pattern») and succeeds if pattern does not match the text that comes immediately after the current location in the input string.
In your case, you could use the following regex to match all classes that don’t contain listing-col- in their name:
regex = re.compile('^((?!listing-col-).)*$')
Here’s a straightforward breakdown of the regex ^((?!listing-col-).)*$ :
^ asserts the position at the start of a line.
((?!listing-col-).)* is a capturing group repeated by *, which matches the previous token between zero and unlimited times, as many times as possible, giving back as needed.
(?!listing-col-) is the negative lookahead. It asserts that listing-col- (matched literally, case sensitive) does not start at the current position.
. matches any character (except a line terminator).
$ asserts the position at the end of a line.
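To see the pattern behave outside of Beautiful Soup, here is a minimal sketch that applies it to a few plain strings. The class names in the list are made up for illustration and are not from the original post.

```python
import re

# Pattern from the answer: matches only strings that never contain "listing-col-".
pattern = re.compile(r'^((?!listing-col-).)*$')

# Hypothetical class names, chosen only to illustrate the behaviour.
class_names = ["listing-col-1", "sidebar", "my-listing-col-wide", "footer"]

# Keep the names the pattern accepts, i.e. those without "listing-col-" anywhere.
kept = [name for name in class_names if pattern.match(name)]
print(kept)
```

Note that the name containing listing-col- in the middle is rejected too, because the lookahead is checked at every character position, not just at the start.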
Also, you may find the https://regex101.com site useful.
It helps you test your patterns and shows a detailed explanation of each step. It's your best friend when writing regular expressions.
CodePudding user response:
One possible solution is utilizing regex directly. You can refer to Regular expression to match a line that doesn't contain a word.
Or you can write a function that implements the logic and pass it to find_all as a parameter.
You can refer to https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#find-all
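A minimal sketch of the function approach might look like the following. The HTML snippet and the helper name div_without_listing_col are hypothetical, invented for this example. The function receives the whole tag and inspects all of its classes at once, which seems safer than an attribute filter: as far as I understand, Beautiful Soup tests a class filter against each class of a multi-class tag separately, so a negative regex could let through a tag such as class="listing-col-2 highlight".

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for demonstration.
html = """
<div class="listing-col-1">a</div>
<div class="sidebar">b</div>
<div class="listing-col-2 highlight">c</div>
<div>d</div>
"""
soup = BeautifulSoup(html, "html.parser")

def div_without_listing_col(tag):
    """Return True for <div> tags where no class contains 'listing-col-'."""
    if tag.name != "div":
        return False
    # "class" is a multi-valued attribute, so this is a list (empty if absent).
    classes = tag.get("class", [])
    return not any("listing-col-" in c for c in classes)

texts = [div.get_text() for div in soup.find_all(div_without_listing_col)]
print(texts)
```

This sketch also keeps a div that has no class attribute at all. If that is not wanted, adding a tag.has_attr("class") check to the function would exclude it.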
CodePudding user response:
You can use CSS selector syntax with the :not() pseudo-class and the *= (contains) attribute operator:
data = [i.get_text() for i in soup.select('div[class]:not([class*="listing-col-"])')]