Beautiful soup if class not like "string" or regex

Time:07-22

I know that Beautiful Soup can match classes against a regex, for example to find classes that contain a certain string, based on a post here. Below is a code example from that post:

import re

# Match <div> tags whose class contains "listing-col-"
regex = re.compile('.*listing-col-.*')
for each_part in soup.find_all("div", {"class": regex}):
    print(each_part.get_text())

Now, is it possible to do the opposite? Basically, find elements whose class does not match a certain regex. In SQL terms, it would be something like:

where class not like '%test%'

Thanks in advance!

CodePudding user response:

This can be done with a negative lookahead.

A negative lookahead has the syntax (?!«pattern») and succeeds only if pattern does not match the text that follows the current position in the input string.

In your case, you could use the following regex to match all classes that don’t contain listing-col- in their name:

regex = re.compile(r'^((?!listing-col-).)*$')

Here is a breakdown of the regex ^((?!listing-col-).)*$:

  • ^ asserts position at the start of the string (or of a line in multiline mode)
  • Capturing group ((?!listing-col-).)*
    • Negative lookahead (?!listing-col-) asserts that the literal text listing-col- (case sensitive) does not start at the current position
    • . matches any single character except a line break
    • * repeats the group zero or more times, as many times as possible, giving back as needed
  • $ asserts position at the end of the string (or of a line in multiline mode)
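To check the pattern outside of Beautiful Soup, here is a minimal sketch using only the re module. The class names in the list are made up for illustration.

```python
import re

# Same pattern as above: no position in the string may start "listing-col-".
regex = re.compile(r'^((?!listing-col-).)*$')

# Hypothetical class names, for illustration only.
class_names = ["listing-col-1", "my-listing-col-left", "sidebar", ""]

# Keep only the names the pattern accepts.
accepted = [name for name in class_names if regex.match(name)]
print(accepted)
```

Only the names without listing-col- anywhere in them are kept. The empty string is accepted too, because the group is allowed to repeat zero times.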

Also, you may find https://regex101.com useful.

It will help you test your patterns and show you a detailed explanation of each step. It's your best friend in writing regular expressions.

CodePudding user response:

One possible solution is to use a regex directly. See the question "Regular expression to match a line that doesn't contain a word".

Alternatively, you can write a function that implements the logic and pass it to find_all as an argument. See https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#find-all
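A rough sketch of the function approach, assuming a small made-up HTML snippet. The helper name div_without_listing_col is hypothetical, and skipping tags that have no class attribute at all is a choice made here, not something the question specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
html = """
<div class="listing-col-1">a</div>
<div class="sidebar">b</div>
<div class="sidebar listing-col-2">c</div>
<div>d</div>
"""

def div_without_listing_col(tag):
    """True for <div> tags with a class attribute where no class contains 'listing-col-'."""
    if tag.name != "div" or not tag.has_attr("class"):
        return False
    # "class" is multi-valued, so tag["class"] is a list of class names.
    return all("listing-col-" not in cls for cls in tag["class"])

soup = BeautifulSoup(html, "html.parser")
texts = [tag.get_text() for tag in soup.find_all(div_without_listing_col)]
print(texts)
```

Because the function inspects every class name on the tag, a div such as class="sidebar listing-col-2" is rejected as a whole. As far as I can tell, a regex passed to find_all is tried against each class name separately, so this explicit check is the safer option when elements carry several classes.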

CodePudding user response:

You can use CSS selector syntax with the :not() pseudo-class and the *= (contains) attribute operator:

data = [i.get_text() for i in soup.select('div[class]:not([class*="listing-col-"])')]
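For reference, a self-contained sketch of the selector approach on made-up markup. It assumes Beautiful Soup 4.7 or later, where select() is backed by the soupsieve package.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
html = """
<div class="listing-col-1">a</div>
<div class="sidebar">b</div>
<div class="sidebar listing-col-2">c</div>
<div>d</div>
"""

soup = BeautifulSoup(html, "html.parser")

# div[class] requires a class attribute; :not([class*=...]) drops any div
# whose class attribute contains the substring "listing-col-".
selector = 'div[class]:not([class*="listing-col-"])'
data = [tag.get_text() for tag in soup.select(selector)]
print(data)
```

Dropping the [class] part of the selector would also return divs that have no class attribute at all.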