BeautifulSoup find partial string in section

Time:12-02

I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The HTML I'm scraping is full of sections that look something like this:

<section >
 <a href="https://longGibberishDownloadURL" title="Download">
  <img src="\azure_storage_blob\includes\download_for_windows.png"/>
 </a>
 sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>

The second-to-last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search with a partial string in order to pull out the correct download URL. The code I have attempted so far looks something like this:

import requests, re
from bs4 import BeautifulSoup

reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))

I've been searching StackOverflow for a long time now, and while there are similar questions and answers, none of the proposed solutions (re.compile, lambdas, etc.) have worked for me so far. I can pull up a section if I remove the string argument, but as soon as I include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to match on partial text, for example to find the filename that has "CIcyano" somewhere in it (see the second-to-last line of the HTML example at the top).

I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of them really works. I was able to pull up other sections from the HTML using these solutions, but something about this filename string with all the periods seems to prevent it from working. Or maybe it's the way the filename is positioned within the section? Perhaps I'm going about this the wrong way entirely.

Is this perhaps considered part of the section id, so that the string argument can't find it? A section on the page that I AM able to find has HTML like the example below, and I can easily find it using the string argument and re.compile with "Name", "^N", etc.

<section >
 <h3>
  Name
 </h3>
</section>

I'd appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.

Here is the full HTML of the page I'm scraping, if that helps clarify the structure I'm working against.

CodePudding user response:

I believe you are overthinking this. Just remove the regular expression part, take the section's text, and you will be fine.

import requests
from bs4 import BeautifulSoup

reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
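
Since the page has many such sections, the same idea can be extended to check each section's text for the partial string and then read the link from the one that matches. Below is a minimal sketch of that, not a definitive solution. It assumes every section carries the class from the question's find() call; the inline HTML, the example.com URLs, the second filename, and the helper name find_download_url are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question. The class value comes from the
# question's own find() call; the URLs and the second section are invented.
html = """
<section class="onecol habonecol">
 <a href="https://example.com/first" title="Download"><img src="a.png"/></a>
 sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
<section class="onecol habonecol">
 <a href="https://example.com/second" title="Download"><img src="b.png"/></a>
 sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.truecolor.LakeOkee.tif
</section>
"""


def find_download_url(soup, needle):
    """Return the href of the first section whose text contains needle."""
    for section in soup.find_all('section', attrs={'class': 'onecol habonecol'}):
        # get_text() joins all text inside the section, including the filename
        if needle in section.get_text():
            link = section.find('a', href=True)
            if link is not None:
                return link['href']
    return None


soup = BeautifulSoup(html, 'html.parser')
url_found = find_download_url(soup, 'CIcyano')
print(url_found)
```

A plain substring test avoids regular expressions entirely, so the periods in the filename need no escaping.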

CodePudding user response:

You can query inside the section for the string you want, like so:

soup.find('section', attrs={'class':'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))

This regular expression matches any text node that contains "sentinel". Be careful: the text also contains surrounding characters such as spaces, which is why there is a . at the beginning of the regex. You may want a more robust regex, which you can test here: https://regex101.com/
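
To tie this back to the original attempt, here is a small sketch of why the combined search returns None and how the text-node search can be used to reach the link. As far as I understand the Beautiful Soup documentation, passing both a tag name and string= compares the pattern against the tag's .string, which is None when the tag has more than one child, and a compiled pattern is applied with its search() method. The inline HTML and the example.com URL are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modeled on the question; the URL is invented.
html = """
<section class="onecol habonecol">
 <a href="https://example.com/download" title="Download"><img src="a.png"/></a>
 sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
"""
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile(r'CIcyano')

# Tag name plus string=: the section has several children (whitespace, the
# <a> tag, the filename text), so its .string is None and nothing matches.
direct = soup.find('section', string=pattern)

# string= on its own returns the matching text node itself.
text_node = soup.find(string=pattern)

# Walk up from the text node to its enclosing section, then down to the link.
section = text_node.find_parent('section')
href = section.find('a', href=True)['href']
filename = text_node.strip()

print(direct)
print(filename)
print(href)
```

Searching for the text node first and then calling find_parent() means the lookup works across every section on the page, not just the first one returned by find().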
