How do I use find_all or select more precisely in this case?

Time:12-14

When I run the following code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

Fighter1Main = []
page = 1  # first event page to fetch
for i in range(1, 3):
    url = Request(f"https://www.sherdog.com/events/a-{page}", headers={'User-Agent': 'Mozilla/5.0'})
    response = urlopen(url).read()
    soup = BeautifulSoup(response, "html.parser")
    for test2 in soup.find_all(class_="fighter left_side"):
        test3 = test2.find_all(itemprop="url")
        Fighter1Main.append(test3)
    page = page + 1  # move on to the next event page

I get:

[[<a href="/fighter/Todd-Medina-61" itemprop="url">
<img alt="Todd 'El Tiburon' Medina" itemprop="image" src="/image_crop/200/300/_images/fighter/20140801074225_IMG_5098.JPG" title="Todd 'El Tiburon' Medina">
</img></a>], [<a href="/fighter/Ricco-Rodriguez-8" itemprop="url">
<img alt="Ricco 'Suave' Rodriguez" itemprop="image" src="/image_crop/200/300/_images/fighter/20141225125221_1MG_9472.JPG" title="Ricco 'Suave' Rodriguez">
</img></a>]]

But I was expecting:

<a href="/fighter/Todd-Medina-61" itemprop="url">
<a href="/fighter/Ricco-Rodriguez-8" itemprop="url">

This is the type of webpage in question https://www.sherdog.com/events/a-1

I also tried using a CSS selector with select and got the same result.

for test2 in soup.select('.fighter.left_side [itemprop="url"]'):
    Fighter1Main.append(test2)

I thought I was using it correctly but I'm not sure how else to narrow it down to what I want.

CodePudding user response:

If your issue is that you're getting a list of lists and you just want a flat list, extend the list instead of appending to it:

for test2 in soup.find_all(class_="fighter left_side"):
    Fighter1Main += test2.find_all(itemprop="url")

But since you weren't happy with the output from for test2 in soup.select('.fighter.left_side [itemprop="url"]'): Fighter1Main.append(test2) either, and going by your title, I'm guessing that isn't the problem here.
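To make the difference between appending and extending concrete, here is a small sketch that runs on a hand-written snippet instead of the live page. SAMPLE_HTML is an assumption modelled on the output shown in the question; the real Sherdog markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's output, not the real page.
SAMPLE_HTML = """
<div class="fighter left_side">
  <a href="/fighter/Todd-Medina-61" itemprop="url"><img itemprop="image" src="/a.jpg"></a>
</div>
<div class="fighter left_side">
  <a href="/fighter/Ricco-Rodriguez-8" itemprop="url"><img itemprop="image" src="/b.jpg"></a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

nested = []  # built with append: one inner list per matched block
flat = []    # built with +=: the tags themselves

for block in soup.find_all(class_="fighter left_side"):
    links = block.find_all(itemprop="url")
    nested.append(links)  # adds the whole result list as a single item
    flat += links         # adds each tag in the result list

print(len(nested), type(nested[0]).__name__)
print([a["href"] for a in flat])
```

Either way, each item still contains its nested img tag, which is why flattening alone doesn't give the output asked for.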


If you want to filter out any tags that have another tag nested inside them, you can add :not(:has(*)) to your selector:

for test2 in soup.select('.fighter.left_side *[itemprop="url"]:not(:has(*))'):
    Fighter1Main.append(test2)

However, you can expect an empty list if you do this because (as far as I can tell) every tag matched by .fighter.left_side *[itemprop="url"] has an img tag nested within it.
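A quick sketch of what :not(:has(*)) does, again on made-up markup. The first link mirrors the question's output; the second, text-only link is invented purely for contrast and is not something I've seen on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link wrapping an <img>, one invented text-only link.
SAMPLE_HTML = """
<div class="fighter left_side">
  <a href="/fighter/Todd-Medina-61" itemprop="url"><img itemprop="image" src="/a.jpg"></a>
  <a href="/fighter/Text-Only-0" itemprop="url">text only</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every tag with itemprop="url" inside the left-side block.
all_links = soup.select('.fighter.left_side *[itemprop="url"]')

# Same, but only tags that have no element nested inside them.
leaf_links = soup.select('.fighter.left_side *[itemprop="url"]:not(:has(*))')

print(len(all_links))
print([a["href"] for a in leaf_links])
```

Only the link without a child element survives the stricter selector, so on a page where every such link wraps an img the result would be empty.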



If you really want something like your expected output, you'll have to either alter the soup or build it up yourself.


You can either remove everything inside the Tags with itemprop="url" [original soup object will be altered]:

for test2 in soup.select('.fighter.left_side *[itemprop="url"]'):
    test2.clear()
    Fighter1Main.append(test2)
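Here is a minimal sketch of the clear() approach and its side effect, using an assumed snippet in place of the fetched page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's output, not the real page.
SAMPLE_HTML = """
<div class="fighter left_side">
  <a href="/fighter/Todd-Medina-61" itemprop="url">
    <img itemprop="image" src="/a.jpg">
  </a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

emptied = []
for tag in soup.select('.fighter.left_side *[itemprop="url"]'):
    tag.clear()  # removes everything inside the tag, in the soup itself
    emptied.append(tag)

print(emptied[0])
print(soup.find("img"))  # the image no longer exists anywhere in the soup
```

The tags now print without their contents, but anything you later look up in the same soup (the img alt text, for instance) is gone as well.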

Or you could form new html tags with only the href [if there is any] and itemprop attributes [original soup object will remain unaltered, but you'll be parsing and extracting again for each item]:

soup = BeautifulSoup(response, "html.parser")
# Rebuild each matched tag as a fresh, empty tag carrying only href (if present) and itemprop
Fighter1Main += [BeautifulSoup(
    f'<{n}{h} itemprop="url"></{n}>', "html.parser"
).find(n) for n, h in [(
    t.name, '' if t.get("href") is None else f' href="{t.get("href")}"'
) for t in soup.select('.fighter.left_side *[itemprop="url"]')]]
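As a possible third route, not one of the two options above, you could copy each matched tag with copy.copy and clear the copy. This sketch assumes the same made-up markup as a stand-in for the page. Note that the copy keeps all of the tag's attributes, not just href and itemprop.

```python
import copy

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's output, not the real page.
SAMPLE_HTML = """
<div class="fighter left_side">
  <a href="/fighter/Todd-Medina-61" itemprop="url">
    <img itemprop="image" src="/a.jpg">
  </a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

stripped = []
for tag in soup.select('.fighter.left_side *[itemprop="url"]'):
    clone = copy.copy(tag)  # detached copy; the tag in the soup is not touched
    clone.clear()           # empty only the copy
    stripped.append(clone)

print(stripped[0])
print(soup.find("img") is not None)  # the original soup still has its image
```

This leaves the original soup intact without building an HTML string and parsing it again for each item.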