How to refine and limit BeautifulSoup results


So I'm stuck here. I'm a doctor, so my programming background and skills are close to none, and most likely that's the problem. I'm trying to learn some Python basics, and for me the best way is by doing stuff.

The project:

  • scrape the cover images from several books

Some of the links used:

http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html
http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html
http://coleccaoargonauta.blogspot.com/2011/09/n-3-ultima-cidade-da-terra.html
http://coleccaoargonauta.blogspot.com/2011/09/n-4-nave-sideral.html
http://coleccaoargonauta.blogspot.com/2011/09/n-5-o-universo-vivo.html

That website's structure is messed up. The links are located inside a div with class "post-title entry-title", which in turn has two or more "separator"-class div's that can have content or be empty. What I can tell so far is that 95% of the time, what I want is the last two links in the first two "separator" div's, and for this stage that's good enough.

My code so far is as follows:

import requests
from bs4 import BeautifulSoup

#intro
url = "http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html"  #one of the links above
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

#select the first two 'separator' divs
separador = soup.select("div.separator")[:2]

#we need a title for each page - for debugging and later used to rename images      
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print (m)

#the find all links loop
for div in separador:
  imagens = div.find_all('a')
  for link in imagens:
    print (link['href'], '\n')

What I can do right now:

  • I can print the right URLs, and I can then use wget to download and rename the files. However, I only want the last two links from the results, and that's the only thing missing from my google-fu. I think the problem is the way BeautifulSoup returns results (a ResultSet) and my lack of knowledge of things such as lists. If the first "separator" has one link and the second has two, I get a list with two items (and the second item holds two links), hence it's not sliceable as a flat list (see the sketch below).
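
To make the nesting concrete, here is a minimal sketch reusing the separador variable from the code above (the [1, 2] shape is just an example):

from itertools import chain

#each div contributes its own list of <a> tags, so the combined result
#is nested, e.g. [[a1], [a2, a3]] rather than one flat list of links
nested = [div.find_all('a') for div in separador]
print([len(sub) for sub in nested])  #e.g. [1, 2]

#flattening removes the outer structure, and slicing works again
flat = list(chain.from_iterable(nested))
print([link['href'] for link in flat[-2:]])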

Example output

2-O Estranho Mundo de Kilsona.jpg
http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2 - O Estranho Mundo de Kilsona.jpg 

http://4.bp.blogspot.com/-D0cUIP8PkEU/UPfbByjSuII/AAAAAAAAB0E/LP6kbIEJ_eI/s1600/Argonauta002.jpg 

http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2 - O Estranho Mundo de Kilsona.jpg 

But I want it to be:

2-O Estranho Mundo de Kilsona.jpg

http://4.bp.blogspot.com/-D0cUIP8PkEU/UPfbByjSuII/AAAAAAAAB0E/LP6kbIEJ_eI/s1600/Argonauta002.jpg 

http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2 - O Estranho Mundo de Kilsona.jpg 

Can anyone shed some light on this?

CodePudding user response:

The issue is that the line imagens = div.find_all('a') is called within a loop, which creates a list of lists. As such, we need to flatten them into a single list; I do this below by extending merged_list with each sub-list.

From here I then create a new list with just the links, and dedupe it using set (a set is a useful data structure when you don't want duplicated data). I then turn it back into a list, and from there it's back to your code.

import requests
from bs4 import BeautifulSoup

link1 = "http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html"
link2 = "http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html"
link3 = "http://coleccaoargonauta.blogspot.com/2011/09/n-3-ultima-cidade-da-terra.html"
link4 = "http://coleccaoargonauta.blogspot.com/2011/09/n-4-nave-sideral.html"
link5 = "http://coleccaoargonauta.blogspot.com/2011/09/n-5-o-universo-vivo.html"


#intro
r=requests.get(link2)
soup = BeautifulSoup(r.content, 'lxml')

#select the first two 'separator' divs
separador = soup.select("div.separator")[:2]

#we need a title for each page - for debugging and later used to rename images      
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print (m)

#find_all returns one list of links per div, so flatten them into one list
imagens = [div.find_all('a') for div in separador]
merged_list = []
for sublist in imagens:
    merged_list.extend(sublist)
link_list = [link['href'] for link in merged_list]
deduped_list = list(set(link_list))
for link in deduped_list:
    print(link, '\n')
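
One caveat with the set approach: a set does not preserve insertion order, so the deduped links can come out in any order. If document order matters (for example, to take only the last two links), list(dict.fromkeys(...)) is a standard order-preserving dedupe on Python 3.7+:

#dict keys keep insertion order, so this dedupes link_list without
#reshuffling it
deduped_list = list(dict.fromkeys(link_list))

#document order is preserved, so the last two links are a plain slice
for link in deduped_list[-2:]:
    print(link, '\n')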

CodePudding user response:

You can use CSS selectors to extract the image links directly from the div's with class separator (see the BeautifulSoup documentation on CSS selectors).

I also use a list comprehension instead of a for loop.

Below is a working example for the first URL from your list.


import requests
from bs4 import BeautifulSoup

#intro
url = "http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html"
r=requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')


#we need a title for each page - for debugging and later used to rename images      
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print (m)

#find all links
links = [link['href'] for link in soup.select('.separator a')]
print(links)
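
Since you only want the last two links, a plain slice on links does it. As a rough sketch (this uses requests in place of the wget step mentioned in the question, and the filename scheme is just an assumption), downloading could then look like:

import os

#keep only the last two links, per the question
last_two = links[-2:]

#sketch of a download step: fetch each image and save it under its
#URL's basename (adapt the naming as needed)
for href in last_two:
    resp = requests.get(href)
    with open(os.path.basename(href), 'wb') as f:
        f.write(resp.content)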