images = []
try:
    images = [img["src"] for img in soup.select(".img-lazy")]
except:
    images = [img["src"] for img in soup.select(".img-thumb")]
else:
    images = [img["src"] for img in soup.select(".author-bio")]
I am trying to scrape image src values from different pages. It works fine with only try and except, but some pages put their images under a different class name, so I tried adding another except clause. That raised an error, so I added an else clause instead. But now it only scrapes the data from the else branch. I want it to look for .img-lazy first, then .img-thumb, and finally .author-bio.
CodePudding user response:
First of all: You should (almost) never use bare except clauses like that. Details about this are all over this platform.
In this case, you are shooting yourself in the foot, because you can't know what exception exactly is raised when that except is triggered.
Also, with this logic, whenever the code inside the try block executes without a problem (thus assigning the images variable), the except block is skipped and then the else block is executed. This results in the images variable being re-assigned (i.e. overwritten) in that block. Note that select simply returns an empty list when nothing matches, so the try block will usually succeed even on pages that have no .img-lazy elements.
This is the logic behind try-except-else constructs. (You should read up on that.)
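To illustrate that control flow, here is a minimal sketch in plain Python. The function name run and its raise_error parameter are made up for this illustration; it only records which branches execute.

```python
def run(raise_error):
    """Record which branches of a try/except/else construct execute."""
    visited = []
    try:
        visited.append("try")
        if raise_error:
            raise KeyError("src")
    except KeyError:
        # Runs only when the try block raised a KeyError.
        visited.append("except")
    else:
        # Runs only when the try block finished without raising.
        visited.append("else")
    return visited


print(run(raise_error=False))  # ['try', 'else']
print(run(raise_error=True))   # ['try', 'except']
```

So except and else are mutually exclusive, and a successful try is always followed by else. That is why the images list from the question ends up holding only the .author-bio results.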
If I understand your requirements and the documentation for select correctly, you can just do this instead of that whole try-except mess:
images = [img["src"] for img in soup.select(".img-lazy,.img-thumb,.author-bio")]
That select call should return all elements that match any of those class selectors.
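As a quick check of that behavior, here is a small sketch. The HTML snippet and file names are invented for the example, and it assumes BeautifulSoup 4 with the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Made-up markup: two images match the selector group, one does not.
html = """
<img class="img-thumb" src="thumb.jpg">
<img class="img-lazy" src="lazy.jpg">
<img class="other" src="other.jpg">
"""
soup = BeautifulSoup(html, "html.parser")

# A comma-separated selector list matches elements with ANY of the classes.
images = [img["src"] for img in soup.select(".img-lazy,.img-thumb,.author-bio")]
print(images)  # ['thumb.jpg', 'lazy.jpg']
```

Note that the matches come back in document order, not in the order the selectors are written.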
However, I would be careful here, unless you know for certain that every HTML element with any of those classes is in fact an <img> (or more specifically, has a src attribute). Because if any of them does not have a src attribute, that code will raise a KeyError at this point: img["src"]
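The following sketch shows that failure mode on invented markup, where a <div> carries one of the classes. The variable names error and safe are only for illustration; has_attr is one way to skip elements without src, assuming you would rather drop them than fail.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second element has a matching class but no src.
html = '<img class="img-lazy" src="a.jpg"><div class="author-bio">Bio</div>'
soup = BeautifulSoup(html, "html.parser")
elements = soup.select(".img-lazy,.author-bio")

error = None
try:
    srcs = [el["src"] for el in elements]
except KeyError as exc:
    # The <div> has no src attribute, so subscripting it raises KeyError.
    error = exc

# Filtering on has_attr avoids the exception by skipping such elements.
safe = [el["src"] for el in elements if el.has_attr("src")]
print(type(error).__name__, safe)  # KeyError ['a.jpg']
```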
I would suggest being as precise as possible with the selector:
images = [
    img["src"]
    for img in soup.select(
        "img[src].img-lazy,img[src].img-thumb,img[src].author-bio"
    )
]
For example, img[src].img-lazy will only grab <img> tags that have a src attribute and the class img-lazy.
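If the goal is really a fallback in priority order (use .img-lazy if present, otherwise .img-thumb, otherwise .author-bio) rather than collecting all matches at once, one possible approach is to try the selectors one by one and stop at the first that yields results. This is only a sketch: the helper name first_matching_srcs and the sample HTML are hypothetical, not something from the question.

```python
from bs4 import BeautifulSoup


def first_matching_srcs(soup, selectors):
    """Return the src values for the first selector that matches anything."""
    for selector in selectors:
        srcs = [img["src"] for img in soup.select(selector)]
        if srcs:
            return srcs
    return []  # none of the selectors matched


# Made-up markup: the img-lazy element is a <div>, so img[src].img-lazy skips it.
html = """
<div class="img-lazy">not an image</div>
<img class="img-thumb" src="thumb.jpg">
<img class="author-bio" src="bio.jpg">
"""
soup = BeautifulSoup(html, "html.parser")
selectors = ["img[src].img-lazy", "img[src].img-thumb", "img[src].author-bio"]
result = first_matching_srcs(soup, selectors)
print(result)  # ['thumb.jpg']
```

Because each selector requires an <img> with a src attribute, the subscript cannot raise KeyError here, and no try-except is needed at all.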