How can I find the src of the first img on a page?

Time:04-22

I'm building a web scraper. The piece of code below works, meaning that it actually finds what I'm looking for, which is the main picture (always the first one) in the article.

picture = []
for item in body.find_all('img'):
    picture.append(item['src'])
    break

Is there a simpler and smoother way to do what I'm doing? I've tried:

picture = body.find('img', ['src'])

Which just returns "None".

CodePudding user response:

Try this:

picture_src = body.find('img').attrs['src']

print(picture_src)
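
If the page might contain no img at all, or the first img has no src, the one-liner above raises an exception. Here is a minimal sketch of a more defensive variant, assuming `body` is a BeautifulSoup tag. The sample HTML and the helper name `first_img_src` are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only for illustration
html = '<div><p>intro</p><img src="main.jpg"><img src="other.jpg"></div>'
body = BeautifulSoup(html, 'html.parser')


def first_img_src(tag):
    """Return the src of the first img under tag, or None if unavailable."""
    img = tag.find('img')  # find() returns None when nothing matches
    if img is None:
        return None
    return img.get('src')  # .get() returns None instead of raising KeyError


print(first_img_src(body))
```

As for the attempt in the question: as far as I know, a non-dict second argument to find() is treated as a CSS class filter, so body.find('img', ['src']) looks for an img whose class is "src" and comes back with None.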

CodePudding user response:

picture = []
for item in body.find_all('img'):
    picture.append(item['src'])
    break

Let's work through it. First off, the "break" is what stops the loop after the first image. If we remove it and just fall off the end of the 'for' loop, we collect the src of every img instead:

picture = []
for item in body.find_all('img'):
    picture.append(item['src'])

Okay, now the Python convention here is to use a list comprehension, as Mathias suggested in the comment.

A list comprehension example:

doubled = [item * 2 for item in [1, 2, 3, 4]]
print(doubled)

Would give:

[2, 4, 6, 8]

Compare that to Mathias's solution:

pictures = [item['src'] for item in body.find_all('img')]

Note that this raises a KeyError if any img lacks a src attribute, which is also a potential defect in the original solution. Filtering on the attribute avoids that:

pictures = [item['src'] for item in body.find_all('img') if item.has_attr('src')]

That is actually more complicated than I usually want a single line to be, because reading this code later would take a few seconds of thinking it through. Easy fix if you can trust past you: add a comment.

# list of all img src attributes (img tags without a src are skipped)
pictures = [item['src'] for item in body.find_all('img') if item.has_attr('src')]
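
Since the question only wants the main picture, there is no need to build the whole list. A small sketch, assuming `body` is a BeautifulSoup tag and using made-up sample markup: passing src=True to find() asks for the first img that actually has a src attribute. Note this skips a leading img without a src rather than failing on it, which may or may not be what you want.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the first img deliberately has no src
html = '<div><img alt="no source"><img src="main.jpg"><img src="other.jpg"></div>'
body = BeautifulSoup(html, 'html.parser')

# src=True matches only img tags that have a src attribute
first = body.find('img', src=True)

# find() returns None when nothing matches, so guard before indexing
picture = first['src'] if first is not None else None
print(picture)
```

Here `picture` is a plain string (or None) instead of a one-element list, so the caller does not have to unpack it.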