I'm building a web scraper. The piece of code below works, meaning that it actually finds what I'm looking for, which is the main picture (always the first one) in the article.
picture = []
for item in body.find_all('img'):
    picture.append(item['src'])
    break
Is there a simpler and smoother way to do what I'm doing? I've tried:
picture = body.find('img', ['src'])
Which just returns "None".
CodePudding user response:
Try this:
picture_src = body.find('img').attrs['src']
print(picture_src)
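The one-liner above raises an error if the article has no img at all, or if the first img has no src. Here is a minimal sketch that guards against both cases. It assumes body is a BeautifulSoup object; the sample HTML and the helper name first_picture_src are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped article.
html = '<div><p>Intro</p><img src="main.jpg"><img src="other.jpg"></div>'
body = BeautifulSoup(html, 'html.parser')


def first_picture_src(body):
    """Return the src of the first <img> in body, or None if unavailable."""
    img = body.find('img')   # find() returns None when nothing matches
    if img is None:
        return None
    return img.get('src')    # .get() returns None instead of raising KeyError


picture_src = first_picture_src(body)
print(picture_src)
```

As for the attempt in the question, body.find('img', ['src']) most likely returns None because a second positional argument that is not a dictionary is treated as a CSS class filter, so it looks for an img with class "src" rather than an img with a src attribute.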
CodePudding user response:
picture = []
for item in body.find_all('img'):
    picture.append(item['src'])
    break
Let's work through it. First off, note what the "break" does: it stops the 'for' loop after the first img. Without it, the loop runs to the end and collects the src of every img rather than just the first:
picture = []
for item in body.find_all('img'):
    picture.append(item['src'])
Okay, now the Python convention here is to use a list comprehension like Mathias suggested in the comment.
A list comprehension example:
doubled = [item * 2 for item in [1, 2, 3, 4]]
print(doubled)
Would give:
[2, 4, 6, 8]
Compared to Mathias's solution:
pictures = [item['src'] for item in body.find_all('img')]
Note that this raises a KeyError if any img has no src, which is also a potential defect in the original solution. Filtering on the attribute avoids that. Use has_attr for the check, because 'src' in item tests the tag's children rather than its attributes:
pictures = [item['src'] for item in body.find_all('img') if item.has_attr('src')]
That is actually more complicated than I usually want a single line to be, because reading this code later would take a few seconds to think through. Easy fix if you can trust past you:
# list of all img src attributes
pictures = [item['src'] for item in body.find_all('img') if item.has_attr('src')]
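Since the question only needs the first picture, the filtering can also be pushed into the search itself. The sketch below relies on BeautifulSoup's keyword filters, where src=True matches any tag that has a src attribute. The sample HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the first <img> deliberately has no src.
html = '<div><img alt="no source"><img src="main.jpg"><img src="other.jpg"></div>'
body = BeautifulSoup(html, 'html.parser')

# src=True keeps only tags that carry a src attribute, whatever its value.
pictures = [img['src'] for img in body.find_all('img', src=True)]

# find() returns the first match (or None), which is all the question asks for.
first = body.find('img', src=True)
picture = first['src'] if first is not None else None

print(pictures)
print(picture)
```

This keeps the comprehension short and makes the "first picture only" case a single find() call, at the cost of skipping any img that has no src.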