How would one scrape <figure> tags in bs4?

Time:04-15

I am trying to scrape images from https://nytimes.com. However, most of the images for the main headlines on the site are stored inside a <figure> tag rather than a plain <img> tag with a specific src attribute.

How can I scrape the URLs of the images inside those <figure> tags so that I can aggregate them on my own website?

CodePudding user response:

Because the page content is loaded dynamically, you can render it with Selenium and then use BeautifulSoup to collect the image URLs for the main headlines.

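from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://www.nytimes.com/'
driver.get(url)
driver.maximize_window()
# Parse the HTML as rendered by the browser
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# '.css-cov0u6' is a generated class name from the page at the time of writing and may change
data = []
for im in soup.select('.css-cov0u6 img'):
    src = im.get('src')
    if src:  # skip <img> tags without a src attribute
        data.append(src)

# Print one URL per line
print('\n'.join(data))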

Output:

https://static01.nyt.com/images/2022/04/14/multimedia/14musk-twitter/14musk-twitter-threeByTwoMediumAt2X-v2.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/nyregion/14nyshooting/merlin_205419441_07391422-eea0-4436-97e3-c253e755010a-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/13/climate/00virus-case-counts1/00virus-case-counts1-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/01/world/00africa-france-4/merlin_188413827_06ae2d07-ecd5-4090-ba71-815f5faee66b-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/opinion/14spiers-image/14spiers-image-square320.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/opinion/14reinhart-main/14reinhart-main-square320.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/17/realestate/14HUNT-WINTHUR1/14HUNT-WINTHUR1-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/world/14japan-toddlers1/14japan-toddlers1-threeByTwoMediumAt2X-v3.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/17/magazine/17mag-studies_01/17mag-studies_01-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/13/opinion/13coy-image/13coy-image-square320.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/12/climate/12cli-newsletter-cup-still/12cli-newsletter-cup-still-square320.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/12/opinion/12krugman_newsletter_1/12krugman_newsletter_1-square320.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/12/opinion/12McWhorter-image/12McWhorter-image-square320.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/climate/14cli-cactus1/14cli-cactus1-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/12/well/00well-mental-apps/00well-mental-apps-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/12/well/12WELL-USPSTF-SCREENING2/merlin_181619943_befc32d6-5803-4885-9c6c-4369f50d80ae-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2021/08/10/well/06ASKWELL-ADHD1/06ASKWELL-ADHD1-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2019/05/07/parenting/07-parenting-postpartumdep/07-parenting-postpartumdep-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2017/04/11/science/physed-breathing/physed-breathing-videoSixteenByNine1050.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/17/arts/17skarsgard1/17skarsgard1-threeByTwoMediumAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/03/01/world/00Israel-Art01/merlin_202656597_dc718c26-d9ff-45c5-a300-90800c78ac10-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/fashion/14ASHLEY1/14ASHLEY1-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/14/arts/13Fidelio-deaf-9/17INTIMACY-BALLET-9-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/04/13/dining/08Appe1/merlin_204759306_c259077b-a1ec-47ac-bb51-4113304a3282-threeByTwoSmallAt2X.jpg?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2019/04/18/homepage/spelling-bee-logo-bulletin/spelling-bee-logo-bulletin-square320-v5.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2022/03/02/crosswords/alpha-wordle-icon-new/alpha-wordle-icon-new-square320-v2.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2020/03/23/crosswords/crossword-logo-nytgames-hires/crossword-logo-nytgames-hires-square320-v3.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2021/08/03/crosswords/nyt-games-homepage-playmodule-subscribe/nyt-games-homepage-playmodule-subscribe-square320.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2021/05/27/multimedia/alpha-letterboxed-promo-1622145789727/alpha-letterboxed-promo-1622145789727-square320.png?format=pjpg&quality=75&auto=webp&disable=upscale
https://static01.nyt.com/images/2020/03/23/crosswords/tiles-logo-nytgames-hi-res/tiles-logo-nytgames-hi-res-square320-v4.png?format=pjpg&quality=75&auto=webp&disable=upscale

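Note: ChromeDriverManager comes from the webdriver-manager package, which can be installed with pip install webdriver-manager.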
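Since the question asks specifically about <figure> tags, here is a minimal sketch of selecting every <img> nested inside a <figure> with the CSS selector 'figure img', instead of relying on a generated class name. The sample markup, the example.com URLs, and the helper names first_srcset_url and figure_image_urls are assumptions made for illustration; the real NYT markup will differ, and the srcset handling is a naive comma split that only picks the first candidate.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a rendered page; the real HTML will differ.
SAMPLE_HTML = """
<figure>
  <picture>
    <source srcset="https://example.com/a-small.jpg 320w, https://example.com/a-large.jpg 1024w">
    <img src="https://example.com/a.jpg" alt="first">
  </picture>
</figure>
<figure>
  <img srcset="https://example.com/b-1x.jpg 1x, https://example.com/b-2x.jpg 2x" alt="second">
</figure>
<figure><figcaption>No image here</figcaption></figure>
<img src="https://example.com/outside.jpg" alt="not in a figure">
"""


def first_srcset_url(srcset):
    """Return the first URL listed in a srcset attribute value (naive split)."""
    first_candidate = srcset.split(",")[0].strip()
    return first_candidate.split()[0] if first_candidate else None


def figure_image_urls(html):
    """Collect one image URL per <img> found inside a <figure> tag."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.select("figure img"):
        # Prefer src; fall back to the first srcset candidate if src is missing.
        url = img.get("src")
        if not url and img.get("srcset"):
            url = first_srcset_url(img["srcset"])
        if url:
            urls.append(url)
    return urls


print(figure_image_urls(SAMPLE_HTML))
```

With the sample markup this prints the two URLs found inside <figure> tags and skips both the empty figure and the image outside any figure. To use it with the Selenium code above, pass driver.page_source to figure_image_urls before closing the browser.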
