Web scraping article and image links using Python

Time:11-08

I am trying to scrape all the image and article links from 'https://www.telegraphindia.com/search?keyword=obama&page=2' and save them in an Excel file.

I have tried the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.telegraphindia.com/search?keyword=obama&page=2'
page = requests.get(url)
bsoup = BeautifulSoup(page.content, 'html.parser')

# Collect every anchor tag and print its href, if it has one.
l = bsoup.find_all('a')

for j in l:
    if 'href' in j.attrs:
        print(str(j.attrs['href']) + "\n")

Kindly help.

CodePudding user response:

Try this:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}


def get_image_source(div) -> str:
    # Prefer "data-src" when it is set; otherwise fall back to "src".
    img = div.select_one("a > img")
    return img.get("data-src") or img.get("src")


url = 'https://www.telegraphindia.com/search?keyword=obama&page=2'
soup = (
        BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
        .find_all("div", class_="col-5")
)
data = [[div.select_one("a").get("href"), get_image_source(div)] for div in soup]

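for row in data[1:]:
    # Hrefs on this page are site-relative, so prefix the domain.
    # "path" avoids shadowing the "url" variable defined above.
    path, image = row
    print(f"https://www.telegraphindia.com{path}\n{image}")
    print("-" * 100)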

Output:

https://www.telegraphindia.com/culture/adeles-album-30-makes-one-follow-up-with-frank-sinatras-in-the-wee-small-hours/cid/1839846
https://assets.telegraphindia.com/telegraph/2021/Nov/1637428650_adele0-1.jpg
----------------------------------------------------------------------------------------------------
https://www.telegraphindia.com/my-kolkata/try-this/read/renegades-born-in-the-usa/cid/1838436
https://assets.telegraphindia.com/telegraph/2021/Nov/1636642080_read.jpg
----------------------------------------------------------------------------------------------------
https://www.telegraphindia.com/world/barack-obama-critical-of-xi-putin-for-cop26-snub/cid/1837975
https://assets.telegraphindia.com/telegraph/2020/Nov/1605635180_obama-1.jpg
----------------------------------------------------------------------------------------------------
https://www.telegraphindia.com/opinion/american-diaries-20-years-of-9-11-cafes-launching-pumpkin-spice-lattes-labour-day-in-the-us/cid/1830195
https://assets.telegraphindia.com/telegraph/2021/Sep/1631298515_11editdiarylead.jpg
----------------------------------------------------------------------------------------------------
https://www.telegraphindia.com/opinion/the-us-withdrawal-has-left-afghanistan-in-a-mess/cid/1829113
https://assets.telegraphindia.com/telegraph/2021/Sep/1630608951_3edittop.jpg
----------------------------------------------------------------------------------------------------
https://www.telegraphindia.com/world/lawmakers-urge-biden-to-postpone-full-pullout/cid/1827060
https://assets.telegraphindia.com/telegraph/2021/Feb/1612897964_10biden_new.jpg
----------------------------------------------------------------------------------------------------
https://www.telegraphindia.com/world/as-taliban-capture-cities-us-says-afghan-forces-must-fend-for-themselves/cid/1825951
https://assets.telegraphindia.com/telegraph/2021/Aug/1628541650_1626192574_1625850070_afghan-taliban-1.jpg
----------------------------------------------------------------------------------------------------
https://www.telegraphindia.com/culture/books/books-to-immerse-yourselves-in/cid/1821731
https://assets.telegraphindia.com/telegraph/2021/Jul/1625760902_boo-lead.jpg
----------------------------------------------------------------------------------------------------
https://www.telegraphindia.com/opinion/indias-leaders-must-realize-that-foreign-policy-begins-at-home/cid/1818514
https://assets.telegraphindia.com/telegraph/2021/Jun/1623431345_12edittop.jpg
----------------------------------------------------------------------------------------------------
https://www.telegraphindia.com/opinion/wounded-world-time-for-a-new-compact-with-nature/cid/1815808
https://assets.telegraphindia.com/telegraph/2021/May/1621193814_17edittopnew.jpg
----------------------------------------------------------------------------------------------------
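
The question also asks for the links to be saved to an Excel file, which the code above only prints. Below is a minimal sketch of that last step, assuming `data` has the same shape as in the answer (a list of `[relative_url, image_url]` pairs). It writes a CSV file, which Excel opens directly, using only the standard library. The `save_rows` helper, the `links.csv` filename, and the sample row are hypothetical and not part of the original answer.

```python
import csv

BASE_URL = "https://www.telegraphindia.com"


def save_rows(rows, path="links.csv"):
    """Write (article path, image URL) pairs to a CSV file Excel can open."""
    # newline="" stops the csv module from emitting blank lines on Windows.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["article_url", "image_url"])
        for article_path, image_url in rows:
            # Article hrefs are site-relative, so prefix the domain.
            writer.writerow([BASE_URL + article_path, image_url])
    return path


# Hypothetical sample row in the same shape as `data` from the answer.
sample = [
    ["/world/example-article/cid/1", "https://assets.telegraphindia.com/example.jpg"],
]

if __name__ == "__main__":
    save_rows(sample)
```

With the answer's variables this would be called as `save_rows(data[1:])`. If a native .xlsx file is needed, `pandas.DataFrame.to_excel` is an alternative, though it requires pandas and the openpyxl package to be installed.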

CodePudding user response:

Use the .get() method to retrieve the links.
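
A small sketch of what this suggestion amounts to, run against an inline HTML snippet rather than the live site so it works offline. The markup and the `extract_links` helper are made up for illustration. `Tag.get()` returns `None` when an attribute is missing, so no `'href' in tag.attrs` check is needed, and the same call reads `src` or `data-src` from `img` tags.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page.
HTML = """
<div>
  <a href="/world/story-one"><img data-src="https://example.com/one.jpg"></a>
  <a name="anchor-without-href">skip me</a>
  <a href="/opinion/story-two"><img src="https://example.com/two.jpg"></a>
</div>
"""


def extract_links(html):
    """Return (article hrefs, image URLs) found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # .get() returns None for a missing attribute instead of raising KeyError.
    hrefs = [a.get("href") for a in soup.find_all("a") if a.get("href")]
    # Prefer data-src when present, then fall back to src.
    images = [img.get("data-src") or img.get("src") for img in soup.find_all("img")]
    return hrefs, images


hrefs, images = extract_links(HTML)
print(hrefs)
print(images)
```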
