I am trying to scrape a website: https://media.info/newspapers/titles. This website has a list of newspapers from A to Z. I first have to scrape all the URLs and then scrape some more information from each newspaper.
Below is my code to scrape the URLs of all the newspapers from A to Z:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()  # assuming a Chrome driver; adjust to your setup

driver.get('https://media.info/newspapers/titles')
time.sleep(2)

# collect the links to the A-Z listing pages
page_title = []
pages = driver.find_elements(By.XPATH, "//div[@class='pages']//a")
for i in pages:
    page_title.append(i.get_attribute("href"))

# visit each listing page and collect the individual newspaper URLs
names = []
for i in page_title:
    driver.get(i)
    time.sleep(1)
    name = driver.find_elements(By.XPATH, "//div[@class='info thumbBlock']//a")
    for anchor in name:
        names.append(anchor.get_attribute("href"))
len(names)   # 1688

names[0:5]
['https://media.info/newspapers/titles/abergavenny-chronicle',
 'https://media.info/newspapers/titles/abergavenny-free-press',
 'https://media.info/newspapers/titles/abergavenny-gazette-diary',
 'https://media.info/newspapers/titles/the-abingdon-herald',
 'https://media.info/newspapers/titles/academies-week']
Moving further, I need to scrape some information such as owner, postal address, email, etc., so I wrote the code below:
import requests
from bs4 import BeautifulSoup

test = []
c = 0
for i in names:
    driver.get(i)
    time.sleep(2)
    r = requests.get(i)
    soup = BeautifulSoup(r.content, 'lxml')
    try:
        name = driver.find_element(By.XPATH, "//*[@id='mainpage']/article/div[3]/h1").text
        try:
            twitter = driver.find_element(By.XPATH, "//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/a").text
        except:
            twitter = None
        try:
            twitter_followers = driver.find_element(By.XPATH, "//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/small").text.replace(' followers', '').lstrip('(').rstrip(')')
        except:
            twitter_followers = None
        people = []
        try:
            persons = driver.find_elements(By.XPATH, "//div[@class='columns']")
            for person in persons:  # distinct name so the outer loop variable i (the URL) is not overwritten
                people.append(person.text)
        except:
            people.append(None)
        try:
            owner = soup.select_one('th:contains("Owner") td').text
        except:
            owner = None
        try:
            postal_address = soup.select_one('th:contains("Postal address") td').text
        except:
            postal_address = None
        try:
            Telephone = soup.select_one('th:contains("Telephone") td').text
        except:
            Telephone = None
        try:
            company_website = soup.select_one('th:contains("Official website") td > a').get('href')
        except:
            company_website = None
        try:
            main_email = soup.select_one('th:contains("Main email") td').text
        except:
            main_email = None
        try:
            personal_email = soup.select_one('th:contains("Personal email") td').text
        except:
            personal_email = None
        r2 = requests.get(company_website)
        soup2 = BeautifulSoup(r2.content, 'lxml')
        try:
            is_wordpress = soup2.find("meta", {"name": "generator"}).get('content')
        except:
            is_wordpress = None
        news_Data = {
            "Name": name,
            "Owner": owner,
            "Postal Address": postal_address,
            "Main Email": main_email,
            "Telephone": Telephone,
            "Personal Email": personal_email,
            "Company Website": company_website,
            "Twitter_Handle": twitter,
            "Twitter_Followers": twitter_followers,
            "People": people,
            "Is Wordpress?": is_wordpress
        }
        test.append(news_Data)
        c = c + 1
        print("completed", c)
    except Exception as Argument:
        print(f"There is an exception with {i}")
        pass
I am using both Selenium and BeautifulSoup with requests to scrape the data. The code is fulfilling the requirements.
- Firstly, is it good practice to use Selenium and BeautifulSoup together in the same code like this?
- Secondly, the code is taking too much time. Is there an alternative way to reduce its runtime?
CodePudding user response:
BeautifulSoup is not slow: making requests and waiting for responses is slow.
You do not necessarily need a selenium/chromedriver setup for this task; it's doable with requests (or another Python HTTP library).
Yes, there are ways to speed it up. However, keep in mind you are making requests to a server, which might become overwhelmed if you send too many requests at once, or which might block you.
Here is an example without selenium which will accomplish what you're after:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from tqdm import tqdm

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

s = requests.Session()
s.headers.update(headers)

r = s.get('https://media.info/newspapers/titles')
soup = bs(r.text)
letter_links = [x.get('href') for x in soup.select_one('div.pages').select('a')]

newspaper_links = []
for x in tqdm(letter_links):
    soup = bs(s.get(x).text)
    ns_links = soup.select_one('div.columns').select('a')
    for n in ns_links:
        newspaper_links.append((n.get_text(strip=True), 'https://media.info/' + n.get('href')))

detailed_infos = []
for x in tqdm(newspaper_links[:50]):
    soup = bs(s.get(x[1]).text)
    owner = soup.select_one('th:contains("Owner")').next_sibling.select_one('a').get_text(strip=True) if soup.select_one('th:contains("Owner")') else None
    website = soup.select_one('th:contains("Official website")').next_sibling.select_one('a').get_text(strip=True) if soup.select_one('th:contains("Official website")') else None
    detailed_infos.append((x[0], x[1], owner, website))

df = pd.DataFrame(detailed_infos, columns=['Newspaper', 'Info Url', 'Owner', 'Official website'])
print(df)
Result in terminal:
Newspaper Info Url Owner Official website
0 Abergavenny Chronicle https://media.info//newspapers/titles/abergavenny-chronicle Tindle Newspapers abergavenny-chronicle-today.co.uk
1 Abergavenny Free Press https://media.info//newspapers/titles/abergavenny-free-press Newsquest Media Group freepressseries.co.uk
2 Abergavenny Gazette & Diary https://media.info//newspapers/titles/abergavenny-gazette-diary Tindle Newspapers abergavenny-chronicle-today.co.uk/tn/index.cfm
3 The Abingdon Herald https://media.info//newspapers/titles/the-abingdon-herald Newsquest Media Group abingdonherald.co.uk
4 Academies Week https://media.info//newspapers/titles/academies-week None academiesweek.co.uk
5 Accrington Observer https://media.info//newspapers/titles/accrington-observer Reach plc accringtonobserver.co.uk
6 Addlestone and Byfleet Review https://media.info//newspapers/titles/addlestone-and-byfleet-review Reach plc woking.co.uk
7 Admart & North Devon Diary https://media.info//newspapers/titles/admart-north-devon-diary Tindle Newspapers admart.me.uk
8 AdNews Willenhall, Wednesbury and Darlaston https://media.info//newspapers/titles/adnews-willenhall-wednesbury-and-darlaston Reach plc reachplc.com
9 The Advertiser https://media.info//newspapers/titles/the-advertiser DMGT dmgt.co.uk
10 Aintree and Maghull Champion https://media.info//newspapers/titles/aintree-and-maghull-champion Champion Media group champnews.com
11 Airdrie & Coatbridge World https://media.info//newspapers/titles/airdrie-coatbridge-world Reach plc icLanarkshire.co.uk
12 Airdrie and Coatbridge Advertiser https://media.info//newspapers/titles/airdrie-and-coatbridge-advertiser Reach plc acadvertiser.co.uk
13 Aire Valley Target https://media.info//newspapers/titles/aire-valley-target Newsquest Media Group thisisbradford.co.uk
14 Alcester Chronicle https://media.info//newspapers/titles/alcester-chronicle Newsquest Media Group redditchadvertiser.co.uk/news/alcester
15 Alcester Standard https://media.info//newspapers/titles/alcester-standard Bullivant Media redditchstandard.co.uk
16 Aldershot Courier https://media.info//newspapers/titles/aldershot-courier Guardian Media Group aldershot.co.uk
17 Aldershot Mail https://media.info//newspapers/titles/aldershot-mail Guardian Media Group aldershot.co.uk
18 Aldershot News & Mail https://media.info//newspapers/titles/aldershot-news-mail Reach plc gethampshire.co.uk/aldershot
19 Alford Standard https://media.info//newspapers/titles/alford-standard JPI Media skegnessstandard.co.uk
20 Alford Target https://media.info//newspapers/titles/alford-target DMGT dmgt.co.uk
21 Alfreton and Ripley Echo https://media.info//newspapers/titles/alfreton-and-ripley-echo JPI Media jpimedia.co.uk
22 Alfreton Chad https://media.info//newspapers/titles/alfreton-chad JPI Media chad.co.uk
23 All at Sea https://media.info//newspapers/titles/all-at-sea None allatsea.co.uk
24 Allanwater News https://media.info//newspapers/titles/allanwater-news HUB Media allanwaternews.co.uk
25 Alloa & Hillfoots Shopper https://media.info//newspapers/titles/alloa-hillfoots-shopper Reach plc reachplc.com
26 Alloa & Hillfoots Advertiser https://media.info//newspapers/titles/alloa-hillfoots-advertiser Dunfermline Press Group alloaadvertiser.com
27 Alloa and Hillfoots Wee County News https://media.info//newspapers/titles/alloa-and-hillfoots-wee-county-news HUB Media wee-county-news.co.uk
28 Alton Diary https://media.info//newspapers/titles/alton-diary Tindle Newspapers tindlenews.co.uk
29 Andersonstown News https://media.info//newspapers/titles/andersonstown-news Belfast Media Group irelandclick.com
30 Andover Advertiser https://media.info//newspapers/titles/andover-advertiser Newsquest Media Group andoveradvertiser.co.uk
31 Anfield and Walton Star https://media.info//newspapers/titles/anfield-and-walton-star Reach plc icliverpool.co.uk
32 The Anglo-Celt https://media.info//newspapers/titles/the-anglo-celt None anglocelt.ie
33 Annandale Herald https://media.info//newspapers/titles/annandale-herald Dumfriesshire Newspaper Group dng24.co.uk
34 Annandale Observer https://media.info//newspapers/titles/annandale-observer Dumfriesshire Newspaper Group dng24.co.uk
35 Antrim Times https://media.info//newspapers/titles/antrim-times JPI Media antrimtoday.co.uk
36 Arbroath Herald https://media.info//newspapers/titles/arbroath-herald JPI Media arbroathherald.com
37 The Arden Observer https://media.info//newspapers/titles/the-arden-observer Bullivant Media ardenobserver.co.uk
38 Ardrossan & Saltcoats Herald https://media.info//newspapers/titles/ardrossan-saltcoats-herald Newsquest Media Group ardrossanherald.com
39 The Argus https://media.info//newspapers/titles/the-argus Newsquest Media Group theargus.co.uk
40 Argyllshire Advertiser https://media.info//newspapers/titles/argyllshire-advertiser Oban Times Group argyllshireadvertiser.co.uk
41 Armthorpe Community Newsletter https://media.info//newspapers/titles/armthorpe-community-newsletter JPI Media jpimedia.co.uk
42 The Arran Banner https://media.info//newspapers/titles/the-arran-banner Oban Times Group arranbanner.co.uk
43 The Arran Voice https://media.info//newspapers/titles/the-arran-voice Independent News Ltd voiceforarran.com
44 The Art Newspaper https://media.info//newspapers/titles/the-art-newspaper None theartnewspaper.com
45 Ashbourne News Telegraph https://media.info//newspapers/titles/ashbourne-news-telegraph Reach plc ashbournenewstelegraph.co.uk
46 Ashby Echo https://media.info//newspapers/titles/ashby-echo Reach plc reachplc.com
47 Ashby Mail https://media.info//newspapers/titles/ashby-mail DMGT thisisleicestershire.co.uk
48 Ashfield Chad https://media.info//newspapers/titles/ashfield-chad JPI Media chad.co.uk
49 Ashford Adscene https://media.info//newspapers/titles/ashford-adscene DMGT thisiskent.co.uk
You can extract more information for each newspaper as you wish - the above is just an example, going through the first 50 newspapers. If you want a multithreaded/async solution, I recommend you read the following and apply it to your own scenario: BeautifulSoup getting href of a list - need to simplify the script - replace multiprocessing
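If you do go the threaded route, here is a minimal sketch using concurrent.futures, assuming the headers dict and the newspaper_links list from the example above; parse_newspaper is a hypothetical helper, and the th + td adjacent-sibling selector is just one way to grab the value cell:

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup as bs

# hypothetical helper: fetch one newspaper page and pull out a couple of fields
def parse_newspaper(link):
    name, url = link
    soup = bs(requests.get(url, headers=headers).text, 'html.parser')
    owner_cell = soup.select_one('th:contains("Owner") + td')
    owner = owner_cell.get_text(strip=True) if owner_cell else None
    return (name, url, owner)

# keep max_workers low so the site isn't hit with too many parallel requests at once
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(parse_newspaper, newspaper_links[:50]))

executor.map returns the results in the same order as the input list, so they can be fed straight into a DataFrame as before.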
Lastly, Requests docs can be found here: https://requests.readthedocs.io/en/latest/
BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
For TQDM: https://pypi.org/project/tqdm/
CodePudding user response:
import string
import requests
from bs4 import BeautifulSoup

names = []
for letter in string.ascii_lowercase:
    page = requests.get("https://media.info/newspapers/titles/starting-with/{}".format(letter))
    soup = BeautifulSoup(page.content, "html.parser")
    for i in soup.find_all("a"):
        if i['href'].startswith("/newspapers/titles/"):
            names.append(i['href'])
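Note that the hrefs collected this way are relative, and the startswith filter may also match the letter-index pages themselves, so before fetching each title page you may want to deduplicate and build absolute URLs. A small sketch, assuming the names list from above:

from urllib.parse import urljoin

# drop the letter-index pages, deduplicate, and build absolute URLs
newspaper_urls = sorted({
    urljoin("https://media.info/", href)
    for href in names
    if not href.startswith("/newspapers/titles/starting-with/")
})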