I am trying to scrape a website: https://media.info/newspapers/titles. This website has a list of newspapers from A to Z. I first have to scrape all the URLs and then scrape some more information from each newspaper.
Below is my code to scrape the URLs of all the newspapers from A to Z:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()  # assuming a Chrome driver; adjust to your setup

driver.get('https://media.info/newspapers/titles')
time.sleep(2)

# collect the links to the A-Z listing pages
page_title = []
pages = driver.find_elements(By.XPATH, "//div[@class='pages']//a")
for i in pages:
    page_title.append(i.get_attribute("href"))

# visit each listing page and collect the individual newspaper URLs
names = []
for i in page_title:
    driver.get(i)
    time.sleep(1)
    name = driver.find_elements(By.XPATH, "//div[@class='info thumbBlock']//a")
    for anchor in name:
        names.append(anchor.get_attribute("href"))
len(names)   # 1688

names[0:5]
['https://media.info/newspapers/titles/abergavenny-chronicle',
 'https://media.info/newspapers/titles/abergavenny-free-press',
 'https://media.info/newspapers/titles/abergavenny-gazette-diary',
 'https://media.info/newspapers/titles/the-abingdon-herald',
 'https://media.info/newspapers/titles/academies-week']
Moving further, I need to scrape some information such as owner, postal address, email, etc., so I wrote the code below:
import requests
from bs4 import BeautifulSoup

test = []
c = 0
for i in names:
    driver.get(i)
    time.sleep(2)
    r = requests.get(i)
    soup = BeautifulSoup(r.content, 'lxml')
    try:
        name = driver.find_element(By.XPATH, "//*[@id='mainpage']/article/div[3]/h1").text
        try:
            twitter = driver.find_element(By.XPATH, "//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/a").text
        except:
            twitter = None
        try:
            twitter_followers = driver.find_element(By.XPATH, "//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/small").text.replace(' followers', '').lstrip('(').rstrip(')')
        except:
            twitter_followers = None
        people = []
        try:
            persons = driver.find_elements(By.XPATH, "//div[@class='columns']")
            for person in persons:  # distinct name so the outer loop variable i (the URL) is not overwritten
                people.append(person.text)
        except:
            people.append(None)
        try:
            owner = soup.select_one('th:contains("Owner") td').text
        except:
            owner = None
        try:
            postal_address = soup.select_one('th:contains("Postal address") td').text
        except:
            postal_address = None
        try:
            Telephone = soup.select_one('th:contains("Telephone") td').text
        except:
            Telephone = None
        try:
            company_website = soup.select_one('th:contains("Official website") td > a').get('href')
        except:
            company_website = None
        try:
            main_email = soup.select_one('th:contains("Main email") td').text
        except:
            main_email = None
        try:
            personal_email = soup.select_one('th:contains("Personal email") td').text
        except:
            personal_email = None
        r2 = requests.get(company_website)
        soup2 = BeautifulSoup(r2.content, 'lxml')
        try:
            is_wordpress = soup2.find("meta", {"name": "generator"}).get('content')
        except:
            is_wordpress = None
        news_Data = {
            "Name": name,
            "Owner": owner,
            "Postal Address": postal_address,
            "Main Email": main_email,
            "Telephone": Telephone,
            "Personal Email": personal_email,
            "Company Website": company_website,
            "Twitter_Handle": twitter,
            "Twitter_Followers": twitter_followers,
            "People": people,
            "Is Wordpress?": is_wordpress
        }
        test.append(news_Data)
        c = c + 1
        print("completed", c)
    except Exception as Argument:
        print(f"There is an exception with {i}")
        pass
I am using both Selenium and BeautifulSoup with requests to scrape the data. The code is fulfilling the requirements.
- Firstly, is it good practice to use Selenium and BeautifulSoup together in the same code like this?
- Secondly, the code is taking too much time. Is there an alternative way to reduce its runtime?
CodePudding user response:
BeautifulSoup is not slow: making requests and waiting for responses is slow.
You do not necessarily need a selenium/chromedriver setup for this task; it's doable with requests (or another Python HTTP library).
Yes, there are ways to speed it up. However, keep in mind you are making requests to a server, which might become overwhelmed if you send too many requests at once, or which might block you.
Here is an example without selenium which will accomplish what you're after:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from tqdm import tqdm

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

s = requests.Session()
s.headers.update(headers)

r = s.get('https://media.info/newspapers/titles')
soup = bs(r.text)
letter_links = [x.get('href') for x in soup.select_one('div.pages').select('a')]

newspaper_links = []
for x in tqdm(letter_links):
    soup = bs(s.get(x).text)
    ns_links = soup.select_one('div.columns').select('a')
    for n in ns_links:
        newspaper_links.append((n.get_text(strip=True), 'https://media.info/' + n.get('href')))

detailed_infos = []
for x in tqdm(newspaper_links[:50]):
    soup = bs(s.get(x[1]).text)
    owner = soup.select_one('th:contains("Owner")').next_sibling.select_one('a').get_text(strip=True) if soup.select_one('th:contains("Owner")') else None
    website = soup.select_one('th:contains("Official website")').next_sibling.select_one('a').get_text(strip=True) if soup.select_one('th:contains("Official website")') else None
    detailed_infos.append((x[0], x[1], owner, website))

df = pd.DataFrame(detailed_infos, columns=['Newspaper', 'Info Url', 'Owner', 'Official website'])
print(df)
Result in terminal:
Newspaper Info Url Owner Official website
0 Abergavenny Chronicle https://media.info//newspapers/titles/abergavenny-chronicle Tindle Newspapers abergavenny-chronicle-today.co.uk
1 Abergavenny Free Press https://media.info//newspapers/titles/abergavenny-free-press Newsquest Media Group freepressseries.co.uk
2 Abergavenny Gazette & Diary https://media.info//newspapers/titles/abergavenny-gazette-diary Tindle Newspapers abergavenny-chronicle-today.co.uk/tn/index.cfm
3 The Abingdon Herald https://media.info//newspapers/titles/the-abingdon-herald Newsquest Media Group abingdonherald.co.uk
4 Academies Week https://media.info//newspapers/titles/academies-week None academiesweek.co.uk
5 Accrington Observer https://media.info//newspapers/titles/accrington-observer Reach plc accringtonobserver.co.uk
6 Addlestone and Byfleet Review https://media.info//newspapers/titles/addlestone-and-byfleet-review Reach plc woking.co.uk
7 Admart & North Devon Diary https://media.info//newspapers/titles/admart-north-devon-diary Tindle Newspapers admart.me.uk
8 AdNews Willenhall, Wednesbury and Darlaston https://media.info//newspapers/titles/adnews-willenhall-wednesbury-and-darlaston Reach plc reachplc.com
9 The Advertiser https://media.info//newspapers/titles/the-advertiser DMGT dmgt.co.uk
10 Aintree and Maghull Champion https://media.info//newspapers/titles/aintree-and-maghull-champion Champion Media group champnews.com
11 Airdrie & Coatbridge World https://media.info//newspapers/titles/airdrie-coatbridge-world Reach plc icLanarkshire.co.uk
12 Airdrie and Coatbridge Advertiser https://media.info//newspapers/titles/airdrie-and-coatbridge-advertiser Reach plc acadvertiser.co.uk
13 Aire Valley Target https://media.info//newspapers/titles/aire-valley-target Newsquest Media Group thisisbradford.co.uk
14 Alcester Chronicle https://media.info//newspapers/titles/alcester-chronicle Newsquest Media Group redditchadvertiser.co.uk/news/alcester
15 Alcester Standard https://media.info//newspapers/titles/alcester-standard Bullivant Media redditchstandard.co.uk
16 Aldershot Courier https://media.info//newspapers/titles/aldershot-courier Guardian Media Group aldershot.co.uk
17 Aldershot Mail https://media.info//newspapers/titles/aldershot-mail Guardian Media Group aldershot.co.uk
18 Aldershot News & Mail https://media.info//newspapers/titles/aldershot-news-mail Reach plc gethampshire.co.uk/aldershot
19 Alford Standard https://media.info//newspapers/titles/alford-standard JPI Media skegnessstandard.co.uk
20 Alford Target https://media.info//newspapers/titles/alford-target DMGT dmgt.co.uk
21 Alfreton and Ripley Echo https://media.info//newspapers/titles/alfreton-and-ripley-echo JPI Media jpimedia.co.uk
22 Alfreton Chad https://media.info//newspapers/titles/alfreton-chad JPI Media chad.co.uk
23 All at Sea https://media.info//newspapers/titles/all-at-sea None allatsea.co.uk
24 Allanwater News https://media.info//newspapers/titles/allanwater-news HUB Media allanwaternews.co.uk
25 Alloa & Hillfoots Shopper https://media.info//newspapers/titles/alloa-hillfoots-shopper Reach plc reachplc.com
26 Alloa & Hillfoots Advertiser https://media.info//newspapers/titles/alloa-hillfoots-advertiser Dunfermline Press Group alloaadvertiser.com
27 Alloa and Hillfoots Wee County News https://media.info//newspapers/titles/alloa-and-hillfoots-wee-county-news HUB Media wee-county-news.co.uk
28 Alton Diary https://media.info//newspapers/titles/alton-diary Tindle Newspapers tindlenews.co.uk
29 Andersonstown News https://media.info//newspapers/titles/andersonstown-news Belfast Media Group irelandclick.com
30 Andover Advertiser https://media.info//newspapers/titles/andover-advertiser Newsquest Media Group andoveradvertiser.co.uk
31 Anfield and Walton Star https://media.info//newspapers/titles/anfield-and-walton-star Reach plc icliverpool.co.uk
32 The Anglo-Celt https://media.info//newspapers/titles/the-anglo-celt None anglocelt.ie
33 Annandale Herald https://media.info//newspapers/titles/annandale-herald Dumfriesshire Newspaper Group dng24.co.uk
34 Annandale Observer https://media.info//newspapers/titles/annandale-observer Dumfriesshire Newspaper Group dng24.co.uk
35 Antrim Times https://media.info//newspapers/titles/antrim-times JPI Media antrimtoday.co.uk
36 Arbroath Herald https://media.info//newspapers/titles/arbroath-herald JPI Media arbroathherald.com
37 The Arden Observer https://media.info//newspapers/titles/the-arden-observer Bullivant Media ardenobserver.co.uk
38 Ardrossan & Saltcoats Herald https://media.info//newspapers/titles/ardrossan-saltcoats-herald Newsquest Media Group ardrossanherald.com
39 The Argus https://media.info//newspapers/titles/the-argus Newsquest Media Group theargus.co.uk
40 Argyllshire Advertiser https://media.info//newspapers/titles/argyllshire-advertiser Oban Times Group argyllshireadvertiser.co.uk
41 Armthorpe Community Newsletter https://media.info//newspapers/titles/armthorpe-community-newsletter JPI Media jpimedia.co.uk
42 The Arran Banner https://media.info//newspapers/titles/the-arran-banner Oban Times Group arranbanner.co.uk
43 The Arran Voice https://media.info//newspapers/titles/the-arran-voice Independent News Ltd voiceforarran.com
44 The Art Newspaper https://media.info//newspapers/titles/the-art-newspaper None theartnewspaper.com
45 Ashbourne News Telegraph https://media.info//newspapers/titles/ashbourne-news-telegraph Reach plc ashbournenewstelegraph.co.uk
46 Ashby Echo https://media.info//newspapers/titles/ashby-echo Reach plc reachplc.com
47 Ashby Mail https://media.info//newspapers/titles/ashby-mail DMGT thisisleicestershire.co.uk
48 Ashfield Chad https://media.info//newspapers/titles/ashfield-chad JPI Media chad.co.uk
49 Ashford Adscene https://media.info//newspapers/titles/ashford-adscene DMGT thisiskent.co.uk
You can extract more information for each newspaper as you wish - the above is just an example, going through the first 50 newspapers. If you want a multithreaded/async solution, I recommend you read the following and apply it to your own scenario: BeautifulSoup getting href of a list - need to simplify the script - replace multiprocessing
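If you do go the threaded route, here is a minimal sketch using concurrent.futures, assuming the headers dict and the newspaper_links list from the example above; parse_newspaper is a hypothetical helper, and the th + td adjacent-sibling selector is just one way to grab the value cell:

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup as bs

# hypothetical helper: fetch one newspaper page and pull out a couple of fields
def parse_newspaper(link):
    name, url = link
    soup = bs(requests.get(url, headers=headers).text, 'html.parser')
    owner_cell = soup.select_one('th:contains("Owner") + td')
    owner = owner_cell.get_text(strip=True) if owner_cell else None
    return (name, url, owner)

# keep max_workers low so the site isn't hit with too many parallel requests at once
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(parse_newspaper, newspaper_links[:50]))

executor.map returns the results in the same order as the input list, so they can be fed straight into a DataFrame as before.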
Lastly, Requests docs can be found here: https://requests.readthedocs.io/en/latest/
BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
For TQDM: https://pypi.org/project/tqdm/
CodePudding user response:
import string
import requests
from bs4 import BeautifulSoup

names = []
for letter in string.ascii_lowercase:
    page = requests.get("https://media.info/newspapers/titles/starting-with/{}".format(letter))
    soup = BeautifulSoup(page.content, "html.parser")
    for i in soup.find_all("a"):
        if i['href'].startswith("/newspapers/titles/"):
            names.append(i['href'])
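Note that the hrefs collected this way are relative, and the startswith filter may also match the letter-index pages themselves, so before fetching each title page you may want to deduplicate and build absolute URLs. A small sketch, assuming the names list from above:

from urllib.parse import urljoin

# drop the letter-index pages, deduplicate, and build absolute URLs
newspaper_urls = sorted({
    urljoin("https://media.info/", href)
    for href in names
    if not href.startswith("/newspapers/titles/starting-with/")
})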