I'm trying to extract links from a website using Beautiful Soup. The website link is https://www.thehindu.com/search/?q=central vista&sort=relevance&start=#gsc.tab=0&gsc.q=central vista&gsc.page=1
The code I used is given below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thehindu.com/search/?q=central vista&sort=relevance&start=#gsc.tab=0&gsc.q=central vista&gsc.page=1'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
# Print and collect the href of every <a> tag in the downloaded HTML
for link in soup.find_all('a'):
    print(link.get('href'))
    urls.append(link.get('href'))
The code runs and prints all the URLs present on the page except the ones in the Google search results section, which are the ones I need. I'm stuck. Can someone help me sort this out?
CodePudding user response:
The data you see is loaded with JavaScript, so BeautifulSoup doesn't see it. You can use the requests, re, and json modules to get the data:
import re
import json
import requests
# Google Custom Search endpoint that returns the search results
url = "https://cse.google.com/cse/element/v1"
# Query parameters copied from the browser's network request;
# values such as cse_tok may need to be refreshed if the request stops working
params = {
    "rsz": "filtered_cse",
    "num": "10",
    "hl": "sk",
    "source": "gcsc",
    "gss": ".com",
    "cselibv": "f275a300093f201a",
    "cx": "264d7caeb1ba04bfc",
    "q": "central vista",
    "safe": "active",
    "cse_tok": "AB1-RNWPlN01WUQgebV0g3LpWU6l:1670351743367",
    "lr": "",
    "cr": "",
    "gl": "",
    "filter": "0",
    "sort": "",
    "as_oq": "",
    "as_sitesearch": "",
    "exp": "csqr,cc,4861326",
    "callback": "google.search.cse.api3099",
}
data = requests.get(url, params=params).text
# The response is wrapped in a callback call; keep only the JSON between the parentheses
data = re.search(r"(?s)\((.*)\)", data).group(1)
data = json.loads(data)
for r in data["results"]:
    print(r["url"])
Prints:
https://www.thehindu.com/news/national/estimated-cost-of-central-vista-revamp-plan-without-pmo-goes-up-to-13450-cr/article33358124.ece
https://www.thehindu.com/news/national/central-vista-project-sc-dismisses-plea-against-delhi-hc-verdict-refusing-to-halt-work/article35031575.ece
https://www.thehindu.com/opinion/editorial/monumental-hurry-on-central-vista-project/article31734021.ece
https://www.thehindu.com/news/national/central-vista-new-buildings-on-kg-marg-africa-avenue-proposed-for-relocating-govt-offices/article31702494.ece
https://www.thehindu.com/society/beyond-the-veils-of-secrecy-the-central-vista-project-is-both-the-cause-and-effect-of-its-own-multiple-failures/article32980560.ece
https://www.thehindu.com/news/national/estimated-cost-of-central-vista-revamp-plan-without-pmo-goes-up-to-13450-cr/article33358124.ece?homepage=true
https://www.thehindu.com/news/national/work-on-new-parliament-central-vista-avenue-projects-on-track/article36296821.ece
https://www.thehindu.com/news/national/2466-trees-removed-for-central-vista-projects-so-far-govt/article65665595.ece
https://www.thehindu.com/news/national/central-vista-avenue-redevelopment-project-to-be-completed-by-july-18-puri/article65611471.ece?homepage=true
https://www.thehindu.com/news/national/central-vista-jharkhand-firm-is-lowest-bidder-for-vice-president-enclave/article37310541.ece
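The regex step in the code above works because the endpoint returns JSONP: a JavaScript call to the function named in the callback parameter, with the JSON as its argument. Here is a minimal sketch of that unwrapping as two small functions, so it can be tried without a network request. The names unwrap_jsonp and result_urls and the sample response text are hypothetical and used only for illustration; a real response contains many more fields.

```python
import json
import re


def unwrap_jsonp(text):
    """Return the JSON value wrapped in a JSONP call such as callback({...});"""
    # Greedy match from the first "(" to the last ")"; DOTALL lets "." span newlines.
    match = re.search(r"\((.*)\)", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSONP wrapper found in response")
    return json.loads(match.group(1))


def result_urls(payload):
    """Collect the "url" field of every entry under "results", if present."""
    return [r["url"] for r in payload.get("results", []) if "url" in r]


# Hypothetical, shortened response body (not real output from the endpoint).
sample = """google.search.cse.api3099({
  "results": [
    {"url": "https://example.com/a"},
    {"url": "https://example.com/b"}
  ]
});"""

print(result_urls(unwrap_jsonp(sample)))
```

Using payload.get("results", []) means a response without results gives an empty list instead of raising a KeyError, which can help when a query has no matches or the request is rejected.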