Extract URLs from a website (The Hindu) which uses Google search console using Python

Time:12-07

I'm trying to extract links from a website using Beautiful Soup. The website link is https://www.thehindu.com/search/?q=central vista&sort=relevance&start=#gsc.tab=0&gsc.q=central vista&gsc.page=1

The code I used is given below:

import requests
from bs4 import BeautifulSoup

# Search results page for the query "central vista"
url = 'https://www.thehindu.com/search/?q=central vista&sort=relevance&start=#gsc.tab=0&gsc.q=central vista&gsc.page=1'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

# Collect the href of every <a> tag found in the downloaded HTML
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
    urls.append(link.get('href'))

The code runs and prints all the URLs present on the page except the ones in the Google search console results, which are the ones I need. I'm stuck. Can someone help me sort this out?

CodePudding user response:

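The results you see are loaded with JavaScript, so BeautifulSoup doesn't find them in the HTML that requests downloads. You can use the requests, re, and json modules to get the data directly from the Google Custom Search endpoint: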

import re
import json
import requests

url = "https://cse.google.com/cse/element/v1"

params = {
    "rsz": "filtered_cse",
    "num": "10",
    "hl": "sk",
    "source": "gcsc",
    "gss": ".com",
    "cselibv": "f275a300093f201a",
    "cx": "264d7caeb1ba04bfc",
    "q": "central vista",
    "safe": "active",
    "cse_tok": "AB1-RNWPlN01WUQgebV0g3LpWU6l:1670351743367",
    "lr": "",
    "cr": "",
    "gl": "",
    "filter": "0",
    "sort": "",
    "as_oq": "",
    "as_sitesearch": "",
    "exp": "csqr,cc,4861326",
    "callback": "google.search.cse.api3099",
}

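# The endpoint returns JSONP: the JSON payload is wrapped in a call to the "callback" function
data = requests.get(url, params=params).text
# Keep only what sits between the outermost parentheses, then decode it
data = re.search(r"(?s)\((.*)\)", data).group(1)
data = json.loads(data)

# Each search result carries its link under the "url" key
for r in data["results"]:
    print(r["url"])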

Prints:

https://www.thehindu.com/news/national/estimated-cost-of-central-vista-revamp-plan-without-pmo-goes-up-to-13450-cr/article33358124.ece
https://www.thehindu.com/news/national/central-vista-project-sc-dismisses-plea-against-delhi-hc-verdict-refusing-to-halt-work/article35031575.ece
https://www.thehindu.com/opinion/editorial/monumental-hurry-on-central-vista-project/article31734021.ece
https://www.thehindu.com/news/national/central-vista-new-buildings-on-kg-marg-africa-avenue-proposed-for-relocating-govt-offices/article31702494.ece
https://www.thehindu.com/society/beyond-the-veils-of-secrecy-the-central-vista-project-is-both-the-cause-and-effect-of-its-own-multiple-failures/article32980560.ece
https://www.thehindu.com/news/national/estimated-cost-of-central-vista-revamp-plan-without-pmo-goes-up-to-13450-cr/article33358124.ece?homepage=true
https://www.thehindu.com/news/national/work-on-new-parliament-central-vista-avenue-projects-on-track/article36296821.ece
https://www.thehindu.com/news/national/2466-trees-removed-for-central-vista-projects-so-far-govt/article65665595.ece
https://www.thehindu.com/news/national/central-vista-avenue-redevelopment-project-to-be-completed-by-july-18-puri/article65611471.ece?homepage=true
https://www.thehindu.com/news/national/central-vista-jharkhand-firm-is-lowest-bidder-for-vice-president-enclave/article37310541.ece
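
Two details of the approach above can be tried offline. The response is JSONP, so the JSON has to be unwrapped from the callback call, and the printed list contains the same article twice (once with ?homepage=true). The following is only a minimal sketch: the helper names parse_jsonp and unique_article_urls, the sample response body, and its article paths are all hypothetical, and dropping the query string assumes it never identifies a different article.

```python
import json
import re
from urllib.parse import urlsplit, urlunsplit


def parse_jsonp(text):
    """Strip a JSONP wrapper such as callback({...}); and decode the payload."""
    match = re.search(r"(?s)\((.*)\)", text)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


def unique_article_urls(results):
    """Return result URLs without query string or fragment, keeping first occurrences."""
    seen = set()
    unique = []
    for item in results:
        parts = urlsplit(item["url"])
        # Rebuild the URL from scheme, host and path only
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if clean not in seen:
            seen.add(clean)
            unique.append(clean)
    return unique


# Hypothetical response body, shaped like the one the code above parses
sample = """/*O_o*/
google.search.cse.api3099({
  "results": [
    {"url": "https://www.thehindu.com/news/national/a/article1.ece"},
    {"url": "https://www.thehindu.com/news/national/a/article1.ece?homepage=true"},
    {"url": "https://www.thehindu.com/news/national/b/article2.ece"}
  ]
});"""

data = parse_jsonp(sample)
for u in unique_article_urls(data.get("results", [])):
    print(u)
```

With a live response, the text returned by requests.get(url, params=params) would be passed to parse_jsonp in place of sample.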