Web scraping a JavaScript-driven dynamic website with Python

Time:10-13

I need to scrape every article, including each article's title and paragraphs, from this page: https://portaljuridic.gencat.cat/eli/es-ct/l/2014/12/29/19

The problem is that I tried selecting div, h3 and p elements, but nothing is returned.

import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd
from tqdm import tqdm_notebook


def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")
    return parsed_response


url = "https://portaljuridic.gencat.cat/eli/es-ct/l/2014/12/29/19"

soup = parse_url(url)


article = soup.find("div", {"class":"article-document"})

article

The site seems to load its content with JavaScript, but I don't know how to retrieve it.

CodePudding user response:

The website makes three API calls to fetch the data.
The code below makes the same calls and prints the JSON each one returns.

(In the browser, press F12 and open Network -> XHR to see the API calls.)

import requests

payload1 = {'language': 'ca', 'documentId': 680124}
r1 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListTraceabilityStandard', data=payload1)
if r1.status_code == 200:
    print(r1.json())

print('------------------')

payload2 = {'documentId': 680124, 'orderBy': 'DESC', 'language': 'ca', 'traceability': '02'}
r2 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListValidityByDocument', data=payload2)
if r2.status_code == 200:
    print(r2.json())

print('------------------')

payload3 = {'documentId': 680124, 'traceabilityStandard': '02', 'language': 'ca'}
r3 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/documentPJC', data=payload3)
if r3.status_code == 200:
    print(r3.json())
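
Once you have the JSON, you still need to find where the article text lives inside it. The structure of the responses is not shown here, so the following is only a rough sketch: a hypothetical helper, find_html_fields, that walks any decoded JSON value and reports the string fields that look like they contain markup. The sample dictionary is invented for illustration and is not the real shape of the documentPJC response.

```python
from typing import Any, Iterator, Tuple


def find_html_fields(node: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, value) for each string in a decoded JSON value that looks like HTML."""
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else str(key)
            yield from find_html_fields(value, child_path)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_html_fields(value, f"{path}[{index}]")
    elif isinstance(node, str) and "<" in node and ">" in node:
        # Crude heuristic: a string containing angle brackets is probably markup.
        yield path, node


# Invented sample data; the real API response may be structured differently.
sample = {
    "id": 680124,
    "title": "Example title",
    "versions": [{"language": "ca", "body": "<h3>Article 1</h3><p>Text</p>"}],
}

html_fields = list(find_html_fields(sample))
for field_path, value in html_fields:
    print(field_path, "->", value[:40])
```

With the real data you would pass r3.json() instead of sample, then feed whichever field holds the markup to BeautifulSoup, as in the question's parse_url.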