I am trying to extract some information from a PDF embedded in a web page, using Python and requests. This is exactly the phrase I want to reach: « Sciences de la vie et de l’environnement ».
Here is the code I wrote:
import time
import requests
from bs4 import BeautifulSoup

# website to scrape
url = "https://fs.uit.ac.ma/avis-de-soutenance-dune-these-de-doctorat-mme-achachi-hind/"
with requests.session() as s:
    # fetch the page (certificate verification is disabled)
    html_content = s.get(url, verify=False)
    # parse the HTML content
    soup = BeautifulSoup(html_content.content, "html.parser")
    # the first iframe on the page holds the embedded viewer
    url2 = soup.iframe["src"]
    html_doc = s.get(url2, verify=False).text
print(html_doc)
Here is part of what print(html_doc) outputs.
When comparing the two pictures, I can't see what is inside this element in the last picture:
<div id="viewer" ></div>
The text I want should be inside this element.
Answer:
You can access the PDF directly (https://fs.uit.ac.ma/wp-content/uploads/2022/02/AVIS-DE-SOUTENANCE-ACHACHI-HIND.pdf). That URL appears in the iframe and in the request. If there is no way to get the URL from the source code, you have to capture the requests the browser makes (e.g. with BrowserMob).
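As a rough illustration of that approach, here is a minimal sketch that resolves the PDF URL from an iframe src and then searches the PDF's text. It rests on assumptions that are not confirmed by the page itself: that the iframe src is either a direct PDF link or a viewer page receiving the PDF location in a "file" query parameter, that the pypdf package is installed, and that the PDF has a real text layer (a scanned PDF would need OCR instead). The function names pdf_url_from_iframe_src and pdf_contains are hypothetical helpers made up for this example.

```python
import io
from urllib.parse import parse_qs, urljoin, urlparse


def pdf_url_from_iframe_src(src, page_url):
    """Return the PDF URL referenced by an iframe's src attribute.

    Assumption: src is either a direct link to the PDF or a viewer page
    that receives the PDF location in a "file" query parameter.
    """
    absolute = urljoin(page_url, src)
    params = parse_qs(urlparse(absolute).query)
    if "file" in params:
        # parse_qs has already percent-decoded the value
        return urljoin(absolute, params["file"][0])
    return absolute


def pdf_contains(pdf_url, phrase):
    """Download the PDF and report whether its extracted text contains phrase."""
    # Third-party imports are local so the helper above needs only the stdlib.
    import requests
    from pypdf import PdfReader

    # The question's code passes verify=False; add it here only if the
    # site's certificate cannot be verified.
    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()

    reader = PdfReader(io.BytesIO(response.content))
    text = " ".join((page.extract_text() or "") for page in reader.pages)
    # Collapse line breaks and repeated spaces before searching
    return phrase in " ".join(text.split())
```

With the question's code, this could be called as pdf_contains(pdf_url_from_iframe_src(soup.iframe["src"], url), "Sciences de la vie et de l’environnement"). Note that the match is exact, so the apostrophe in the phrase has to be the same character the PDF uses.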