I've edited my question so that it explains my problem better.
I'm trying to get the first and last name of a person from the ORCID database (a database of scientific articles and authors).
I use requests_html and .render() to access the URL "https://orcid.org/orcid-search/search?searchQuery=0000-0001-9077-1041" and get the HTML out of it. The HTML is parsed and stored in the _text list. (If you open the URL, you'll see that it contains the ORCID search results for the ID "0000-0001-9077-1041": first name "Andreas" and last name "Leimbach", as well as some additional data.)
I want to retrieve the first and last name from the HTML of that page. However, when I run the program multiple times, the first and last name sometimes appear in the output and sometimes do not. I expect the program to always retrieve the same data.
I use the following Python script:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

def GetCredentialsFromORCID(_id):
    base_url = "https://orcid.org/orcid-search/search?searchQuery=" + _id
    session = HTMLSession()
    response = session.get(base_url)
    # Execute the page's JavaScript in a headless browser before parsing
    response.html.render()
    soup = BeautifulSoup(response.html.html, 'lxml')
    # Split the visible page text into whitespace-separated tokens
    _text = soup.get_text().strip().split()
    print("This is what we got:\n", _text)

GetCredentialsFromORCID("0000-0001-9077-1041")
(Try running this code a few times, say 5 to 10, and see for yourself.)
I can only assume that it has something to do with the fact that this page uses JavaScript, because I keep receiving:
Please enable JavaScript to continue using this application.
in the console, but I don't know much about it.
Can anyone help me with that?
CodePudding user response:
The webpage actually goes on to run an expanded search after the initial search. You can rewrite your code to use that expanded search as the initial call, and then you only need requests. You can certainly rework the example below; it is simply structured like your original in that it accepts an ID and returns a response. Minimal error handling is included.
import requests

def GetCredentialsFromORCID(_id):
    # Query the public expanded-search endpoint directly; no JavaScript rendering needed
    r = requests.get(f'https://pub.orcid.org/v3.0/expanded-search/?start=0&rows=200&q=orcid:{_id}',
                     headers={'User-Agent': 'Mozilla/5.0', 'accept': 'application/json'})
    try:
        return r.json()
    except Exception as e:
        # The response body was not valid JSON (for example, an error page)
        return (f'Error for {_id}', e)

print(GetCredentialsFromORCID("0000-0001-9077-1041"))
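Since the question asks for just the first and last name, here is a minimal sketch of pulling them out of the JSON. It assumes the response has an "expanded-result" list whose items carry "given-names" and "family-names" keys; check that against what the endpoint actually returns for you. The helper name extract_names and the trimmed sample payload are made up for illustration.

```python
def extract_names(payload):
    """Return (given, family) name pairs from an expanded-search payload."""
    # The function above returns a tuple on error, so guard against non-dict input.
    if not isinstance(payload, dict):
        return []
    # Assumed key: "expanded-result" may be missing or null when nothing matches.
    results = payload.get("expanded-result") or []
    return [(item.get("given-names"), item.get("family-names")) for item in results]


# Hypothetical, trimmed-down payload that only illustrates the assumed shape.
sample = {
    "expanded-result": [
        {
            "orcid-id": "0000-0001-9077-1041",
            "given-names": "Andreas",
            "family-names": "Leimbach",
        }
    ],
    "num-found": 1,
}

print(extract_names(sample))
```

With the function from the answer you would call extract_names(GetCredentialsFromORCID("0000-0001-9077-1041")). If the request fails and a tuple comes back instead of a dict, the helper simply returns an empty list.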