Extracting credentials from ORCID search with ORCID iD using Python

Time:02-14

I have edited my question so that it explains my problem better.

I'm trying to get the first name and last name of a person from the ORCID database (a database of scientific articles and authors).

I use requests_html and .render() to access the URL

"https://orcid.org/orcid-search/search?searchQuery=0000-0001-9077-1041"

and get the HTML out of it. The HTML is parsed and stored in the _text list. (If you open the URL, you'll see that it contains the ORCID search results for the id "0000-0001-9077-1041": first name "Andreas", last name "Leimbach", plus some additional data.)

I want to retrieve the first name and last name from the HTML of that page. However, when I run the program multiple times, the names sometimes appear in the output and sometimes do not. I expect the program to always retrieve the same data.

I use the following Python script:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

def GetCredentialsFromORCID(_id):
    base_url = "https://orcid.org/orcid-search/search?searchQuery=" + _id
    session = HTMLSession()
    response = session.get(base_url)
    # Run the page's JavaScript in headless Chromium before reading the HTML
    response.html.render()

    # Flatten the rendered HTML into a list of whitespace-separated tokens
    soup = BeautifulSoup(response.html.html, 'lxml')
    _text = soup.get_text().strip().split()
    print("This is what we got:\n", _text)

GetCredentialsFromORCID("0000-0001-9077-1041")

(Try running this code a few times, say 5 to 10, and see for yourself.)

I can only assume that it has something to do with the fact that this page uses JavaScript, because I keep receiving:

Please enable JavaScript to continue using this application.

in the console, but I don't know much about it.

Can anyone help me with that?

CodePudding user response:

After the initial page load, the webpage goes on to run an expanded search against the ORCID public API. You can rewrite your code to call that expanded search directly, and then you only need requests; no JavaScript rendering is involved. Feel free to rework the example below. It is structured like your original: it accepts an id and returns the response. Only minimal error handling is included.

import requests

def GetCredentialsFromORCID(_id):
    # Call the public expanded-search endpoint directly and ask for JSON
    r = requests.get(f'https://pub.orcid.org/v3.0/expanded-search/?start=0&rows=200&q=orcid:{_id}',
                     headers={'User-Agent': 'Mozilla/5.0', 'accept': 'application/json'})
    try:
        return r.json()
    except Exception as e:
        # The body was not valid JSON (for example, an error page)
        return (f'Error for {_id}', e)

print(GetCredentialsFromORCID("0000-0001-9077-1041"))
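
To get from that JSON to just the first and last name, something along these lines should work. This is a rough sketch, not tested against every response shape: the helper name extract_name is made up for illustration, and it assumes the response holds a list under "expanded-result" whose entries carry "given-names" and "family-names" keys. Check the actual JSON you get back before relying on those keys. The sample dict below is hand-written for the demo, not a captured API response.

```python
def extract_name(payload):
    """Return (given_names, family_names) for the first search hit, or None.

    Assumes the expanded-search JSON layout described above; the key names
    are an assumption and should be checked against a real response.
    """
    # GetCredentialsFromORCID returns a tuple on error, so guard against that
    if not isinstance(payload, dict):
        return None
    # "expanded-result" may be missing or null when nothing matched
    results = payload.get("expanded-result") or []
    if not results:
        return None
    first_hit = results[0]
    return first_hit.get("given-names"), first_hit.get("family-names")


# Hand-written sample in the assumed layout (not real API output)
sample = {
    "expanded-result": [
        {
            "orcid-id": "0000-0001-9077-1041",
            "given-names": "Andreas",
            "family-names": "Leimbach",
        }
    ],
    "num-found": 1,
}

print(extract_name(sample))
```

With the function from the answer you would call it as extract_name(GetCredentialsFromORCID("0000-0001-9077-1041")). Returning None instead of raising keeps the caller simple when the id has no match or the request failed.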