I want to extract metadata for scientific publications from the OpenAlex API. However, the API does not hold complete data for every publication, so the returned JSON is not always complete. When the response is complete, my code runs without issues. When information is missing, the response can contain "institutions":[] instead of "institutions":[{"id":"https://openalex.org/I2057...}{...}], which my code cannot handle. As a result, I always get "IndexError: list index out of range".
After an extensive search, I tried to solve the problem with try/except and with if checks (I can provide these attempts if required). Unfortunately, I did not succeed.
My goal is for charlist to contain None (or null) wherever no information is available ([]). The code should also be as fast as possible, since I will send a high six-digit number of requests. This has, of course, already been cleared with the API operator.
My code below already works for complete JSON responses (the upper, commented-out magid_list) but not for incomplete entries (2301544176), as in the lower magid_list that is not commented out.
import requests
import json
import pandas as pd

baseurl = 'https://api.openalex.org/works?filter=ids.mag:'

# The upper magid_list works without problems:
#magid_list = [2301543590, 2301543835]
# The error occurs with the list below. See
# "https://api.openalex.org/works?filter=ids.mag:2301544176": no institution information is given.
magid_list = [2301543590, 2301543835, 2301544176]

def main_request(baseurl, endpoint):
    r = requests.get(baseurl + endpoint)
    return r.json()

def parse_json(response):
    charlist = []
    pupdate = data['results'][0]['publication_date']
    display_name = data['results'][0]['display_name']
    for item in response['results'][0]['authorships']:
        char = {
            'magid': str(x),
            'display_name': display_name,
            'pupdate': pupdate,
            'author': item['author']['display_name'],
            'institution_id': item['institutions'][0]['id']  # raises IndexError when 'institutions' is []
        }
        charlist.append(char)
    return charlist

finallist = []
for x in magid_list:
    print(x)
    data = main_request(baseurl, str(x))
    finallist.extend(parse_json(main_request(baseurl, str(x))))
df = pd.DataFrame(finallist)
print(df.head(), df.tail())
If I can provide further information or clarification, let me know.
Attached you can find the full IndexError Traceback:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
f:\AlexPE\__programming\Masterarbeit.ipynb Cell 153 in <cell line: 37>()
37 for x in list:
38 print(x)
---> 39 finallist.extend(parse_json(main_request(baseurl, str(x))))
41 df = pd.DataFrame(finallist)
43 #data = main_request(baseurl, endpoint)
44 #print(get_pages(data))
45 #print(parse_json(data))
f:\AlexPE\__programming\Masterarbeit.ipynb Cell 153 in parse_json(response)
20 display_name = data['results'][0]['display_name']
23 for item in response['results'][0]['authorships']:
24 char = {
25 'magid': str(x),
26 'display_name': display_name,
27 'pupdate': pupdate,
28 'author': item['author']['display_name'],
---> 29 'institution_id': item['institutions'][0]['id']
30 }
32 charlist.append(char)
33 return charlist
IndexError: list index out of range
CodePudding user response:
Check for the existence of values before attempting to access them:
def parse_json(response):
    charlist = []
    results = response.get('results') or []
    if not results:
        # No work was returned for this MAG ID, so there is nothing to parse.
        return charlist
    work = results[0]
    pupdate = work.get('publication_date')
    display_name = work.get('display_name')
    for item in work.get('authorships', []):
        # An empty list is falsy, so institution_id stays None when no institution is listed.
        institution_id = None
        if item['institutions']:
            institution_id = item['institutions'][0].get('id')
        char = {
            'magid': str(x),  # x is the global loop variable from the question's code
            'display_name': display_name,
            'pupdate': pupdate,
            'author': item['author']['display_name'],
            'institution_id': institution_id
        }
        charlist.append(char)
    return charlist
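If you would rather not depend on the global x, the same check can be sketched as a self-contained variant that takes the MAG ID as an argument. The names first_institution_id and parse_work are hypothetical, and the sample dictionary is hand-written to mimic the response shape shown in the question (its values, including the institution ID, are placeholders, not real OpenAlex data):

```python
def first_institution_id(authorship):
    """Return the first institution ID of an authorship, or None if none is listed."""
    institutions = authorship.get('institutions') or []
    return institutions[0].get('id') if institutions else None


def parse_work(response, magid):
    """Flatten one API response into one row per author (hypothetical helper)."""
    results = response.get('results') or []
    if not results:
        # Nothing was returned for this MAG ID.
        return []
    work = results[0]
    return [
        {
            'magid': str(magid),
            'display_name': work.get('display_name'),
            'pupdate': work.get('publication_date'),
            'author': (item.get('author') or {}).get('display_name'),
            'institution_id': first_institution_id(item),
        }
        for item in work.get('authorships') or []
    ]


# Hand-written sample shaped like the response in the question (placeholder values).
sample = {
    'results': [{
        'display_name': 'Example title',
        'publication_date': '2016-01-01',
        'authorships': [
            {'author': {'display_name': 'A. Author'},
             'institutions': [{'id': 'https://openalex.org/I0000000000'}]},
            {'author': {'display_name': 'B. Author'}, 'institutions': []},
        ],
    }]
}

rows = parse_work(sample, 2301544176)
print([row['institution_id'] for row in rows])
```

In the loop from the question this would be called as finallist.extend(parse_work(data, x)), which also reuses the response already stored in data instead of requesting the same URL a second time.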