I am using clinicaltrials.gov's API to obtain a list of clinical trial data into a XML file and then I am parsing the data to eventually export into an Excel dataset.
In the URL provided in my code, there are 9 results, however my code is only pulling data for 5/9. I've realized it's because that for one of the fields (detaileddescription), only some of the trials have this data. When I remove detaileddescription and just use the two other fields (nctid and briefdescription), I am able to get 9/9. What can I do here besides doing something messy like creating a separate dataframe for detaileddescription and merging?
Bottom line: I am extracting 3 fields from an XML file that includes 9 clinical trials: nctid, briefsummary, and detaileddescription, but my output is only extracting 5/9 clinical trials. How can my output get all 9/9 without taking out the detaileddescription field from my output?
import requests
from bs4 import BeautifulSoup
import pandas as pd
out = []
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes telehealth peer support& AREA[StartDate] EXPAND[Term] RANGE[01/01/2020, 09/01/2020]&min_rnk=1&max_rnk=50&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
nctids = soup.find_all("field", {"name" : "NCTId"})
briefsummaries = soup.find_all("field", {"name" : "BriefSummary"}) if soup.find_all("field", {"name" : "BriefSummary"}) is not None else 'nothing'
detaileddescriptions = soup.find_all("field", {"name" : "DetailedDescription"}) if soup.find_all("field", {"name" : "DetailedDescription"}) is not None else 'nothing'
for nctid, briefsummary, detaileddescription in zip(nctids, briefsummaries, detaileddescriptions):
data = {'nctid': nctid, 'briefsummary': briefsummary, 'detaileddescription': detaileddescription}
out.append(data)
df = pd.DataFrame(out)
df.to_excel('clinicaltrialstresults.xlsx')
CodePudding user response:
You can try looping over the study list with slight changes to your code
import requests
from bs4 import BeautifulSoup
import pandas as pd
out = []
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes telehealth peer support& AREA[StartDate] EXPAND[Term] RANGE[01/01/2020, 09/01/2020]&min_rnk=1&max_rnk=50&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
study_list = soup.find_all("fullstudy")
for study in study_list:
nctid = study.find("field", {"name" : "NCTId"})
briefsummary = study.find("field", {"name" : "BriefSummary"}) if study.find("field", {"name" : "BriefSummary"}) is not None else 'nothing'
detaileddescription = study.find("field", {"name" : "DetailedDescription"}) if study.find("field", {"name" : "DetailedDescription"}) is not None else 'nothing'
data = {'nctid': nctid, 'briefsummary': briefsummary, 'detaileddescription': detaileddescription}
out.append(data)
df = pd.DataFrame(out)
df.to_excel('clinicaltrialstresults.xlsx', index=False)