Home > Net >  Python / BeautifulSoup - Extracting XML data from clinicaltrials.gov, only able to extract studies t
Python / BeautifulSoup - Extracting XML data from clinicaltrials.gov, only able to extract studies t

Time:11-19

I am using clinicaltrials.gov's API to obtain a list of clinical trial data into a XML file and then I am parsing the data to eventually export into an Excel dataset.

In the URL provided in my code, there are 9 results, however my code is only pulling data for 5/9. I've realized it's because that for one of the fields (detaileddescription), only some of the trials have this data. When I remove detaileddescription and just use the two other fields (nctid and briefdescription), I am able to get 9/9. What can I do here besides doing something messy like creating a separate dataframe for detaileddescription and merging?

Bottom line: I am extracting 3 fields from an XML file that includes 9 clinical trials: nctid, briefsummary, and detaileddescription, but my output is only extracting 5/9 clinical trials. How can my output get all 9/9 without taking out the detaileddescription field from my output?

import requests
from bs4 import BeautifulSoup
import pandas as pd

out = []
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes telehealth peer support& AREA[StartDate] EXPAND[Term] RANGE[01/01/2020, 09/01/2020]&min_rnk=1&max_rnk=50&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
nctids = soup.find_all("field", {"name" : "NCTId"})
briefsummaries = soup.find_all("field", {"name" : "BriefSummary"}) if soup.find_all("field", {"name" : "BriefSummary"}) is not None else 'nothing'
detaileddescriptions = soup.find_all("field", {"name" : "DetailedDescription"}) if soup.find_all("field", {"name" : "DetailedDescription"}) is not None else 'nothing'

for nctid, briefsummary, detaileddescription in zip(nctids, briefsummaries, detaileddescriptions):
    
    data = {'nctid': nctid, 'briefsummary': briefsummary, 'detaileddescription': detaileddescription}
    out.append(data)
df = pd.DataFrame(out)

df.to_excel('clinicaltrialstresults.xlsx')

CodePudding user response:

You can try looping over the study list with slight changes to your code

import requests
from bs4 import BeautifulSoup
import pandas as pd


out = []
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes telehealth peer support& AREA[StartDate] EXPAND[Term] RANGE[01/01/2020, 09/01/2020]&min_rnk=1&max_rnk=50&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
study_list = soup.find_all("fullstudy")

for study in study_list:
    nctid = study.find("field", {"name" : "NCTId"})
    briefsummary = study.find("field", {"name" : "BriefSummary"}) if study.find("field", {"name" : "BriefSummary"}) is not None else 'nothing'
    detaileddescription = study.find("field", {"name" : "DetailedDescription"}) if study.find("field", {"name" : "DetailedDescription"}) is not None else 'nothing'
    data = {'nctid': nctid, 'briefsummary': briefsummary, 'detaileddescription': detaileddescription}
    out.append(data)

df = pd.DataFrame(out)
df.to_excel('clinicaltrialstresults.xlsx', index=False)
  • Related