Python / BeautifulSoup - Scraping XML data from Clinicaltrials.gov API

I'm new to working with XML and BeautifulSoup, and I'm trying to build a dataset of clinical trials using Clinicaltrials.gov's new API, which returns a list of trials as XML. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy), but I don't want to clutter my code with failed attempts.

Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!

import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html

url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes telehealth peer support& AREA[StartDate] EXPAND[Term] RANGE[01/01/2020, 09/01/2020]&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results

Answer:

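find_all() treats its first argument as a tag name, so 'Field Name="NCTId"' is looked up as one literal tag name and matches nothing. In addition, the lxml HTML parser lowercases tag and attribute names. Search for the lowercase tag and filter on the attribute with a dictionary, like this: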

m1_nctid = soup.find_all("field", {"name": "NCTId"})
m1_officialtitle = soup.find_all("field", {"name": "OfficialTitle"})

Then iterate over the results to get the text of each one, for example:

official_titles = [result.text for result in m1_officialtitle]

For more information, see the BeautifulSoup documentation on find_all().
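To see why the case of the names matters, here is a minimal sketch that parses the same markup with both the "lxml" (HTML) parser and the "xml" parser. SAMPLE_XML is a hypothetical, heavily trimmed stand-in for the API response, not real data, and the sketch assumes both bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample; the real response contains many more elements.
SAMPLE_XML = """<FullStudiesResponse>
  <FullStudy Rank="1">
    <Field Name="NCTId">NCT00000001</Field>
    <Field Name="OfficialTitle">Hypothetical Trial One</Field>
  </FullStudy>
  <FullStudy Rank="2">
    <Field Name="NCTId">NCT00000002</Field>
    <Field Name="OfficialTitle">Hypothetical Trial Two</Field>
  </FullStudy>
</FullStudiesResponse>"""

# "lxml" is an HTML parser: tag and attribute names are lowercased,
# so the original mixed-case names no longer match anything.
html_soup = BeautifulSoup(SAMPLE_XML, "lxml")
no_match = html_soup.find_all("Field", attrs={"Name": "NCTId"})
html_ids = [
    field.get_text(strip=True)
    for field in html_soup.find_all("field", attrs={"name": "NCTId"})
]

# "xml" selects lxml's XML parser: names keep their original case.
xml_soup = BeautifulSoup(SAMPLE_XML, "xml")
xml_ids = [
    field.get_text(strip=True)
    for field in xml_soup.find_all("Field", attrs={"Name": "NCTId"})
]

print(len(no_match))
print(html_ids)
print(xml_ids)
```

Either parser works as long as the names in the search match the case the parser produces. Attribute values such as "NCTId" keep their case in both.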

Answer:

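You can search for the field tag in lowercase and pass name as a key in the attrs dictionary. It has to go in attrs because name is already the first parameter of find_all() (the tag name), so it cannot be used as a keyword filter. This works with BeautifulSoup alone; there's no need to use etree: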

import requests
from bs4 import BeautifulSoup


url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes telehealth peer support& AREA[StartDate] EXPAND[Term] RANGE[01/01/2020, 09/01/2020]&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
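The two lists above are independent, so an ID and a title only line up if every trial has both fields. One possible way to keep them paired is to loop over each study element and look up both fields inside it. This is a sketch under assumptions: the FullStudy wrapper element, the extract_trials helper, and the sample data are all illustrative rather than taken from the real response.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the second study deliberately has no OfficialTitle.
SAMPLE_XML = """<FullStudiesResponse>
  <FullStudy Rank="1">
    <Field Name="NCTId">NCT00000001</Field>
    <Field Name="OfficialTitle">Hypothetical Trial One</Field>
  </FullStudy>
  <FullStudy Rank="2">
    <Field Name="NCTId">NCT00000002</Field>
  </FullStudy>
</FullStudiesResponse>"""


def field_text(study, field_name):
    """Return the stripped text of the first matching field, or None if absent."""
    field = study.find("field", attrs={"name": field_name})
    return field.get_text(strip=True) if field is not None else None


def extract_trials(xml_text):
    """Return one dict per study so each ID stays with its own title."""
    # The "lxml" HTML parser lowercases tag names, hence "fullstudy" and "field".
    soup = BeautifulSoup(xml_text, "lxml")
    return [
        {
            "nct_id": field_text(study, "NCTId"),
            "official_title": field_text(study, "OfficialTitle"),
        }
        for study in soup.find_all("fullstudy")
    ]


trials = extract_trials(SAMPLE_XML)
for trial in trials:
    print(trial["nct_id"], "-", trial["official_title"])
```

With the live data, extract_trials(response.content) would be the equivalent call, provided the response really does wrap each trial in its own element.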