From an XML file like this I am trying to extract the PMID, the NCT ID, and the publication type. Four sample files are here; one of them has an NCT ID.
<PMID Version="1">144418</PMID>
<PublicationType UI="D016428">Journal Article</PublicationType>
<AccessionNumber>NCT03070782</AccessionNumber>
Ideally I want to end up with a pandas DataFrame. Expected output:
PMID  Publication_type  NCTID
1     Journal article   NCT03070782
2     Journal article   NaN
3     Journal article   NaN
But if somebody can at least tell me how to extract these for one file, I would greatly appreciate that too! I will figure out how to put it into a DataFrame.
CodePudding user response:
- Use glob to iterate through all XML files.
- Use BeautifulSoup to parse the XML content.
- Use soup.find() and soup.find_all() to find elements in the XML.
- Use .text to get the string from the text node under an element (it is a property, not a method, so no parentheses).
- Handle the exception with try and except for NCTID, since not every file has one.
- Store the content of each file as a dict and append it to a list.
- Use pd.DataFrame(<list>) to create a dataframe from that list.
Note that each PMID might have multiple Publication_type values, so use explode() to split the list of Publication_type values into multiple rows, each referring to the same PMID.
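To make the explode() step concrete, here is a minimal sketch on two hand-written records. The dicts are hypothetical stand-ins for what the parsing loop would collect; only the column names and values come from the thread.

```python
import pandas as pd

# Hypothetical records, shaped like the dicts built for each XML file.
records = [
    {"PMID": "144418", "Publication_type": ["Journal Article"], "NCTID": None},
    {"PMID": "272056", "Publication_type": ["English Abstract", "Journal Article"], "NCTID": None},
]

df = pd.DataFrame(records)
# One row per (PMID, publication type) pair; PMID and NCTID are repeated.
exploded = df.explode("Publication_type", ignore_index=True)
print(exploded)
```

The second record ends up on two rows, one per publication type, which is why a PMID can appear more than once in the final dataframe.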
Code:
import pandas as pd
from glob import glob
from bs4 import BeautifulSoup

records = []
for f in glob('*.xml'):
    pub = {}
    with open(f, 'r') as xml_file:
        xml = xml_file.read()
    # The "lxml" (HTML) parser lowercases tag names, hence the lowercase lookups below.
    soup = BeautifulSoup(xml, "lxml")
    pub['PMID'] = soup.find('pmid').text
    # A record can list several publication types; collect them all.
    pub_list = soup.find('publicationtypelist')
    pub['Publication_type'] = []
    for pub_type in pub_list.find_all('publicationtype'):
        pub['Publication_type'].append(pub_type.text)
    # find() returns None when the tag is absent, so .text raises AttributeError.
    try:
        pub['NCTID'] = soup.find('accessionnumber').text
    except AttributeError:
        pub['NCTID'] = None
    records.append(pub)

df = pd.DataFrame(records)
# One row per (PMID, publication type) pair.
df = df.explode('Publication_type', ignore_index=True)
Output:
   PMID      Publication_type                  NCTID
0  144418    Journal Article                   None
1  272056    English Abstract                  None
2  272056    Journal Article                   None
3  349115    Editorial                         None
4  349115    Historical Article                None
5  31893580  Clinical Trial, Phase II          NCT03070782
6  31893580  Journal Article                   NCT03070782
7  31893580  Multicenter Study                 NCT03070782
8  31893580  Randomized Controlled Trial       NCT03070782
9  31893580  Research Support, Non-U.S. Gov't  NCT03070782
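Since the question also asks how to handle just one file, here is a sketch of the same extraction using only the standard library's xml.etree.ElementTree instead of BeautifulSoup. This swaps in a different parser from the one used above. The SAMPLE string is a simplified, assumed layout built around the three elements shown in the question, and extract_record is a hypothetical helper name. Real files may nest these tags differently, which the `.//` search is meant to tolerate.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified layout; only the three tags come from the question.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID Version="1">31893580</PMID>
    <Article>
      <DataBankList>
        <DataBank>
          <AccessionNumberList>
            <AccessionNumber>NCT03070782</AccessionNumber>
          </AccessionNumberList>
        </DataBank>
      </DataBankList>
      <PublicationTypeList>
        <PublicationType UI="D016428">Journal Article</PublicationType>
        <PublicationType>Multicenter Study</PublicationType>
      </PublicationTypeList>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def extract_record(xml_text):
    """Return the PMID, publication types, and first accession number of one record."""
    root = ET.fromstring(xml_text)
    # ElementTree keeps the original tag case, unlike the "lxml" HTML parser.
    pmid = root.find(".//PMID")
    nct = root.find(".//AccessionNumber")
    return {
        "PMID": pmid.text if pmid is not None else None,
        "Publication_type": [pt.text for pt in root.findall(".//PublicationType")],
        # find() returns None when the tag is missing, so test explicitly.
        "NCTID": nct.text if nct is not None else None,
    }


record = extract_record(SAMPLE)
print(record)
```

For a file on disk, reading its contents and passing the text to extract_record should behave the same way. A file without an AccessionNumber element gets None for NCTID, with no try/except needed.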