Extracting pmid, nct_id, and publication type from PubMed XML in Python

From XML files like the one below, I am trying to extract the pmid, nct_id, and publication type. Four sample files are here; one of them has an NCT_ID.

<PMID Version="1">144418</PMID>
<PublicationType UI="D016428">Journal Article</PublicationType>
<AccessionNumber>NCT03070782</AccessionNumber>

Ideally, I would like to end up with a pandas DataFrame:

Expected output:

PMID    Publication_type  NCTID
1       Journal article   NCT03070782
2       Journal article   NaN
3       Journal article   NaN

But if somebody can at least show me how to extract these for one file, I would greatly appreciate that too! I will figure out how to put the results into a DataFrame.

CodePudding user response:

Use glob to iterate through all the XML files
Use BeautifulSoup to parse the XML content
Use soup.find() and soup.find_all() to find elements in the XML (the "lxml" parser lowercases tag names, so search for 'pmid' rather than 'PMID')
Use .text (a property, not a method) to get the string from the text node under an element
Use try and except to handle files that have no NCTID
Store the content of each file as a dict and append it to a list
Use pd.DataFrame(<list>) to create a DataFrame from that list
Note that each PMID can have multiple Publication_type values, so use explode() to split the list of Publication_type values into one row per value, each row keeping its PMID
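As a toy illustration of that last step on its own, here is a minimal sketch of what explode() does. The two records are typed in by hand (values taken from the output further down), so nothing is read from disk:

```python
import pandas as pd

# Hand-written records standing in for two parsed files
records = [
    {"PMID": "272056",
     "Publication_type": ["English Abstract", "Journal Article"],
     "NCTID": None},
    {"PMID": "144418",
     "Publication_type": ["Journal Article"],
     "NCTID": None},
]

df = pd.DataFrame(records)

# Each element of a Publication_type list becomes its own row;
# the other columns are repeated for every new row.
exploded = df.explode("Publication_type", ignore_index=True)
print(exploded)
```

The first record has two publication types, so it ends up as two rows; ignore_index=True renumbers the rows from 0.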

Code:

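import pandas as pd
from glob import glob
from bs4 import BeautifulSoup

records = []  # one dict per XML file

for f in glob('*.xml'):
    pub = dict()

    with open(f, 'r') as xml_file:
        xml = xml_file.read()

    # The "lxml" parser lowercases tag names, hence the lowercase lookups below
    soup = BeautifulSoup(xml, "lxml")
    pub['PMID'] = soup.find('pmid').text
    pub_list = soup.find('publicationtypelist')
    # An article can have several publication types, so collect them in a list
    pub['Publication_type'] = list()
    for pub_type in pub_list.find_all('publicationtype'):
        pub['Publication_type'].append(pub_type.text)
    try:
        pub['NCTID'] = soup.find('accessionnumber').text
    except AttributeError:
        # find() returned None: this file has no <AccessionNumber> element
        pub['NCTID'] = None
    records.append(pub)

df = pd.DataFrame(records)
# One row per (PMID, publication type) pair
df = df.explode('Publication_type', ignore_index=True)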

Output:

       PMID      Publication_type                  NCTID
    0  144418    Journal Article                   None
    1  272056    English Abstract                  None
    2  272056    Journal Article                   None
    3  349115    Editorial                         None
    4  349115    Historical Article                None
    5  31893580  Clinical Trial, Phase II          NCT03070782
    6  31893580  Journal Article                   NCT03070782
    7  31893580  Multicenter Study                 NCT03070782
    8  31893580  Randomized Controlled Trial       NCT03070782
    9  31893580  Research Support, Non-U.S. Gov't  NCT03070782
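
For the single-file case the question asks about, here is a minimal sketch that swaps BeautifulSoup for the standard library's xml.etree.ElementTree. The function name extract_record and the trimmed-down SAMPLE_XML string are made up for illustration; real PubMed files contain many more elements. It also assumes that the NCT ID is the first AccessionNumber whose text starts with "NCT", since that element can hold accession numbers from other databanks as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed record built from the elements shown in the question
SAMPLE_XML = """<PubmedArticle>
  <PMID Version="1">31893580</PMID>
  <PublicationTypeList>
    <PublicationType UI="D016428">Journal Article</PublicationType>
  </PublicationTypeList>
  <AccessionNumber>NCT03070782</AccessionNumber>
</PubmedArticle>"""


def extract_record(xml_text):
    """Return the PMID, publication types, and NCT ID of one XML document."""
    root = ET.fromstring(xml_text)

    # ElementTree keeps the original tag case, so 'PMID' rather than 'pmid'.
    # findtext() returns None when no matching element exists.
    pmid = root.findtext(".//PMID")

    # iter() walks the whole tree, however deeply the elements are nested
    pub_types = [el.text for el in root.iter("PublicationType")]

    # Assumption: the NCT ID is the first accession number starting with "NCT"
    nct_id = next(
        (el.text for el in root.iter("AccessionNumber")
         if el.text and el.text.startswith("NCT")),
        None,
    )
    return {"PMID": pmid, "Publication_type": pub_types, "NCTID": nct_id}


print(extract_record(SAMPLE_XML))
```

For a file on disk, ET.parse(path).getroot() could take the place of ET.fromstring(). The returned dicts have the same keys as in the answer's code, so they can be collected in a list and passed to pd.DataFrame() in the same way.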