I asked this question before and got a perfect solution. Perfectly working code for multiple traditional XML files is below.
import pandas as pd
from glob import glob
from bs4 import BeautifulSoup

l = list()
for f in glob('*.xml'):
    pub = dict()
    with open(f, 'r') as xml_file:
        xml = xml_file.read()
    soup = BeautifulSoup(xml, "lxml")
    pub['PMID'] = soup.find('pmid').text
    pub_list = soup.find('publicationtypelist')
    pub['Publication_type'] = list()
    for pub_type in pub_list.find_all('publicationtype'):
        pub['Publication_type'].append(pub_type.text)
    try:
        pub['NCTID'] = soup.find('accessionnumber').text
    except AttributeError:  # article has no <AccessionNumber>
        pub['NCTID'] = None
    l.append(pub)

df = pd.DataFrame(l)
df = df.explode('Publication_type', ignore_index=True)
It gave me my desired output:
PMID Publication_type NCTID
0 34963793 Journal Article NCT02649218
1 34963793 Review NCT02649218
2 34535952 Journal Article None
3 34090787 Journal Article NCT02424799
4 33615122 Journal Article NCT01922037
The only thing that has changed since then: I extracted the data using R and the easyPubMed package. The data was extracted in batches (100 articles each) and stored as XML inside txt documents; I have 150 txt documents in total. Instead of ~25,000 rows, the code now extracts only ~250. How do I update the Python code above to get the same output now that the input files have changed? I attach several txt files here for reproducibility. I need to extract PMID, Publication_type, and NCTID.
CodePudding user response:
The previous code builds a data frame for an XML of a single article, not for an XML of hundreds of articles, so right now only the first article in each file is captured. You need to capture the selected nodes under every <PubmedArticle> instance in the XML.
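If you want to keep the BeautifulSoup approach, a minimal sketch of that fix would loop over every <pubmedarticle> and search within it, assuming the batch txt files parse the same way as before (the lxml parser lowercases tag names):

import pandas as pd
from glob import glob
from bs4 import BeautifulSoup

rows = []
for f in glob('*.txt'):                                # batch files: XML stored in txt docs
    with open(f, 'r') as fh:
        soup = BeautifulSoup(fh.read(), "lxml")        # lxml parser lowercases tag names
    for article in soup.find_all('pubmedarticle'):     # one iteration per article, not per file
        pub = {'PMID': article.find('pmid').text,
               'Publication_type': [pt.text for pt in article.find_all('publicationtype')]}
        acc = article.find('accessionnumber')
        pub['NCTID'] = acc.text if acc else None       # not every article has an AccessionNumber
        rows.append(pub)

df = pd.DataFrame(rows).explode('Publication_type', ignore_index=True)

This works, but it reads each whole file into memory at once.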
Consider instead etree's iterparse solution, which is less memory-intensive for reading large XML: extract the needed nodes between the opening and closing of each <PubmedArticle> node:
import pandas as pd
import xml.etree.ElementTree as ET
from glob import glob

data = []                                          # INITIALIZE DATA LIST
for xml_file in glob('*.txt'):
    for event, elem in ET.iterparse(xml_file, events=('start', 'end')):
        if event == 'start':
            if elem.tag == "PubmedArticle":
                pub = {}                           # INITIALIZE ARTICLE DICT
            if elem.tag == 'PMID':
                pub["PMID"] = elem.text
                pub["PublicationType"] = []
                pub["NCTID"] = None
            elif elem.tag == 'PublicationType':
                pub["PublicationType"].append(elem.text)
            elif elem.tag == 'AccessionNumber':
                pub["NCTID"] = elem.text
        if event == 'end':
            if elem.tag == "PubmedArticle":
                pub["Source"] = xml_file
                data.append(pub)                   # APPEND MULTIPLE ARTICLES
                elem.clear()                       # FREE MEMORY OF FINISHED ARTICLE

# BUILD XML DATA FRAME
final_df = (
    pd.DataFrame(data)
      .explode('PublicationType', ignore_index=True)
)
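As a quick sanity check (a sketch assuming the 150 batch txt files sit in the working directory), confirm the row count is now in the expected ~25,000 range and that every batch file contributed rows:

print(final_df.shape)                   # expect on the order of 25,000 rows after the explode
print(final_df['Source'].nunique())     # expect 150, one per batch txt file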