Parsing PubMed data and extracting multiple columns from multiple files

Time:05-18

I have multiple XML files from PubMed. Several files are here.

How can I parse them and get these columns into a single dataframe? If an article has several authors, I want each author on a separate row.

Expected output (all authors should be included):

Title  Year ArticleTitle     LastName ForeName
Nature 2021 Inter-mosaic ... Roy      Suva
Nature 2021 Inter-mosaic ... Pearson  John
Nature 2021 Neural dynamics  Pearson  John
Nature 2021 Neural dynamics  Mooney   Richard

CodePudding user response:

First, what you want is doable. Something like this should work for your second file, and you could add other files by wrapping the code with a for loop:

from lxml import etree
import pandas as pd

doc = etree.parse('file.xml')

# XML tag names; they double as the dataframe column names
columns = ['Title', 'ArticleDate', 'ArticleTitle', 'LastName', 'ForeName']

# Article-level fields: take the first match for each
title = doc.xpath(f'//{columns[0]}/text()')[0]
year = doc.xpath(f'//{columns[1]}//Year/text()')[0]
article_title = doc.xpath(f'//{columns[2]}/text()')[0]

# One row per author; findtext returns None if a name element is missing,
# so an author without a LastName or ForeName does not raise IndexError
rows = []
for auth in doc.xpath('//Author'):
    last_name = auth.findtext(columns[3])
    fore_name = auth.findtext(columns[4])
    rows.append([title, year, article_title, last_name, fore_name])

df = pd.DataFrame(rows, columns=columns)
df

Output (for 34671166.xml):

   Title   ArticleDate  ArticleTitle                                       LastName        ForeName
0  Nature  2021         Neural dynamics underlying birdsong practice a...  Singh Alvarado  Jonnathan
1  Nature  2021         Neural dynamics underlying birdsong practice a...  Goffinet        Jack
2  Nature  2021         Neural dynamics underlying birdsong practice a...  Michael         Valerie
3  Nature  2021         Neural dynamics underlying birdsong practice a...  Liberti         William
4  Nature  2021         Neural dynamics underlying birdsong practice a...  Hatfield        Jordan
5  Nature  2021         Neural dynamics underlying birdsong practice a...  Gardner         Timothy
6  Nature  2021         Neural dynamics underlying birdsong practice a...  Pearson         John
7  Nature  2021         Neural dynamics underlying birdsong practice a...  Mooney          Richard
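
The for loop over several files could look roughly like the sketch below. To keep it dependency-free it uses the standard library's xml.etree.ElementTree rather than lxml, and it assumes each file holds one or more PubmedArticle elements (possibly wrapped in a PubmedArticleSet). The helper names author_rows and collect_rows are made up for this example.

```python
import glob
import xml.etree.ElementTree as ET


def author_rows(path):
    """Return one dict per (article, author) pair found in one XML file."""
    root = ET.parse(path).getroot()
    rows = []
    # iter() also matches the root itself, so this handles a lone
    # <PubmedArticle> as well as a <PubmedArticleSet> wrapper (assumed layout).
    for article in root.iter("PubmedArticle"):
        # findtext returns None when the element is missing
        title = article.findtext(".//Title")
        year = article.findtext(".//ArticleDate/Year")
        article_title = article.findtext(".//ArticleTitle")
        for auth in article.iter("Author"):
            rows.append({
                "Title": title,
                "Year": year,
                "ArticleTitle": article_title,
                "LastName": auth.findtext("LastName"),
                "ForeName": auth.findtext("ForeName"),
            })
    return rows


def collect_rows(pattern):
    """Gather author rows from every file matching a glob pattern."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        rows.extend(author_rows(path))
    return rows
```

Passing the result of something like collect_rows('*.xml') to pd.DataFrame should then give a single dataframe with one row per author across all files.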

Having said all that, I'm not sure a dataframe with each author on a separate row is the best idea for this kind of data. In this example, with 8 co-authors, article-level information such as the article title is repeated 8 times. You could instead give each author a separate set of columns, but then the number of columns is no longer fixed, since one article may have 3 co-authors and another 10.
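
One possible middle ground, sketched below in plain Python, is to keep one record per article and store the authors as a list inside it. The sample rows and the helper name group_by_article are invented for illustration; the 'Year' key is likewise just a label chosen for this example.

```python
def group_by_article(rows):
    """Collapse per-author rows into one record per article."""
    articles = {}
    for row in rows:
        key = (row["Title"], row["Year"], row["ArticleTitle"])
        # Create the article record the first time its key is seen
        record = articles.setdefault(key, {
            "Title": row["Title"],
            "Year": row["Year"],
            "ArticleTitle": row["ArticleTitle"],
            "Authors": [],
        })
        record["Authors"].append((row["LastName"], row["ForeName"]))
    # dicts keep insertion order, so articles come out in first-seen order
    return list(articles.values())


# Hypothetical per-author rows, shaped like the expected output in the question
rows = [
    {"Title": "Nature", "Year": "2021", "ArticleTitle": "Inter-mosaic",
     "LastName": "Roy", "ForeName": "Suva"},
    {"Title": "Nature", "Year": "2021", "ArticleTitle": "Inter-mosaic",
     "LastName": "Pearson", "ForeName": "John"},
    {"Title": "Nature", "Year": "2021", "ArticleTitle": "Neural dynamics",
     "LastName": "Pearson", "ForeName": "John"},
]

articles = group_by_article(rows)
```

Each article then appears once, and a dataframe built from these records can still be expanded back to one row per author later (for instance with DataFrame.explode on the Authors column) if that shape is needed for a particular analysis.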
