This is my first time parsing an XML file, and I'm following both this pandas explanation and this SO question. I have an XML file from PubMed (any should work, but I downloaded the first one: pubmed22n1115.xml). The file is far more deeply nested than the examples in the SO/pandas explanations, and I can't work out how to parse it.
What I tried is:
import pandas as pd
df = pd.read_xml('../../Downloads/pubmed22n1115.xml')
df.head()
>>>
MedlineCitation PubmedData PMID
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
All the other examples I looked at for parsing XML files were tied to the structure of one specific file, and I couldn't adapt them to this one.
The only two things I need from this file are PMID and AbstractText. The expected output is a pandas DataFrame that looks like:
PMID AbstractText
0 1212 text1
1 1233 text2
CodePudding user response:
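You need to drill down into that huge XML file to reach the relevant data. pd.read_xml only reads the immediate children and attributes of the nodes selected by xpath, and the default xpath ("./*") selects the top-level records, whose children are themselves nested containers; that is why every value came back as NaN. Pass an xpath that targets the elements you want, like so (this uses a different XML file downloaded from that link):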
import pandas as pd
df = pd.read_xml('pubmed22n1123.xml/pubmed22n1123.xml', xpath=".//PMID")
print(df)
This prints the following (the Version column comes from the Version attribute of each PMID element):
Version PMID
0 1 14584002
1 1 16916636
2 1 34919821
3 1 17541330
4 1 17643379
... ... ...
18359 1 34919510
18360 1 34919742
18361 1 34919747
18362 1 34919751
18363 1 34919752
The following pandas documentation might be helpful:
https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html
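To illustrate the difference between the default call and a targeted xpath, here is a minimal sketch on a tiny inline document. The sample XML is made up and only imitates the nesting of the PubMed file; parser="etree" is used so the snippet does not depend on lxml being installed.

```python
import io

import pandas as pd

# Hypothetical miniature of the PubMed layout (not real records).
SAMPLE = """<PubmedArticleSet>
<PubmedArticle><MedlineCitation><PMID Version="1">1212</PMID></MedlineCitation><PubmedData/></PubmedArticle>
<PubmedArticle><MedlineCitation><PMID Version="1">1233</PMID></MedlineCitation><PubmedData/></PubmedArticle>
</PubmedArticleSet>"""

# Default xpath ("./*"): one row per PubmedArticle. Its children are
# containers with no text of their own, so every cell is missing.
shallow = pd.read_xml(io.StringIO(SAMPLE), parser="etree")

# Targeted xpath: one row per PMID element, with its Version attribute
# and its text as columns.
pmids = pd.read_xml(io.StringIO(SAMPLE), xpath=".//PMID", parser="etree")

print(shallow)
print(pmids)
```

The first frame reproduces the all-NaN result from the question on a small scale; the second has the same Version and PMID columns as the output above.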
EDIT: You can get AbstractText with:
df = pd.read_xml('pubmed22n1123.xml/pubmed22n1123.xml', xpath=".//AbstractText")
print(df)
Resulting in:
     Label                         NlmCategory  AbstractText                                       sup   i     sub   b     u     {http://www.w3.org/1998/Math/MathML}math
0    BACKGROUND                    BACKGROUND   Kawasaki disease is the most common cause of a...  None  None  None  None  None  NaN
1    OBJECTIVES                    OBJECTIVE    The objective of this review was to evaluate t...  None  None  None  None  None  NaN
2    SEARCH STRATEGY               METHODS      Electronic searches of the Cochrane Peripheral...  None  None  None  None  None  NaN
3    SELECTION CRITERIA            METHODS      Randomised controlled trials of intravenous im...  None  None  None  None  None  NaN
4    DATA COLLECTION AND ANALYSIS  METHODS      Fifty-nine trials were identified in the initi...  None  None  None  None  None  NaN
...  ...                           ...          ...                                                ...   ...   ...   ...   ...   ...
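Neither call gives the two-column frame from the question, and joining the two results by row position is unreliable: an article can have several AbstractText sections (or none), and .//PMID may also match PMID elements that appear elsewhere in a record. A minimal sketch of one alternative is to walk the articles with the standard library's ElementTree and build one row per article. It assumes the usual PubMed nesting (PubmedArticle > MedlineCitation > PMID, and MedlineCitation > Article > Abstract > AbstractText); the function name extract_pmid_abstracts and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def extract_pmid_abstracts(source):
    """Return one {'PMID', 'AbstractText'} dict per PubmedArticle.

    `source` is a file path or a binary file-like object.
    """
    rows = []
    # iterparse streams the file and reports each element once it is complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        # The article's own PMID is a direct child of MedlineCitation.
        pmid = elem.findtext("MedlineCitation/PMID")
        # An abstract may be split into several labelled sections;
        # itertext() also collects text inside inline tags such as <i>.
        parts = [
            "".join(node.itertext()).strip()
            for node in elem.findall("MedlineCitation/Article/Abstract/AbstractText")
        ]
        rows.append({"PMID": pmid, "AbstractText": " ".join(parts) or None})
        elem.clear()  # drop the processed article's children to save memory
    return rows


# Made-up miniature document, only to show the call.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1212</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">First <i>part</i>.</AbstractText>
          <AbstractText Label="METHODS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1233</PMID>
      <Article/>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

rows = extract_pmid_abstracts(io.BytesIO(SAMPLE))
print(rows)
```

Passing the result to pd.DataFrame, for example pd.DataFrame(extract_pmid_abstracts('pubmed22n1123.xml/pubmed22n1123.xml')), should then give a frame with just the PMID and AbstractText columns. Note that the PMIDs come back as strings, and articles without an abstract get None.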