<?xml version="2.0" encoding="UTF-8" ?><timestamp="20220113">
<defintions>
<defintion id="1" old_id="0">Lang</defintion>
<defintion id="7" old_id="1">Eng</defintion>
How can I parse an XML file that looks like this? Here, I have multiple values within a single tag. I want to extract values such as "id" and "old_id" in a list or dataframe format.
Edit (Case 2):
<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" new_id="12">
<level>1&1</level>
<typ>Green</typ>
<name>Alpha</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
<offer id="12" new_id="31">
<level>1&1</level>
<typ>Yellow</typ>
<name>Beta</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
</details>
</main_heading>
Expected Output
| timestamp | id | new_id | level | name |
|---|---|---|---|---|
| 20220113 | 11 | 12 | 1&1 | Alpha |
| 20220113 | 12 | 31 | 1&1 | Beta |
where the name nested within the "visits" tag is not included in the output.
timestamp = soup.find('main_heading').get('timestamp')
df['timestamp'] = timestamp
This solves one part (though here the column ends up at the end instead of the beginning).
CodePudding user response:
You could use BeautifulSoup with the xml parser to reach your goal: simply select the elements needed and iterate over the ResultSet to extract the attribute values via .get().
with open('filename.xml', 'r') as f:
    file = f.read()
soup = BeautifulSoup(file, 'xml')
Example
from bs4 import BeautifulSoup
import pandas as pd
xml = '''<?xml version="2.0" encoding="UTF-8" ?><timestamp="20220113">
<defintions>
<defintion id="1" old_id="0">Lang</defintion>
<defintion id="7" old_id="1">Eng</defintion>
'''
soup = BeautifulSoup(xml, 'xml')
pd.DataFrame(
    [
        (e.get('id'), e.get('old_id'))
        for e in soup.select('defintion')
    ],
    columns=['id', 'old_id']
)
Output
| | id | old_id |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 7 | 1 |
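For Case 2 in the question, a similar extraction can be sketched with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is only a sketch under assumptions: the ampersand in <level> is written as &amp; (a bare & is not well-formed XML, and a strict parser rejects it), and parse_offers and xml_case2 are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Case 2 data from the question. Assumption: the ampersand in <level> is
# escaped as &amp; so that the document is well-formed XML.
xml_case2 = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" new_id="12">
<level>1&amp;1</level>
<typ>Green</typ>
<name>Alpha</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
<offer id="12" new_id="31">
<level>1&amp;1</level>
<typ>Yellow</typ>
<name>Beta</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
</details>
</main_heading>'''


def parse_offers(xml_text):
    """Return one dict per <offer>, with the root timestamp as the first key."""
    root = ET.fromstring(xml_text)
    timestamp = root.get('timestamp')
    rows = []
    for offer in root.iter('offer'):
        rows.append({
            'timestamp': timestamp,
            'id': offer.get('id'),
            'new_id': offer.get('new_id'),
            # A bare tag name only matches direct children of <offer>,
            # so the <name> nested inside <visits> is skipped.
            'level': offer.findtext('level'),
            'name': offer.findtext('name'),
        })
    return rows


rows = parse_offers(xml_case2)
for row in rows:
    print(row)
```

Because each row is a dict with timestamp as its first key, passing rows to pd.DataFrame(rows) should give the columns in the order timestamp, id, new_id, level, name, which also puts timestamp at the beginning rather than the end.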
CodePudding user response:
Using Python's Beautiful Soup, you could parse the .xml file into a BeautifulSoup object and then use .find_all('defintion'). Then loop through the tags you find and get the desired values:
# soup is the BeautifulSoup object parsed from the file
defintions = soup.find_all('defintion')
for defintion in defintions:
    old_id = defintion['old_id']
    id = defintion['id']
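The loop above reads the values but does not keep them between iterations. A minimal sketch that collects them into a list follows, using the standard library's xml.etree.ElementTree rather than Beautiful Soup. It assumes a well-formed variant of the question's first snippet (the timestamp moved onto the root element as an attribute, and the root element closed); xml_case1 and records are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a well-formed variant of the first snippet in the question,
# with timestamp as an attribute of the root and a closing </defintions>.
xml_case1 = '''<?xml version="1.0" encoding="UTF-8" ?>
<defintions timestamp="20220113">
<defintion id="1" old_id="0">Lang</defintion>
<defintion id="7" old_id="1">Eng</defintion>
</defintions>'''

root = ET.fromstring(xml_case1)

# One tuple per <defintion>: both attribute values plus the element text.
records = [
    (elem.get('id'), elem.get('old_id'), elem.text)
    for elem in root.iter('defintion')
]
print(records)
```

The attribute values come back as strings, so converting them with int() may be needed if numeric columns are wanted.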
references: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ https://linuxhint.com/parse_xml_python_beautifulsoup/