How to parse XML with multiple attribute values within a single tag to DataFrame?-CodePudding

<?xml version="2.0" encoding="UTF-8" ?><timestamp="20220113">
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>

How can I parse an XML file that looks like this? Here, I have multiple values within a single tag. I want to extract values such as "ID", and "OLD_ID" in a list or dataframe format.

Edit Case2 :

<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" new_id="12">
        <level>1&amp;1</level>
        <typ>Green</typ>
        <name>Alpha</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
    <offer id="12" new_id="31">
        <level>1&amp;1</level>
        <typ>Yellow</typ>
        <name>Beta</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
</details>
</main_heading>

Expected Output

timestamp   id     new_id   level      name
20220113    11     12       1&amp;1    Alpha
20220113    12     31       1&amp;1    Beta

where NAME nested within the "visits" tag is not included in the count

timestamp = soup.find('main_heading').get('timestamp')
df[timestamp'] = timestamp

this solves one part (though here the column is coming at end instead of the beginning)

CodePudding user response：

You could use BeautifulSoup and xml parser to get your goal, simply select the elements needed and iterate ResultSet to extract attribute values via .get().

with open('filename.xml', 'r') as f:
    file = f.read() 
    soup = BeautifulSoup(file, 'xml')

Example

from bs4 import BeautifulSoup
import pandas as pd

xml = '''<?xml version="2.0" encoding="UTF-8" ?><timestamp="20220113">
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
'''
soup = BeautifulSoup(xml,'xml')


pd.DataFrame(
    [
        (e.get('id'),e.get('old_id'))
        for e in soup.select('defintion')
    ],
    columns = ['id','old_id']
)

Output

	id	old_id
0	1	0
1	7	1

CodePudding user response：

Using python Beautiful Soup, you could parse the .xml file to a Beatuful soup object and then use .findAll('defintions'). Then loop through the tags you find and get the desired values

object.findAll('defintions')

for defintion in defintions:
    old_id = defintions['old_id']
    id = defintions['id']

references: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ https://linuxhint.com/parse_xml_python_beautifulsoup/