Unable to extract the data properly based on condition-CodePudding

I need to extract the nodes which are having RNAME- Applicable and their corresponding accounts.

<Reports>
   <Report>
      <ID>A1234</ID>
      <RNAME>Not Applicable</RNAME>
      <Accounts>
         <Account>
            <AID>12345</AID>
            <A2>TEST1</A2>
         </Account>
      </Accounts>
   </Report>
   <Report>
      <ID>A12345</ID>
      <RNAME>Applicable</RNAME>
      <Accounts>
         <Account>
            <AID>123456789</AID>
            <A2>TEST2</A2>
         </Account>
      </Accounts>
   </Report>
</Reports>

Here is the expected output for data frame

ID	AID	A2
A12345	123456789	TEST2

Here is my code:

for x in ET.fromstring(sample.xml).findall(".//Report"):
    RNAME = x.find("RNAME").text
    if RNAME != "Not Applicable":
       ID = x.find("ID").text print(ID) # It is printing A12345
    for y in ET.fromstring(content).findall('.//Report/Accounts/Account'):
        AID = y.find("AID").text
        A2 = y.find("A2").text
        print(AID, A2) # it is printing all 12345,TEST1 and 12345678 ,TEST2

CodePudding user response：

first install xmltodict

pip install xmltodict

then convert xml into json

import json
import xmltodict
with open("test.xml") as xml_file:
     data_dict = xmltodict.parse(xml_file.read())
     json_data = json.dumps(data_dict)
     ### here you have the data as json you can easily deal with them and write them to csv file ###

CodePudding user response：

Or you can try parsing the XML like this:

import pandas as pd
import xml.etree.ElementTree as ET

root = ET.fromstring('<Reports> <Report> <ID>A1234</ID> <RNAME>Not Applicable</RNAME> <Accounts> <Account> <AID >12345</AID > <A2>TEST1</A2> </Account> </Accounts> </Report> <Report> <ID>A12345</ID> <RNAME>Applicable</RNAME> <Accounts> <Account> <AID >123456789</AID > <A2>TEST2</A2> </Account> </Accounts> </Report> </Reports>')

mydata = pd.DataFrame(data = {'id': [r.text for r in root.findall('.//ID')],
                        'aid': [r.text for r in root.findall('.//AID')],
                        'a2': [r.text for r in root.findall('.//A2')]})

mydata.to_csv(path_here)