Home > Mobile >  Parsing multiple xml files in python and appending the data to a Python DataFrame
Parsing multiple xml files in python and appending the data to a Python DataFrame

Time:05-10

I'm trying to create a data frame from multiple nested xml files and append the data to a single data frame. I know the structure of the data frame and have defined it.

tree_list = []
details = ['FirstName','LastName','City','Country']


for file in bucket_list:
    obj = s3.Object(s3_bucket_name,file)
    data = (obj.get()['Body'].read())
    tree_list.append(ET.ElementTree(ET.fromstring(data)))

def parse_XML(list_of_trees, df_cols): 
    
    for tree in tree_list:
        xroot = tree.getroot()
        rows = []
    
    
    
    for node in xroot: 
        res = []
        for el in df_cols[0:]: 
            if node is not None and node.find(el) is not None:
                res.append(node.find(el).text)
            else: 
                res.append(None)
        rows.append({df_cols[i-1]: res[i-1] 
                     for i, _ in enumerate(df_cols)})
    
    out_df = pd.DataFrame(rows, columns=df_cols)
        
    return out_df

parse_XML(tree_list,details)

In my output data frame, I'm getting the information of the last file read and a couple of blank rows like the following:

    FirstName LastName    City     Country
    Ted       Mosbey      Washington  USA
    None      None        None       None
    None      None        None       None

What changes should be made in the code to read all the files, append to the data frame and remove the unnecessary rows? Any suggestion to process the files efficiently is appreciated.

XML example:

<PD>
  <Clt>
    <PType>xxxx</PType>
    <PNumber>xxxxx</PNumber>
    <UID>xxxx</UID>
    <TEfd>xxxxx</TEfd>
    <TExd>xxxxxx</TExd>
    <DID>xxxxx</DID>
    <CType>xxxxx</CType>
    <FirstName>Ted</FirstName>
    <MiddleName></MiddleName>
    <LastName>Mosbey</LastName>
    <MailingAddrLocation>Home</MailingAddrLocation>
    <AddressLine1>3435</AddressLine1>
    <AddressLine2>Columbia RD</AddressLine2>
    <AddressLine3></AddressLine3>
    <City>Washington</City>
    <State>DC</State>
    <ZipCode>20009</ZipCode>
    <Country>USA</Country>
    <Pr>
      <PrType>xxxxx</PrType>
      <PrName>xxxxxx</PrName>
      <PrID>xxxxxx</PrID>
    </Pr>
</Clt>
</PD>

CodePudding user response:

So now when I have your data sample, I tested this and it works for me just as I think you want it to be:

def parse_XML(list_of_trees, df_cols):

    def get_el(el_list):
        if len(el_list) > 1:
            return [el_text.text for el_text in el_list]
        else:
            return el_list[0].text
    rows = []
    for tree in list_of_trees:
        xroot = tree.getroot()

        for node in xroot:
            res = []
            for el in df_cols[0:]:
                if node is not None and node.findall(f".//{el}") is not None:
                    res.append(get_el(node.findall(f".//{el}")))
            rows.append(res)

    out_df = pd.DataFrame(rows, columns=df_cols)

    return out_df
  • Related