Creating dataframe from XML file with non-unique tags

Time:02-05

I have a directory of XML files, and I need to extract 4 values from each file and store them in a dataframe/CSV.

The problem is that some of the data I need comes from non-unique tags (e.g., <PathName> appears three times in the sample below), so I'm not sure of the best way to do this. I could extract by exact line number, because that appears consistent across the files I have seen, but I am not certain it will always be the case, so that approach is too brittle.

<?xml version="1.0" encoding="utf-8"?>
<BxfMessage xsi:schemaLocation="http://smpte-ra.org/schemas/2021/2019/BXF BxfSchema.xsd" id="jffsdfs" dateTime="2023-02-02T20:11:38Z" messageType="Info" origin="url" originType="Delivery" userName="ABC Corp User" destination=" System" xmlns="http://sffe-ra.org/schema/1999/2023/BXF" xmlns:xsi="http://www.w9.org/4232/XMLSchema-instance">
  <BxfData action="Spotd">
    <Content timestamp="2023-02-02T20:11:38Z">
      <NonProgramContent>
        <Details>
          <SpotType>Paid</SpotType>
          <SpotType>Standard</SpotType>
          <Spotvertiser>
            <SpotvertiserName>Spot Plateau</SpotvertiserName>
          </Spotvertiser>
          <Agency>
            <AgencyName>Spot Plateau</AgencyName>
          </Agency>
          <Product>
            <Name></Name>
            <BrandName>zzTop</BrandName>
            <DirectResponse>
              <PhoneNo></PhoneNo>
              <PCode></PCode>
              <DR_URL></DR_URL>
            </DirectResponse>
          </Product>
        </Details>
        <ContentMetSpotata>
          <ContentId>
            <BHGXId idType="CISC" auth="Agency">AAAA1111999Z</BHGXId>
          </ContentId>
          <Name>Pill CC Dutch</Name>
          <Policy>
            <PlatformType>Spotcast</PlatformType>
          </Policy>
          <Media>
            <BaseBand>
              <Audio VO="true">
                <AnalogAudio primAudio="false" />
                <DigitalAudio>
                  <MPEGLayerIIAudio house="false" audioId="1" dualMono="false" />
                </DigitalAudio>
              </Audio>
              <Video withlate="false" sidebend="false">
                <Format>1182v</Format>
                <CCs>true</CCs>
              </Video>
              <AccessServices>
                <AudioDescription_DVS>false</AudioDescription_DVS>
              </AccessServices>
              <QC>Passed QC (AAAA1111103H )</QC>
            </BaseBand>
            <MediaLocation sourceType="Primary">
              <Location>
                <AssetServer PAA="true" FTA="true">
                  <PathName>zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb.mp4</PathName>
                </AssetServer>
              </Location>
              <SOM>
                <SmpteTimeCode>00:00:00;00</SmpteTimeCode>
              </SOM>
              <Duration>
                <SmpteDuration>
                  <SmpteTimeCode>00:00:30;00</SmpteTimeCode>
                </SmpteDuration>
              </Duration>
            </MediaLocation>
            <MediaLocation sourceType="Proxy" qualifer="Low-res">
              <Location>
                <AssetServer PAA="true" FTA="true">
                  <PathName>https://app.url.com/DMM/DL/wew52f</PathName>
                </AssetServer>
              </Location>
              <SOM>
                <SmpteTimeCode>00:00:00;00</SmpteTimeCode>
              </SOM>
              <Duration>
                <SmpteDuration>
                  <SmpteTimeCode>00:00:30;00</SmpteTimeCode>
                </SmpteDuration>
              </Duration>
            </MediaLocation>
            <MediaLocation sourceType="Preview" qualifer="Thumbnail">
              <Location>
                <AssetServer PAA="true" FTA="true">
                  <PathName>https://f9-int-5.rainxyz.com/url.com/media/t43fs/423gs-389a-40a4.jpg?inline</PathName>
                </AssetServer>
              </Location>
              <SOM>
                <SmpteTimeCode>00:00:00;00</SmpteTimeCode>
              </SOM>
              <Duration>
                <SmpteDuration>
                  <SmpteTimeCode>00:00:00;00</SmpteTimeCode>
                </SmpteDuration>
              </Duration>
            </MediaLocation>
          </Media>
        </ContentMetSpotata>
      </NonProgramContent>
    </Content>
  </BxfData>
</BxfMessage>

Is there a more flexible method so that I can get consistent output like:

FileName                                Brand   ID              URL
zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb zzTop   AAAA1111999Z    https://app.url.com/DMM/DL/wew52f
zzTap_zzTab_BAAA1111999Z_30s_Pill_aa-cc zzTab   BAAA1111999Z    https://app.url.com/DMM/DL/wew52c
zzTap_zzTan_CAAA1111999Z_30s_Pill_aa-dd zzTan   CAAA1111999Z    https://app.url.com/DMM/DL/wew523
zzTap_zzTon_DAAA1111999Z_30s_Pill_aa-zz zzTon   DAAA1111999Z    https://app.url.com/DMM/DL/wew52y

CodePudding user response:

What does your code look like? Here is my attempt.

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("zzTab.xml")
root = tree.getroot()

ns = "{http://sffe-ra.org/schema/1999/2023/BXF}"
list_of_interest = [f"{ns}PathName", f"{ns}BHGXId", f"{ns}BrandName"]

PathName_dir_list = []
PathName_file_list = []
BHGXId_list = []
BrandName_list = []

for elem in root.iter():
    if elem.tag not in list_of_interest:
        continue
    # elem.text is None for empty elements, so fall back to ""
    text = elem.text or ""
    if elem.tag == f"{ns}PathName":
        if '.mp4' in text:
            # the primary media file name
            PathName_file_list.append(text)
        else:
            # the proxy and preview URLs
            PathName_dir_list.append(text)
    elif elem.tag == f"{ns}BHGXId":
        BHGXId_list.append(text)
    elif elem.tag == f"{ns}BrandName":
        BrandName_list.append(text)

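# zip stops at the shortest list, so only the first non-.mp4 PathName
# (the Proxy URL in the sample file) ends up in the 'Path' column.
t = zip(PathName_dir_list, PathName_file_list, BHGXId_list, BrandName_list)
list_of_tuples = list(t)

df = pd.DataFrame(list_of_tuples, columns=['Path', 'File', 'ID', 'Brand'])
# index=False keeps the row index out of the CSV
df.to_csv('file_list.csv', index=False)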
print(df)
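
The loop above tells the three <PathName> elements apart by checking for '.mp4' in the text. An alternative that may be less brittle is to select each <PathName> by the sourceType attribute of its enclosing <MediaLocation>. Below is a minimal sketch with ElementTree, assuming the element nesting and namespace URI match the sample file; extract_fields and text_of are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute of the sample file.
NS = {"bxf": "http://sffe-ra.org/schema/1999/2023/BXF"}

# Path from a MediaLocation down to its PathName, as nested in the sample.
PATH_NAME = "bxf:Location/bxf:AssetServer/bxf:PathName"


def extract_fields(root):
    """Return (file_name, brand, id, url) for one parsed BxfMessage root.

    Any value whose element is missing or empty comes back as None.
    """
    def text_of(path):
        node = root.find(path, NS)
        if node is None or not node.text:
            return None
        return node.text.strip()

    # Pick each PathName by the sourceType of its MediaLocation parent.
    file_name = text_of(f".//bxf:MediaLocation[@sourceType='Primary']/{PATH_NAME}")
    url = text_of(f".//bxf:MediaLocation[@sourceType='Proxy']/{PATH_NAME}")
    brand = text_of(".//bxf:BrandName")
    id_ = text_of(".//bxf:BHGXId")

    # Drop the extension to match the FileName column in the question.
    if file_name and file_name.endswith(".mp4"):
        file_name = file_name[: -len(".mp4")]
    return file_name, brand, id_, url
```

For a single file this would be called as extract_fields(ET.parse("zzTab.xml").getroot()). Returning None for a missing element, instead of raising, is a design choice of this sketch so that one incomplete file does not stop a batch run.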

CodePudding user response:

To parse one XML file using BeautifulSoup (its 'xml' parser requires lxml to be installed), you can use this example:

from bs4 import BeautifulSoup

def get_info(xml_file):
    with open(xml_file, 'r') as f_in:
        soup = BeautifulSoup(f_in.read(), 'xml')

    file_name = soup.find(lambda tag: tag.name == 'PathName' and '.mp4' in tag.text).text.rsplit('.mp4', maxsplit=1)[0]
    url = soup.select_one('[sourceType="Proxy"] PathName').text
    brand_name = soup.select_one('BrandName').text
    id_ = soup.select_one('BHGXId').text

    return file_name, brand_name, id_, url

print(get_info('your_file.xml'))

Prints:

('zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb', 'zzTop', 'AAAA1111999Z', 'https://app.url.com/DMM/DL/wew52f')
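
The question asks about a whole directory, while get_info handles one file. Here is a rough sketch of the loop around it, assuming the files end in .xml and sit directly in one folder. collect_rows, write_csv and COLUMNS are hypothetical names, and extract stands for any function shaped like get_info (path in, 4-tuple out).

```python
import csv
from pathlib import Path

# Column order matches the tuple returned by get_info.
COLUMNS = ["FileName", "Brand", "ID", "URL"]


def collect_rows(directory, extract):
    """Apply `extract` to every *.xml file in `directory`.

    Returns (rows, failed): rows is a list of dicts keyed by COLUMNS,
    failed is a list of (file name, error) pairs for files that raised.
    """
    rows, failed = [], []
    for path in sorted(Path(directory).glob("*.xml")):
        try:
            rows.append(dict(zip(COLUMNS, extract(str(path)))))
        except Exception as exc:
            # e.g. get_info raises AttributeError when a tag is missing
            failed.append((path.name, repr(exc)))
    return rows, failed


def write_csv(rows, out_path):
    """Write the collected rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

With the function above this would be called as rows, failed = collect_rows("xml_dir", get_info), where "xml_dir" is a placeholder folder name. If a dataframe is wanted instead of a plain CSV, pd.DataFrame(rows, columns=COLUMNS) accepts the same list of dicts. Collecting failures instead of stopping at the first bad file is a choice made for this sketch; re-raising may suit a smaller batch better.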