I have a directory of XML files, and I need to extract four values from each file and store them in a DataFrame/CSV.
The problem is that some of the data I need sits in tags that repeat (e.g., <PathName> appears several times), so I'm not sure of the best way to do this. I could extract by exact line number, since the position appears consistent across the files I have seen, but I am not certain that will always hold, so that approach is too brittle.
<?xml version="1.0" encoding="utf-8"?>
<BxfMessage xsi:schemaLocation="http://smpte-ra.org/schemas/2021/2019/BXF BxfSchema.xsd" id="jffsdfs" dateTime="2023-02-02T20:11:38Z" messageType="Info" origin="url" originType="Delivery" userName="ABC Corp User" destination=" System" xmlns="http://sffe-ra.org/schema/1999/2023/BXF" xmlns:xsi="http://www.w9.org/4232/XMLSchema-instance">
<BxfData action="Spotd">
<Content timestamp="2023-02-02T20:11:38Z">
<NonProgramContent>
<Details>
<SpotType>Paid</SpotType>
<SpotType>Standard</SpotType>
<Spotvertiser>
<SpotvertiserName>Spot Plateau</SpotvertiserName>
</Spotvertiser>
<Agency>
<AgencyName>Spot Plateau</AgencyName>
</Agency>
<Product>
<Name></Name>
<BrandName>zzTop</BrandName>
<DirectResponse>
<PhoneNo></PhoneNo>
<PCode></PCode>
<DR_URL></DR_URL>
</DirectResponse>
</Product>
</Details>
<ContentMetSpotata>
<ContentId>
<BHGXId idType="CISC" auth="Agency">AAAA1111999Z</BHGXId>
</ContentId>
<Name>Pill CC Dutch</Name>
<Policy>
<PlatformType>Spotcast</PlatformType>
</Policy>
<Media>
<BaseBand>
<Audio VO="true">
<AnalogAudio primAudio="false" />
<DigitalAudio>
<MPEGLayerIIAudio house="false" audioId="1" dualMono="false" />
</DigitalAudio>
</Audio>
<Video withlate="false" sidebend="false">
<Format>1182v</Format>
<CCs>true</CCs>
</Video>
<AccessServices>
<AudioDescription_DVS>false</AudioDescription_DVS>
</AccessServices>
<QC>Passed QC (AAAA1111103H )</QC>
</BaseBand>
<MediaLocation sourceType="Primary">
<Location>
<AssetServer PAA="true" FTA="true">
<PathName>zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb.mp4</PathName>
</AssetServer>
</Location>
<SOM>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SOM>
<Duration>
<SmpteDuration>
<SmpteTimeCode>00:00:30;00</SmpteTimeCode>
</SmpteDuration>
</Duration>
</MediaLocation>
<MediaLocation sourceType="Proxy" qualifer="Low-res">
<Location>
<AssetServer PAA="true" FTA="true">
<PathName>https://app.url.com/DMM/DL/wew52f</PathName>
</AssetServer>
</Location>
<SOM>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SOM>
<Duration>
<SmpteDuration>
<SmpteTimeCode>00:00:30;00</SmpteTimeCode>
</SmpteDuration>
</Duration>
</MediaLocation>
<MediaLocation sourceType="Preview" qualifer="Thumbnail">
<Location>
<AssetServer PAA="true" FTA="true">
<PathName>https://f9-int-5.rainxyz.com/url.com/media/t43fs/423gs-389a-40a4.jpg?inline</PathName>
</AssetServer>
</Location>
<SOM>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SOM>
<Duration>
<SmpteDuration>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SmpteDuration>
</Duration>
</MediaLocation>
</Media>
</ContentMetSpotata>
</NonProgramContent>
</Content>
</BxfData>
</BxfMessage>
Is there a more flexible method so that I can get consistent output like:
FileName Brand ID URL
zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb zzTop AAAA1111999Z https://app.url.com/DMM/DL/wew52f
zzTap_zzTab_BAAA1111999Z_30s_Pill_aa-cc zzTab BAAA1111999Z https://app.url.com/DMM/DL/wew52c
zzTap_zzTan_CAAA1111999Z_30s_Pill_aa-dd zzTan CAAA1111999Z https://app.url.com/DMM/DL/wew523
zzTap_zzTon_DAAA1111999Z_30s_Pill_aa-zz zzTon DAAA1111999Z https://app.url.com/DMM/DL/wew52y
CodePudding user response:
What does your code look like? Here is my attempt.
import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("zzTab.xml")
root = tree.getroot()

# All elements are in the document's default namespace, so every tag carries this prefix.
ns = "{http://sffe-ra.org/schema/1999/2023/BXF}"
list_of_interest = [f"{ns}PathName", f"{ns}BHGXId", f"{ns}BrandName"]

PathName_dir_list = []
PathName_file_list = []
BHGXId_list = []
BrandName_list = []

for elem in root.iter():
    #print(elem.tag, elem.text)
    if elem.tag in list_of_interest:
        text = elem.text or ""  # elem.text is None for empty elements
        # PathName values without '.mp4' are the proxy/thumbnail URLs
        if elem.tag == f"{ns}PathName" and '.mp4' not in text:
            #print("Dir:", text)
            PathName_dir_list.append(text)
        # the PathName containing '.mp4' is the media file name
        if elem.tag == f"{ns}PathName" and '.mp4' in text:
            #print("File:", text)
            PathName_file_list.append(text)
        if elem.tag == f"{ns}BHGXId":
            #print("ID", text)
            BHGXId_list.append(text)
        if elem.tag == f"{ns}BrandName":
            print("Brand", text)
            BrandName_list.append(text)

# zip() stops at the shortest list; in the sample this pairs the first URL (the proxy) with the file
t = zip(PathName_dir_list, PathName_file_list, BHGXId_list, BrandName_list)
list_of_tuples = list(t)
df = pd.DataFrame(list_of_tuples, columns=['Path', 'File', 'ID', 'Brand'])
df.to_csv('file_list.csv')
print(df)
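As an alternative to checking every element's text, the repeated <PathName> tags can be told apart by the sourceType attribute of their enclosing <MediaLocation>. The following is a minimal sketch using only the standard library's limited XPath support. The function name extract_fields and the helper text_at are made up for illustration, and it assumes every file uses the same namespace URI and the same Primary/Proxy attribute values as the sample.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute of the sample <BxfMessage>.
NS = {"bxf": "http://sffe-ra.org/schema/1999/2023/BXF"}


def extract_fields(xml_text):
    """Return (file_name, brand, content_id, proxy_url) for one document.

    Hypothetical helper: any field that cannot be found comes back as None.
    """
    root = ET.fromstring(xml_text)

    def text_at(path):
        # findtext returns the default (None) when the path matches nothing.
        value = root.findtext(path, default=None, namespaces=NS)
        return value.strip() if value else None

    # Select PathName by the sourceType of its enclosing MediaLocation,
    # not by its position in the file.
    primary = text_at(".//bxf:MediaLocation[@sourceType='Primary']//bxf:PathName")
    proxy = text_at(".//bxf:MediaLocation[@sourceType='Proxy']//bxf:PathName")
    brand = text_at(".//bxf:BrandName")
    content_id = text_at(".//bxf:BHGXId")

    # Drop the extension, as in the desired output table.
    file_name = primary.rsplit(".", 1)[0] if primary else None
    return file_name, brand, content_id, proxy
```

To use it on a file, pass the file's contents, for example extract_fields(open("zzTab.xml", encoding="utf-8").read()). Because the lookup is by attribute, the order of the <MediaLocation> blocks no longer matters.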
CodePudding user response:
To parse a single XML file with BeautifulSoup, you can use this example:
from bs4 import BeautifulSoup

def get_info(xml_file):
    with open(xml_file, 'r') as f_in:
        # the 'xml' parser requires lxml to be installed
        soup = BeautifulSoup(f_in.read(), 'xml')
    # the PathName containing '.mp4' is the media file; strip the extension
    file_name = soup.find(lambda tag: tag.name == 'PathName' and '.mp4' in tag.text).text.rsplit('.mp4', maxsplit=1)[0]
    # the URL is the PathName inside the element with sourceType="Proxy"
    url = soup.select_one('[sourceType="Proxy"] PathName').text
    brand_name = soup.select_one('BrandName').text
    id_ = soup.select_one('BHGXId').text
    return file_name, brand_name, id_, url

print(get_info('your_file.xml'))
Prints:
('zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb', 'zzTop', 'AAAA1111999Z', 'https://app.url.com/DMM/DL/wew52f')
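Since the question is about a whole directory, a per-file function like the one above can be applied to every file and the results written out as one CSV. This is only a sketch: collect_to_csv is a made-up name, the extractor is passed in as the get_info argument and is assumed to return a (file name, brand, ID, URL) tuple, and all files are assumed to end in .xml and sit directly in one folder.

```python
import csv
from pathlib import Path


def collect_to_csv(xml_dir, out_csv, get_info):
    """Run get_info on every *.xml file in xml_dir and write one CSV row per file.

    Hypothetical helper: get_info is any callable that takes a file path and
    returns a (file_name, brand, id, url) tuple.
    """
    rows = []
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        try:
            rows.append(get_info(str(xml_path)))
        except (AttributeError, TypeError) as exc:
            # e.g. a missing tag, where a lookup returned None; skip that file.
            print(f"Skipping {xml_path.name}: {exc}")

    with open(out_csv, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["FileName", "Brand", "ID", "URL"])
        writer.writerows(rows)
    return rows
```

A call might look like collect_to_csv("xml_files", "file_list.csv", get_info), where "xml_files" is a placeholder directory name. If a DataFrame is preferred, the returned rows can be passed to pd.DataFrame(rows, columns=["FileName", "Brand", "ID", "URL"]) instead.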