I have a directory of XML files, and I need to extract four values from each file and store them in a DataFrame/CSV.
The problem is that some of the data I need sits in tags that are repeated several times per file (e.g., <PathName>), so I'm not sure of the best way to do this. I could extract by exact line number, which looks consistent across the files I have seen, but I am not certain that will always hold, so that approach is too brittle.
<?xml version="1.0" encoding="utf-8"?>
<BxfMessage xsi:schemaLocation="http://smpte-ra.org/schemas/2021/2019/BXF BxfSchema.xsd" id="jffsdfs" dateTime="2023-02-02T20:11:38Z" messageType="Info" origin="url" originType="Delivery" userName="ABC Corp User" destination=" System" xmlns="http://sffe-ra.org/schema/1999/2023/BXF" xmlns:xsi="http://www.w9.org/4232/XMLSchema-instance">
<BxfData action="Spotd">
<Content timestamp="2023-02-02T20:11:38Z">
<SpotvertiserName>Spot Plateau</SpotvertiserName>
<AgencyName>Spot Plateau</AgencyName>
<BHGXId idType="CISC" auth="Agency">AAAA1111999Z</BHGXId>
<Name>Pill CC Dutch</Name>
<Audio VO="true">
<AnalogAudio primAudio="false" />
<MPEGLayerIIAudio house="false" audioId="1" dualMono="false" />
<Video withlate="false" sidebend="false">
<QC>Passed QC (AAAA1111103H )</QC>
<MediaLocation sourceType="Primary">
<AssetServer PAA="true" FTA="true">
<MediaLocation sourceType="Proxy" qualifer="Low-res">
<AssetServer PAA="true" FTA="true">
<MediaLocation sourceType="Preview" qualifer="Thumbnail">
<AssetServer PAA="true" FTA="true">
Is there a more flexible method so that I can get consistent output like:
FileName Brand ID URL
zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb zzTop AAAA1111999Z https://app.url.com/DMM/DL/wew52f
zzTap_zzTab_BAAA1111999Z_30s_Pill_aa-cc zzTab BAAA1111999Z https://app.url.com/DMM/DL/wew52c
zzTap_zzTan_CAAA1111999Z_30s_Pill_aa-dd zzTan CAAA1111999Z https://app.url.com/DMM/DL/wew523
zzTap_zzTon_DAAA1111999Z_30s_Pill_aa-zz zzTon DAAA1111999Z https://app.url.com/DMM/DL/wew52y
CodePudding user response:
What does your code look like? Here is my attempt.
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("zzTab.xml")
root = tree.getroot()
ns = "{http://sffe-ra.org/schema/1999/2023/BXF}"
list_of_interest = [f"{ns}PathName", f"{ns}BHGXId", f"{ns}BrandName"]
PathName_dir_list = []
PathName_file_list = []
BHGXId_list = []
BrandName_list = []
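for elem in root.iter():
    #print(elem.tag, elem.text)
    if elem.tag in list_of_interest:
        text = elem.text or ""  # guard against empty elements
        # PathName is repeated: tell directory entries from .mp4 file entries by content
        if elem.tag == f"{ns}PathName" and '.mp4' not in text:
            PathName_dir_list.append(text)
        if elem.tag == f"{ns}PathName" and '.mp4' in text:
            PathName_file_list.append(text)
        if elem.tag == f"{ns}BHGXId":
            #print("ID", elem.text)
            BHGXId_list.append(text)
        if elem.tag == f"{ns}BrandName":
            BrandName_list.append(text)
            print("Brand", elem.text)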
t = zip(PathName_dir_list, PathName_file_list, BHGXId_list, BrandName_list)
list_of_tuples = list(t)
df = pd.DataFrame(list_of_tuples, columns=['Path', 'File', 'ID', 'Brand'])
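A possible variation that stays within the standard library: rather than walking every element and sorting by tag, ElementTree's limited XPath support can select elements through a namespace prefix and through an ancestor's attribute, so nothing depends on element order or line numbers. It also returns one tuple per file, so the result does not rely on four separate lists staying the same length for zip(). This is only a sketch under assumptions: the sample document is hypothetical (the nesting of PathName and BrandName is guessed, because the XML in the question is truncated), and NS, SAMPLE and extract are names made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute of the sample in the question.
NS = {"bxf": "http://sffe-ra.org/schema/1999/2023/BXF"}

# Hypothetical, heavily trimmed document: where PathName and BrandName sit
# is an assumption, since the sample in the question is cut off.
SAMPLE = """<BxfMessage xmlns="http://sffe-ra.org/schema/1999/2023/BXF">
  <BxfData action="Spotd">
    <Content>
      <BrandName>zzTop</BrandName>
      <BHGXId idType="CISC" auth="Agency">AAAA1111999Z</BHGXId>
      <MediaLocation sourceType="Primary">
        <AssetServer PAA="true" FTA="true">
          <PathName>zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb.mp4</PathName>
        </AssetServer>
      </MediaLocation>
      <MediaLocation sourceType="Proxy" qualifer="Low-res">
        <AssetServer PAA="true" FTA="true">
          <PathName>https://app.url.com/DMM/DL/wew52f</PathName>
        </AssetServer>
      </MediaLocation>
    </Content>
  </BxfData>
</BxfMessage>"""


def extract(root):
    """Return (file_name, brand, id, url) from a parsed BxfMessage root."""

    def text_of(path):
        # find() returns None when nothing matches, so guard before .text
        node = root.find(path, NS)
        return node.text.strip() if node is not None and node.text else None

    # First PathName whose text ends in .mp4, wherever it sits in the tree;
    # the slice drops the extension.
    file_name = next(
        (
            p.text.strip()[: -len(".mp4")]
            for p in root.iterfind(".//bxf:PathName", NS)
            if p.text and p.text.strip().endswith(".mp4")
        ),
        None,
    )
    brand = text_of(".//bxf:BrandName")
    id_ = text_of(".//bxf:BHGXId")
    # Select the URL by its ancestor's attribute instead of by position.
    url = text_of(".//bxf:MediaLocation[@sourceType='Proxy']//bxf:PathName")
    return file_name, brand, id_, url


print(extract(ET.fromstring(SAMPLE)))
```

For a file on disk, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE). Missing elements come back as None instead of raising, which makes incomplete files easy to spot in the final table.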
CodePudding user response:
To parse a single XML file with BeautifulSoup, you can use this example:
from bs4 import BeautifulSoup

def get_info(xml_file):
    # The 'xml' parser requires the lxml package to be installed
    with open(xml_file, 'r') as f_in:
        soup = BeautifulSoup(f_in.read(), 'xml')
    # File name: the first PathName containing '.mp4', with the extension dropped
    file_name = soup.find(lambda tag: tag.name == 'PathName' and '.mp4' in tag.text).text.rsplit('.mp4', maxsplit=1)[0]
    # URL: the PathName nested under the element with sourceType="Proxy"
    url = soup.select_one('[sourceType="Proxy"] PathName').text
    brand_name = soup.select_one('BrandName').text
    id_ = soup.select_one('BHGXId').text
    return file_name, brand_name, id_, url
For the file in the question, get_info returns:
('zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb', 'zzTop', 'AAAA1111999Z', 'https://app.url.com/DMM/DL/wew52f')
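The question asks about a whole directory, while get_info above handles one file. One way to cover the rest, sketched here with only the standard library, is to loop over the directory and write one CSV row per file. The function name write_summary, its parameters, and the directory and CSV names in the usage line are made up for this example; any per-file function that takes a path and returns the four values, such as get_info, can be passed in.

```python
import csv
from pathlib import Path


def write_summary(xml_dir, csv_path, get_info):
    """Run get_info on every *.xml file in xml_dir; write one CSV row per file."""
    rows = []
    # sorted() gives a stable row order regardless of the file system
    for xml_file in sorted(Path(xml_dir).glob("*.xml")):
        try:
            rows.append(get_info(xml_file))
        except (AttributeError, OSError) as exc:
            # A lookup that returns None for a missing tag makes `.text`
            # raise AttributeError; report the file and keep going.
            print(f"skipped {xml_file.name}: {exc}")
    with open(csv_path, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["FileName", "Brand", "ID", "URL"])
        writer.writerows(rows)
    return rows
```

A call might look like write_summary("xml_files", "summary.csv", get_info). If a DataFrame is preferred, the returned rows can be passed to pd.DataFrame(rows, columns=["FileName", "Brand", "ID", "URL"]) and saved with to_csv("summary.csv", index=False).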