I am new to Python, so please bear with me. I have multiple XML files in the following format, and I would like to extract certain tags from them and export the values to a single CSV file.
Here is an example of the XML (c:\xml\1.xml):
<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="emotionStyleSheet_template.xsl"?>
<EmotionReport>
    <VersionInformation>
        <Version>8.2.0</Version>
    </VersionInformation>
    <DateTime>
        <Date>18-10-2021</Date>
        <Time>14-12-26</Time>
    </DateTime>
    <SourceInformation>
        <File>
            <FilePath>//nas/emotionxml</FilePath>
            <FileName>file001.mxf</FileName>
            <FileSize>9972536969</FileSize>
            <FileAudioInformation>
                <AudioDuration>1345.0</AudioDuration>
                <SampleRate>48000</SampleRate>
                <NumChannels>8</NumChannels>
                <BitsPerSample>24</BitsPerSample>
                <AudioSampleGroups>64560000</AudioSampleGroups>
                <NumStreams>8</NumStreams>
                <Container>Undefined Sound</Container>
                <Description>IMC Nexio
                </Description>
                <StreamInformation>
                    <Stream>
                        <StreamNumber>1</StreamNumber>
                        <NumChannelsInStream>1</NumChannelsInStream>
                        <Channel>
                            <ChannelNumber>1</ChannelNumber>
                            <ChannelEncoding>PCM</ChannelEncoding>
                        </Channel>
                    </Stream>
                    <Stream>
                        <StreamNumber>2</StreamNumber>
                        <NumChannelsInStream>1</NumChannelsInStream>
                        <Channel>
                            <ChannelNumber>1</ChannelNumber>
                            <ChannelEncoding>PCM</ChannelEncoding>
                        </Channel>
                    </Stream>
                </StreamInformation>
                <FileTimecodeInformation>
                    <FrameRate>25.00</FrameRate>
                    <DropFrame>false</DropFrame>
                    <StartTimecode>00:00:00:00</StartTimecode>
                </FileTimecodeInformation>
            </FileAudioInformation>
        </File>
    </SourceInformation>
</EmotionReport>
Expected output (EmotionData.csv):
,Date,Time,FileName,Description,FileSize,FilePath
0,18-10-2021,14-12-26,file001.mxf,IMC Nexio,9972536969,//nas/emotionxml
1,13-10-2021,08-12-26,file002.mxf,IMC Nexio,3566536770,//nas/emotionxml
2,03-10-2021,02-09-21,file003.mxf,IMC Nexio,46357672,//nas/emotionxml
....
Here is the code I've written based on what I've learned from online resources (emotion_xml_parser.py):
import xml.etree.ElementTree as ET
import glob2
import pandas as pd
cols = ["Date", "Time", "FileName", "Description", "FileSize", "FilePath"]
rows = []
for filename in glob2.glob(r'C:\xml\*.xml'):
    xmlData = ET.parse(filename)
    rootXML = xmlData.getroot()
    for i in rootXML:
        Date = i.findall("Date").text
        Time = i.findall("Time").text
        FileName = i.findall("FileName").text
        Description = i.findall("Description").text
        FileSize = i.findall("FileSize").text
        FilePath = i.findall("FilePath").text
        row.append({"Date": Date,
                    "Time": Time,
                    "FileName": FileName,
                    "Description": Description,
                    "FileSize": FileSize,
                    "FilePath": FilePath,})
df = pd.DataFrame(rows,columns = cols)
# Write dataframe to csv
df.to_csv("EmotionData.csv")
I am receiving the following error when running the script:
File "c:\emtion_xml_parser.py", line 14, in <module>
Date = i.findall("Date").text
AttributeError: 'list' object has no attribute 'text'
TIA!
CodePudding user response:
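The error occurs because findall() returns a list of matching elements, and a list has no .text attribute. findall("Date") also only searches the direct children of the element it is called on, so it would not reach the nested tags anyway. A better approach is to use findtext() and give the full path to each element you need, for example: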
import xml.etree.ElementTree as ET
import glob2
import pandas as pd
cols = ["Date", "Time", "FileName", "Description", "FileSize", "FilePath"]
rows = []
for filename in glob2.glob(r'*.xml'):
    xmlData = ET.parse(filename)
    root = xmlData.getroot()
    # Each path is relative to the root <EmotionReport> element
    row = {
        'Date' : root.findtext('DateTime/Date'),
        'Time' : root.findtext('DateTime/Time'),
        'FileName' : root.findtext('SourceInformation/File/FileName'),
        # The description text ends with a line break in the XML, so strip it
        'Description' : root.findtext('SourceInformation/File/FileAudioInformation/Description').strip(),
        'FileSize' : root.findtext('SourceInformation/File/FileSize'),
        'FilePath' : root.findtext('SourceInformation/File/FilePath')
    }
    rows.append(row)
df = pd.DataFrame(rows, columns=cols)
# Write dataframe to csv
df.to_csv("EmotionData.csv")
Giving you:
,Date,Time,FileName,Description,FileSize,FilePath
0,18-10-2021,14-12-26,file001.mxf,IMC Nexio,9972536969,//nas/emotionxml
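One thing to watch for: findtext() returns None when an element is missing, so calling .strip() on the result would raise an AttributeError for any file that lacks that tag. Below is a minimal sketch of one way to guard against that, assuming you would rather get an empty cell than a crash. The names FIELD_PATHS and extract_row are just made up for this example, and the inline sample is a trimmed copy of your XML with <Description> left out on purpose.

```python
import xml.etree.ElementTree as ET

# Output column -> element path relative to the <EmotionReport> root.
# (FIELD_PATHS is a hypothetical name used only in this sketch.)
FIELD_PATHS = {
    "Date": "DateTime/Date",
    "Time": "DateTime/Time",
    "FileName": "SourceInformation/File/FileName",
    "Description": "SourceInformation/File/FileAudioInformation/Description",
    "FileSize": "SourceInformation/File/FileSize",
    "FilePath": "SourceInformation/File/FilePath",
}


def extract_row(root):
    """Return one CSV row as a dict; missing elements become empty strings."""
    # With default="", findtext() returns a string even when nothing matches,
    # so .strip() is always safe to call.
    return {col: root.findtext(path, default="").strip()
            for col, path in FIELD_PATHS.items()}


# Trimmed sample document with no <Description> element (assumption for the demo).
sample = """<EmotionReport>
  <DateTime><Date>18-10-2021</Date><Time>14-12-26</Time></DateTime>
  <SourceInformation><File>
    <FilePath>//nas/emotionxml</FilePath>
    <FileName>file001.mxf</FileName>
    <FileSize>9972536969</FileSize>
  </File></SourceInformation>
</EmotionReport>"""

row = extract_row(ET.fromstring(sample))
print(row)
```

In the loop you would then call rows.append(extract_row(root)) in place of building the dictionary by hand. Keeping the paths in one dictionary also means adding another column is a one-line change.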