1.I have an xml file as follows:
<BIBDS>
<METADATA-TABLE Resource="ACTIVE">
<COLUMNS> Metadata SystemID Standard </COLUMNS>
<DATA> ydfgfbcs12 dq_EMAIL mail </DATA>
<DATA> asiuertb45 ss_FIRST_NAME FirstName </DATA>
<DATA> pojkeu12er fg_LAST_NAME LastName </DATA>
</METADATA-TABLE>
<METADATA-TABLE Resource="OFFICIAL">
<COLUMNS> Metadata SystemID Standard </COLUMNS>
<DATA> thsgdqw9uq dk_EMAIL mail </DATA>
<DATA> okjnsdqw12 kl_FIRST_NAME FirstName </DATA>
<DATA> tgetiq34er ll_LAST_NAME LastName </DATA>
</METADATA-TABLE>
</BIBDS>
this the code I've come up with so far:
import xml.etree.ElementTree as et
tree = et.parse('filepath')
root = tree.getroot()
column_metadata_table = []
for mt in root.findall('METADATA-TABLE'):
columntable = mt.find('COLUMNS').text
column_metadata_table.append(columntable.split('\t'))
break
data_metadata_table = []
for mt in tree.iter('METADATA-TABLE'):
datatable = mt.findall("DATA")
for dat in datatable:
data_metadata_table.append(dat.text.split('\t'))
df_metadata_table = pd.DataFrame(data_metadata_table,columns = column_metadata_table)
this will give me an output with column names from (column-tag) and data in it with from (data-tag) but I need another column with the value of resource in it with column name as resource
expected output as dataframe:
Metadata SystemID Standard Resource
ydfgfbcs12 dq_EMAIL mail ACTIVE
asiuertb45 ss_FIRST_NAME FirstName ACTIVE
pojkeu12er fg_LAST_NAME LastName ACTIVE
thsgdqw9uq dk_EMAIL mail OFFICIAL
okjnsdqw12 kl_FIRST_NAME FirstName OFFICIAL
tgetiq34er ll_LAST_NAME LastName OFFICIAL
CodePudding user response:
Tweaking OP code a bit by adding missing column name and value to relevant lists:
import xml.etree.ElementTree as et
import pandas as pd
tree = et.parse('tmp.xml')
root = tree.getroot()
column_metadata_table = []
for mt in root.findall('METADATA-TABLE'):
columntable = mt.find('COLUMNS').text
column_metadata_table = columntable.strip().split('\t')
break
column_metadata_table.append('RESOURCE')
data_metadata_table = []
for mt in root.findall('METADATA-TABLE'):
datatable = mt.findall("DATA")
for dat in datatable:
td = dat.text.strip().split('\t')
td.append(mt.attrib['Resource'])
data_metadata_table.append(td)
#data_metadata_table
print(column_metadata_table)
print(data_metadata_table)
df_metadata_table = pd.DataFrame(data_metadata_table,columns = column_metadata_table)
print(df_metadata_table)
Result
Metadata SystemID Standard RESOURCE
0 ydfgfbcs12 dq_EMAIL mail ACTIVE
1 asiuertb45 ss_FIRST_NAME FirstName ACTIVE
2 pojkeu12er fg_LAST_NAME LastName ACTIVE
3 thsgdqw9uq dk_EMAIL mail OFFICIAL
4 okjnsdqw12 kl_FIRST_NAME FirstName OFFICIAL
5 tgetiq34er ll_LAST_NAME LastName OFFICIAL
CodePudding user response:
You can do it efficiently with BeautifulSoup:
from bs4 import BeautifulSoup
import pandas as pd
import re
with open('filepath') as f:
soup = BeautifulSoup(f.read())
whitespace_rx = re.compile(r'\s ')
df = pd.concat([
pd.DataFrame(
data=[whitespace_rx.split(row.text.strip()) for row in mt.find_all('data')],
columns=whitespace_rx.split(mt.find('columns').text.strip())
).assign(Resource=mt.get('resource'))
for mt in soup.find_all('metadata-table')
]).reset_index(drop=True)
Output:
>>> df
Metadata SystemID Standard Resource
0 ydfgfbcs12 dq_EMAIL mail ACTIVE
1 asiuertb45 ss_FIRST_NAME FirstName ACTIVE
2 pojkeu12er fg_LAST_NAME LastName ACTIVE
3 thsgdqw9uq dk_EMAIL mail OFFICIAL
4 okjnsdqw12 kl_FIRST_NAME FirstName OFFICIAL
5 tgetiq34er ll_LAST_NAME LastName OFFICIAL
CodePudding user response:
The code below seems to work (Note that the code does not use any external library for parsing)
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<BIBDS>
<METADATA-TABLE Resource="ACTIVE">
<COLUMNS> Metadata SystemID Standard </COLUMNS>
<DATA> ydfgfbcs12 dq_EMAIL mail </DATA>
<DATA> asiuertb45 ss_FIRST_NAME FirstName </DATA>
<DATA> pojkeu12er fg_LAST_NAME LastName </DATA>
</METADATA-TABLE>
<METADATA-TABLE Resource="OFFICIAL">
<COLUMNS> Metadata SystemID Standard </COLUMNS>
<DATA> thsgdqw9uq dk_EMAIL mail </DATA>
<DATA> okjnsdqw12 kl_FIRST_NAME FirstName </DATA>
<DATA> tgetiq34er ll_LAST_NAME LastName </DATA>
</METADATA-TABLE>
</BIBDS>'''
LOOKUP = {0:'Metadata',1:'SystemID',2:'Standard'}
root = ET.fromstring(xml)
df_data = []
for meta in root.findall('.//METADATA-TABLE'):
res = meta.attrib['Resource']
for data in meta.findall('DATA'):
entry = {'Resource':res}
elements = data.text.split()
for idx,element in enumerate(elements):
entry[LOOKUP[idx]] = element
df_data.append(entry)
df = pd.DataFrame(df_data)
print(df)
output
Resource Metadata SystemID Standard
0 ACTIVE ydfgfbcs12 dq_EMAIL mail
1 ACTIVE asiuertb45 ss_FIRST_NAME FirstName
2 ACTIVE pojkeu12er fg_LAST_NAME LastName
3 OFFICIAL thsgdqw9uq dk_EMAIL mail
4 OFFICIAL okjnsdqw12 kl_FIRST_NAME FirstName
5 OFFICIAL tgetiq34er ll_LAST_NAME LastName