I'm trying to create a data frame from multiple nested xml files and append the data to a single data frame. I know the structure of the data frame and have defined it.
tree_list = []
details = ['FirstName','LastName','City','Country']
for file in bucket_list:
obj = s3.Object(s3_bucket_name,file)
data = (obj.get()['Body'].read())
tree_list.append(ET.ElementTree(ET.fromstring(data)))
def parse_XML(list_of_trees, df_cols):
for tree in tree_list:
xroot = tree.getroot()
rows = []
for node in xroot:
res = []
for el in df_cols[0:]:
if node is not None and node.find(el) is not None:
res.append(node.find(el).text)
else:
res.append(None)
rows.append({df_cols[i-1]: res[i-1]
for i, _ in enumerate(df_cols)})
out_df = pd.DataFrame(rows, columns=df_cols)
return out_df
parse_XML(tree_list,details)
In my output data frame, I'm getting the information of the last file read and a couple of blank rows like the following:
FirstName LastName City Country
Ted Mosbey Washington USA
None None None None
None None None None
What changes should be made in the code to read all the files, append to the data frame and remove the unnecessary rows? Any suggestion to process the files efficiently is appreciated.
XML example:
<PD>
<Clt>
<PType>xxxx</PType>
<PNumber>xxxxx</PNumber>
<UID>xxxx</UID>
<TEfd>xxxxx</TEfd>
<TExd>xxxxxx</TExd>
<DID>xxxxx</DID>
<CType>xxxxx</CType>
<FirstName>Ted</FirstName>
<MiddleName></MiddleName>
<LastName>Mosbey</LastName>
<MailingAddrLocation>Home</MailingAddrLocation>
<AddressLine1>3435</AddressLine1>
<AddressLine2>Columbia RD</AddressLine2>
<AddressLine3></AddressLine3>
<City>Washington</City>
<State>DC</State>
<ZipCode>20009</ZipCode>
<Country>USA</Country>
<Pr>
<PrType>xxxxx</PrType>
<PrName>xxxxxx</PrName>
<PrID>xxxxxx</PrID>
</Pr>
</Clt>
</PD>
CodePudding user response:
So now when I have your data sample, I tested this and it works for me just as I think you want it to be:
def parse_XML(list_of_trees, df_cols):
def get_el(el_list):
if len(el_list) > 1:
return [el_text.text for el_text in el_list]
else:
return el_list[0].text
rows = []
for tree in list_of_trees:
xroot = tree.getroot()
for node in xroot:
res = []
for el in df_cols[0:]:
if node is not None and node.findall(f".//{el}") is not None:
res.append(get_el(node.findall(f".//{el}")))
rows.append(res)
out_df = pd.DataFrame(rows, columns=df_cols)
return out_df