Home > Blockchain >  Generate XML file from pandas with python
Generate XML file from pandas with python

Time:02-01

I have a pandas dataframe that looks like this :

Date             Min       Max         C
 
01.01.2003     01.01.2003  Nan         NaT 
02.01.2003     Nan         Nan         NaT    
03.01.2003     Nan         Nan         NaT
04.01.2003     Nan         04.01.2003  NaT 
06.01.2003     06.01.2003  Nan         NaT
07.01.2003     Nan         Nan         NaT
08.01.2003     Nan         08.01.1993  NaT
09.01.2003     Nan         Nan         09.01.2003
14.01.2003     14.01.2003  Nan         NaT
15.01.2003     Nan         Nan         NaT           
16.01.2003     Nan         16.01.2003  NaT
29.01.2003     Nan         Nan         29.01.2003 

And I want to get a XML data that looks like this

-<Or>
  -<Date Source="test" test2="test3">
   <Min>01.01.2003</Min>
   <Max>01.04.2003</Max>
  </Date>
  -<Date Source="test" test2="test3">
   <Min>06.01.2003</Min>
   <Max>08.01.2003</Max>
  </Date>
  -<Date Source="test" test2="test3">
   <Min>14.01.2003</Min>
   <Max>16.01.2003</Max>
  </Date>
  -<Date Source="="test" test2="test3">
    09.01.2003 29.01.2023
   <Dates></Dates>
</Date>

This is the code :

data = gfg.Element("Or")
for idx, row in data.iterrows():
    element1 = gfg.SubElement(data, "Test")
    element2 = gfg.SubElement(data, "Test2")                
    s_elem1 = gfg.SubElement(element1, 'Min')
    s_elem2 = gfg.SubElement(element1, 'Max')
    s_elem1.text=row['Min']
    s_elem2.text=row['Max']
    
    b_xml=gfg.tostring(data)

Because it loops over row which then there will be max or min would be empty/blank space. what should I modified so I can get like the example above ? thank

CodePudding user response:

As commented, consider aggregating data before exporting to XML:

Data

from io import StringIO
import pandas as pd

txt = '''\
Date             Min       Max         C
01.01.2003     01.01.2003  NaT         NaT 
02.01.2003     NaT         NaT         NaT    
03.01.2003     NaT         NaT         NaT
04.01.2003     NaT         04.01.2003  NaT 
06.01.2003     06.01.2003  NaT         NaT
07.01.2003     NaT         NaT         NaT
08.01.2003     NaT         08.01.1993  NaT
09.01.2003     NaT         NaT         09.01.2003
14.01.2003     14.01.2003  NaT         NaT
15.01.2003     NaT         NaT         NaT           
16.01.2003     NaT         16.01.2003  NaT
29.01.2003     NaT         NaT         29.01.2003 
'''

with StringIO(txt) as f:
   dates_df = pd.read_csv(f, sep="\s ", parse_dates=["Min", "Max", "C"], dayfirst=True)

Cleaning / Grouping / Aggregation

agg_dates_df = (
    dates_df.dropna(how="all", axis="rows", subset=["Min", "Max"])
        .drop(["Date", "C"], axis="columns")
        .assign(
            Group = lambda df: pd.notnull(df["Min"]).astype('int').cumsum()
        )
        .groupby(["Group"])[["Min", "Max"]].max()
        .apply(lambda col: col.dt.strftime("%d.%m.%Y"), axis="columns")
        .assign(Source = "test", test2 = "test3")
)

print(agg_dates_df) 
#               Min         Max Source  test2
# Group                                      
# 1      01.01.2003  04.01.2003   test  test3
# 2      06.01.2003  08.01.1993   test  test3
# 3      14.01.2003  16.01.2003   test  test3

XML

output = dates_df.to_xml(
    index = False,
    root_name = "Or",
    row_name = "Date",
    attr_cols = ["Source", "test2"],
    elem_cols = ["Min", "Max"]
)

print(output)
# <?xml version='1.0' encoding='utf-8'?>
# <Or>
#   <Date Source="test" test2="test3">
#     <Min>01.01.2003</Min>
#     <Max>04.01.2003</Max>
#   </Date>
#   <Date Source="test" test2="test3">
#     <Min>06.01.2003</Min>
#     <Max>08.01.1993</Max>
#   </Date>
#   <Date Source="test" test2="test3">
#     <Min>14.01.2003</Min>
#     <Max>16.01.2003</Max>
#   </Date>
# </Or>
  • Related