Convert XML to CSV with Python

Time:04-15

Hello everyone,

In the following code, I am trying to convert the XML served at https://issat.ttn.tn/cu/export/akouda.php to a CSV file.

The code:

import requests
import pandas as pd
from html import unescape

url = "https://issat.ttn.tn/cu/export/akouda.php"

# Unescape HTML entities, then slice off the first 5 and last 6 characters around the XML
s = unescape(requests.get(url).text)[5:-6]

# Select every child of <phases> together with every <time> node
df = pd.read_xml(s, xpath="//phases/* | //time")
# df["value"] = df["value"].ffill()
df.to_csv('output0.csv')

Here are some of the results:

,value,phases,id,act_energy,react_energy,current_inst,voltage_inst,power_inst,power_fact,thd
0,2022-04-14 15:45:00,,,,,,,,,
1,,,0.0,0.3000000000001819,0.4324445747717669,2.0,241.7,0.27,0.57,27.39
2,,,1.0,0.0,0.0,13.06,242.5,0.66,0.2,22.69
3,,,2.0,0.0,0.0,1.07,243.7,0.15,0.58,48.05
4,2022-04-14 15:30:00,,,,,,,,,
5,,,0.0,0.2999999999999545,0.108885460271677,1.02,240.4,0.23,0.94,23.7
6,,,1.0,0.0,0.0,14.54,241.0,0.86,0.24,23.99
7,,,2.0,0.0,0.0,1.07,243.5,0.15,0.59,48.08
8,2022-04-14 15:15:00,,,,,,,,,
9,,,0.0,0.3999999999998636,0.5618044649492236,0.7,243.1,0.1,0.58,42.46
10,,,1.0,0.0,0.0,17.82,241.9,1.99,0.46,33.59
11,,,2.0,0.0,0.0,1.08,246.3,0.15,0.58,51.09
12,2022-04-14 15:00:00,,,,,,,,,
13,,,0.0,0.6000000000001364,0.8427066974243144,0.71,241.7,0.1,0.58,44.02
14,,,1.0,0.0,0.0,18.74,240.5,2.21,0.49,31.3
15,,,2.0,0.0,0.0,1.08,245.3,0.15,0.58,51.77

I need to:

  1. remove the rows (such as 0, 4, 8 and 12) that contain a date but no readings;
  2. keep only the rows where id = 1;
  3. remove the phases column.

Can anyone help, please?

CodePudding user response:

Try:

import requests
import pandas as pd
from html import unescape

url = "https://issat.ttn.tn/cu/export/akouda.php"

s = unescape(requests.get(url).text)[5:-6]

df = pd.read_xml(s, xpath="//phases/* | //time")

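# Copy each timestamp down onto the phase rows that follow it
df["value"] = df["value"].ffill()
df = df.drop(columns="phases")
# Optional: drop the timestamp-only rows (not needed if you filter on id == 1 below)
# df = df[~df.isna().any(axis=1)]
print(df[df["id"] == 1])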

Prints:

                    value   id  act_energy  react_energy  current_inst  voltage_inst  power_inst  power_fact    thd
2     2022-04-14 23:15:00  1.0         0.0           0.0         12.06         241.0        0.83        0.28  22.56
6     2022-04-14 23:00:00  1.0         0.0           0.0         12.04         240.5        0.82        0.28  22.57
10    2022-04-14 22:45:00  1.0         0.0           0.0         12.04         240.2        0.82        0.28  22.56
14    2022-04-14 22:30:00  1.0         0.0           0.0         12.03         240.1        0.82        0.28  22.24
18    2022-04-14 22:15:00  1.0         0.0           0.0         12.01         240.1        0.82        0.28  22.52
22    2022-04-14 22:00:00  1.0         0.0           0.0         12.00         239.8        0.82        0.28  22.74
26    2022-04-14 21:45:00  1.0         0.0           0.0         11.96         239.9        0.82        0.28  22.58

...
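To see why the forward fill works, and to end up with the CSV file the question asks for, here is a minimal offline sketch. It rebuilds rows 0 to 3 of the question's output by hand (only a few columns) instead of downloading the feed. The helper name phase_rows is hypothetical and is not part of pandas.

```python
import pandas as pd

# Rows 0-3 of the question's output, rebuilt by hand (a subset of the columns).
raw = pd.DataFrame(
    {
        "value": ["2022-04-14 15:45:00", None, None, None],
        "phases": [None, None, None, None],
        "id": [None, 0.0, 1.0, 2.0],
        "current_inst": [None, 2.0, 13.06, 1.07],
        "voltage_inst": [None, 241.7, 242.5, 243.7],
    }
)


def phase_rows(df, phase_id):
    """Hypothetical helper: keep one phase's readings, tagged with their timestamp."""
    out = df.copy()
    # The timestamp sits on its own row, so copy it down onto the phase rows.
    out["value"] = out["value"].ffill()
    out = out.drop(columns="phases")
    # Timestamp-only rows have a missing id, so this filter drops them as well.
    return out[out["id"] == phase_id]


result = phase_rows(raw, 1)
# index=False keeps the row numbers out of the file; pass a path to write to disk.
csv_text = result.to_csv(index=False)
print(csv_text)
```

Calling result.to_csv("output0.csv", index=False) instead would write the same content to a file.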

CodePudding user response:

Consider running two read_xml calls with different xpath expressions, using attrs_only for the <time> nodes. Because the two results line up row for row (one <phase> with @id=1 per <time>), you can join them:

...
time_df = pd.read_xml(s, xpath="//time", attrs_only=True, names=["time"])
phase_df = pd.read_xml(s, xpath="//phase[@id=1]")

time_phase_df = time_df.join(phase_df)
time_phase_df
                     time  id  act_energy  ...  power_inst  power_fact    thd
0     2022-04-15 00:00:00   1           0  ...        0.84        0.28  22.35
1     2022-04-14 23:45:00   1           0  ...        0.83        0.28  23.16
2     2022-04-14 23:30:00   1           0  ...        0.83        0.28  22.43
3     2022-04-14 23:15:00   1           0  ...        0.83        0.28  22.56
4     2022-04-14 23:00:00   1           0  ...        0.82        0.28  22.57
                  ...  ..         ...  ...         ...         ...    ...
1289  2022-04-01 02:15:00   1           0  ...        0.69        0.25  22.70
1290  2022-04-01 02:00:00   1           0  ...        0.69        0.25  22.66
1291  2022-04-01 01:45:00   1           0  ...        0.69        0.25  22.46
1292  2022-04-01 01:30:00   1           0  ...        0.69        0.25  22.00
1293  2022-04-01 01:25:00   1           0  ...        0.69        0.25  22.34
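Note that join pairs the two frames by row position, so it relies on every <time> node containing exactly one phase with id=1. If that cannot be guaranteed, one alternative (a different technique from the one above) is to walk the tree with the standard library and pair each timestamp with its own phase. The sketch below assumes a layout inferred from the XPath expressions in this thread; the sample XML, the root tag and the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the real feed's layout is assumed, not confirmed.
SAMPLE = """
<data>
  <time value="2022-04-14 15:45:00">
    <phases>
      <phase id="0"><current_inst>2.0</current_inst><voltage_inst>241.7</voltage_inst></phase>
      <phase id="1"><current_inst>13.06</current_inst><voltage_inst>242.5</voltage_inst></phase>
    </phases>
  </time>
  <time value="2022-04-14 15:30:00">
    <phases>
      <phase id="1"><current_inst>14.54</current_inst><voltage_inst>241.0</voltage_inst></phase>
    </phases>
  </time>
</data>
"""


def rows_for_phase(xml_text, phase_id):
    """Return one dict per <time> node that contains the requested phase."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for time_node in root.iter("time"):
        phase = time_node.find(f".//phase[@id='{phase_id}']")
        if phase is None:
            continue  # skip timestamps that have no reading for this phase
        row = {"time": time_node.get("value"), "id": phase.get("id")}
        row.update({child.tag: child.text for child in phase})
        rows.append(row)
    return rows


def to_csv_text(rows):
    """Serialise the rows to CSV text, using the first row's keys as the header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = rows_for_phase(SAMPLE, 1)
csv_text = to_csv_text(rows)
print(csv_text)
```

Because the timestamp and the reading come from the same <time> node, a missing phase only skips that one row rather than shifting every row after it.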

As of pandas 1.5, read_xml also supports parsing dates:

time_df = pd.read_xml(
    s, xpath="//time", attrs_only=True, names=["time"], parse_dates=["value"]
)
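On older pandas versions, the same conversion can be done after loading with pd.to_datetime. This is a small sketch on a hand-built frame that reuses the first two timestamps from the output above; the format string is an assumption based on how those timestamps are printed.

```python
import pandas as pd

# Two timestamps copied from the joined output above, still plain strings.
time_df = pd.DataFrame({"time": ["2022-04-15 00:00:00", "2022-04-14 23:45:00"]})

# Convert the column to real datetimes (format assumed from the printed values).
time_df["time"] = pd.to_datetime(time_df["time"], format="%Y-%m-%d %H:%M:%S")

# With a datetime column, date arithmetic works directly.
step = time_df["time"].iloc[0] - time_df["time"].iloc[1]
print(time_df.dtypes)
print(step)
```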