I need to flatten an XML data, then convert it into an JSON for pandas.
The issue is, I want the value of passage (<passage gare="87271460">
) to be at every line of the dataframe.
(For context : Gare is a train station, with an ID. I call the API like this https://api.transilien.com/gare/87271460/depart and I plan to call 17 differents train station)
I ran out of idea how I can do it. This is what I have so far
response = requests.request("GET", url, headers=headers, data=payload)
# print(response.text)
dict = xmltodict.parse(response.content)
# print(dict)
s = json.dumps(dict).replace('\'', '"').replace('#', '').replace('@', '')
json_object = json.loads(s)
# print(json_object)
df = pd.json_normalize(json_object['passages'], record_path=['train'])
print(df)
This is my XML retrieved from the requests (after removing unwanted characters)
<?xml version="1.0" encoding="UTF-8"?>
<passages gare="87271460">
<train>
<date mode="R">10/04/2022 14:05</date>
<num>PIST64</num>
<miss>PIST</miss>
<term>87758896</term>
</train>
<train>
<date mode="R">10/04/2022 14:09</date>
<num>KALI66</num>
<miss>KALI</miss>
<term>87393579</term>
</train>
</passages>
The final output I need is :
passage num miss term etat date.mode date.text
0 87271460 ERBE85 ERBE 87271486 Supprimé R 10/04/2022 16:09
1 87271460 PINS74 PINS 87758896 NaN R 10/04/2022 16:10
2 87271460 PINS80 PINS 87758896 NaN R 10/04/2022 16:17
3 87271460 KARE82 KARE 87758623 Supprimé R 10/04/2022 16:23
4 87271460 EPAU81 EPAU 87001479 NaN R 10/04/2022 16:29
5 87271460 ERIO91 ERIO 87001479 NaN R 10/04/2022 16:30
6 87271460 PINS86 PINS 87758896 NaN R 10/04/2022 16:32
7 87271460 KARE88 KARE 87393579 NaN R 10/04/2022 16:38
8 87271460 ERBE97 ERBE 87271486 Supprimé R 10/04/2022 16:39
9 87271460 EPAU93 EPAU 87001479 NaN R 10/04/2022 16:43
10 87271460 PINS92 PINS 87758896 NaN R 10/04/2022 16:47
11 87271460 EPIN99 EPIN 87001479 NaN R 10/04/2022 16:52
12 87271460 KARE94 KARE 87758623 Supprimé R 10/04/2022 16:53
13 87271460 ERAN67 ERAN 87001479 NaN R 10/04/2022 16:54
14 87271460 EPOL69 EPOL 87001479 NaN R 10/04/2022 17:01
15 87271460 PINS98 PINS 87758896 NaN R 10/04/2022 17:02
16 87271460 KABE02 KABE 87393579 NaN R 10/04/2022 17:08
17 87271460 ERAN73 ERAN 87001479 NaN R 10/04/2022 17:09
18 87271460 EPOL75 EPOL 87001479 NaN R 10/04/2022 17:16
19 87271460 PITA06 PITA 87758896 NaN R 10/04/2022 17:17
20 87271460 KABE08 KABE 87393579 NaN R 10/04/2022 17:23
21 87271460 ERAN79 ERAN 87001479 NaN R 10/04/2022 17:24
22 87271460 EPOL81 EPOL 87001479 NaN R 10/04/2022 17:31
23 87271460 PITA12 PITA 87758896 NaN R 10/04/2022 17:32
24 87271460 KABE14 KABE 87393579 NaN R 10/04/2022 17:38
25 87271460 ERAN85 ERAN 87001479 NaN R 10/04/2022 17:39
26 87271460 EPOL87 EPOL 87001479 NaN R 10/04/2022 17:46
27 87271460 PITA18 PITA 87758896 NaN R 10/04/2022 17:47
28 87271460 KABE20 KABE 87393579 NaN R 10/04/2022 17:53
29 87271460 ERAN91 ERAN 87001479 NaN R 10/04/2022 17:54
CodePudding user response:
You can create your Dataframe from the complete json (instead of just passage
) and then join the gare
column to the normalized train
column:
response = """<?xml version="1.0" encoding="UTF-8"?>
<passages gare="87271460">
<train>
<date mode="R">10/04/2022 14:05</date>
<num>PIST64</num>
<miss>PIST</miss>
<term>87758896</term>
</train>
<train>
<date mode="R">10/04/2022 14:09</date>
<num>KALI66</num>
<miss>KALI</miss>
<term>87393579</term>
</train>
</passages>
"""
dict = xmltodict.parse(response)
s = json.dumps(dict).replace('\'', '"').replace('#', '').replace('@', '')
json_object = json.loads(s)
df = pd.DataFrame.from_dict(json_object, orient='index')
df = df.explode('train').reset_index(drop=True)
df = df.join(pd.json_normalize(df['train'])).drop('train', 1)
print(df)
Output:
gare num miss term date.mode date.text
0 87271460 PIST64 PIST 87758896 R 10/04/2022 14:05
1 87271460 KALI66 KALI 87393579 R 10/04/2022 14:09