Home > OS >  Flatten JSON to pandas using json_normalize
Flatten JSON to pandas using json_normalize

Time:04-11

I need to flatten an XML data, then convert it into an JSON for pandas. The issue is, I want the value of passage (<passage gare="87271460">) to be at every line of the dataframe.

(For context : Gare is a train station, with an ID. I call the API like this https://api.transilien.com/gare/87271460/depart and I plan to call 17 differents train station)

I ran out of idea how I can do it. This is what I have so far

    response = requests.request("GET", url, headers=headers, data=payload)
    # print(response.text)
    dict = xmltodict.parse(response.content)
    # print(dict)
    s = json.dumps(dict).replace('\'', '"').replace('#', '').replace('@', '')
    json_object = json.loads(s)
    # print(json_object)
    df = pd.json_normalize(json_object['passages'], record_path=['train'])
    print(df)

This is my XML retrieved from the requests (after removing unwanted characters)

<?xml version="1.0" encoding="UTF-8"?>
<passages gare="87271460">
    <train>
        <date mode="R">10/04/2022 14:05</date>
        <num>PIST64</num>
        <miss>PIST</miss>
        <term>87758896</term>
    </train>
    <train>
        <date mode="R">10/04/2022 14:09</date>
        <num>KALI66</num>
        <miss>KALI</miss>
        <term>87393579</term>
    </train>
</passages>

The final output I need is :

    passage   num     miss      term      etat date.mode         date.text
0   87271460  ERBE85  ERBE  87271486  Supprimé         R  10/04/2022 16:09
1   87271460  PINS74  PINS  87758896       NaN         R  10/04/2022 16:10
2   87271460  PINS80  PINS  87758896       NaN         R  10/04/2022 16:17
3   87271460  KARE82  KARE  87758623  Supprimé         R  10/04/2022 16:23
4   87271460  EPAU81  EPAU  87001479       NaN         R  10/04/2022 16:29
5   87271460  ERIO91  ERIO  87001479       NaN         R  10/04/2022 16:30
6   87271460  PINS86  PINS  87758896       NaN         R  10/04/2022 16:32
7   87271460  KARE88  KARE  87393579       NaN         R  10/04/2022 16:38
8   87271460  ERBE97  ERBE  87271486  Supprimé         R  10/04/2022 16:39
9   87271460  EPAU93  EPAU  87001479       NaN         R  10/04/2022 16:43
10  87271460  PINS92  PINS  87758896       NaN         R  10/04/2022 16:47
11  87271460  EPIN99  EPIN  87001479       NaN         R  10/04/2022 16:52
12  87271460  KARE94  KARE  87758623  Supprimé         R  10/04/2022 16:53
13  87271460  ERAN67  ERAN  87001479       NaN         R  10/04/2022 16:54
14  87271460  EPOL69  EPOL  87001479       NaN         R  10/04/2022 17:01
15  87271460  PINS98  PINS  87758896       NaN         R  10/04/2022 17:02
16  87271460  KABE02  KABE  87393579       NaN         R  10/04/2022 17:08
17  87271460  ERAN73  ERAN  87001479       NaN         R  10/04/2022 17:09
18  87271460  EPOL75  EPOL  87001479       NaN         R  10/04/2022 17:16
19  87271460  PITA06  PITA  87758896       NaN         R  10/04/2022 17:17
20  87271460  KABE08  KABE  87393579       NaN         R  10/04/2022 17:23
21  87271460  ERAN79  ERAN  87001479       NaN         R  10/04/2022 17:24
22  87271460  EPOL81  EPOL  87001479       NaN         R  10/04/2022 17:31
23  87271460  PITA12  PITA  87758896       NaN         R  10/04/2022 17:32
24  87271460  KABE14  KABE  87393579       NaN         R  10/04/2022 17:38
25  87271460  ERAN85  ERAN  87001479       NaN         R  10/04/2022 17:39
26  87271460  EPOL87  EPOL  87001479       NaN         R  10/04/2022 17:46
27  87271460  PITA18  PITA  87758896       NaN         R  10/04/2022 17:47
28  87271460  KABE20  KABE  87393579       NaN         R  10/04/2022 17:53
29  87271460  ERAN91  ERAN  87001479       NaN         R  10/04/2022 17:54

CodePudding user response:

You can create your Dataframe from the complete json (instead of just passage) and then join the gare column to the normalized train column:

response = """<?xml version="1.0" encoding="UTF-8"?>
<passages gare="87271460">
    <train>
        <date mode="R">10/04/2022 14:05</date>
        <num>PIST64</num>
        <miss>PIST</miss>
        <term>87758896</term>
    </train>
    <train>
        <date mode="R">10/04/2022 14:09</date>
        <num>KALI66</num>
        <miss>KALI</miss>
        <term>87393579</term>
    </train>
</passages>
"""

dict = xmltodict.parse(response)
s = json.dumps(dict).replace('\'', '"').replace('#', '').replace('@', '')
json_object = json.loads(s)

df = pd.DataFrame.from_dict(json_object, orient='index')
df = df.explode('train').reset_index(drop=True)
df = df.join(pd.json_normalize(df['train'])).drop('train', 1)
print(df)

Output:

       gare     num  miss      term date.mode         date.text
0  87271460  PIST64  PIST  87758896         R  10/04/2022 14:05
1  87271460  KALI66  KALI  87393579         R  10/04/2022 14:09
  • Related