Home > database >  How do I unpack multiple levels using json_normalize in python pandas?
How do I unpack multiple levels using json_normalize in python pandas?

Time:06-15

import pandas as pd
from pandas import json_normalize

Suppose I have the following data:

data = [
    {
        'Name':'Rocco',
        'Year':2020,
        'Location':'Itay',
        'Source':[
            {'Movies': 
             {'MovieNumber':1, 'Money':1000, 'Percent':10}
            }
        ]
    },
    {
        'Name':'Anja',
        'Year':2021,
        'Location':'Germany',
        'Source':[
            {'Movies': [
                {'MovieNumber':1, 'Money':2000, 'Percent':10},
                {'MovieNumber':2, 'Money':3000, 'Percent':10}
            ]
            }
        ]
    },
    {
        'Name':'Kasia',
        'Year':2021,
        'Location':'Poland',
        'Source':[
            {'Movies': [
                {'MovieNumber':1, 'Money':1000, 'Percent':10},
                {'MovieNumber':2, 'Money':1000, 'Percent':10},
                {'MovieNumber':3, 'Money':1000, 'Percent':10}
            ]
            }
        ]
    }
]

This is what it looks like as a pandas DF:

df = pd.DataFrame(data)
df

df1

This was my first attempt at unpacking:

results = pd.json_normalize(
    data,
    record_path = ['Source'],
    meta = ['Year']
)

results

This only did the first level so Rocco's information came out correct but Anja's and Kasia's did not because the data structure inside Source changed to a list.

My next attempt involved the following but it too did not work:

results2 = pd.json_normalize(
    data,
    record_path = ['Source'],
    meta = [
        'Year',
        ['Source'],
           ]
)

results2

From looking at the documentation and YouTube videos, I think I need another element in both the record_path field and meta field but I don't know what would go there. I've tried different combinations but none of them worked.

Does anyone know how I can unpack this? I would like the end result to be a DF with the following columns: Year, Name, Location, Movies.MovieNumber, Movies.Money, Movies.Percent. So, there should be a total of 6 rows.

CodePudding user response:

Fix your dictionary first, it's not consistent, this makes it consistent:

for i, x in enumerate(data):
    x = x['Source'][0]['Movies']
    if not isinstance(x, list):
        data[i]['Source'][0]['Movies'] = [x]

Then json_normalize works just fine:

df = pd.json_normalize(data, ['Source','Movies'], ['Name', 'Year', 'Location'])
print(df)

Output:

   MovieNumber  Money  Percent   Name  Year Location
0            1   1000       10  Rocco  2020     Itay
1            1   2000       10   Anja  2021  Germany
2            2   3000       10   Anja  2021  Germany
3            1   1000       10  Kasia  2021   Poland
4            2   1000       10  Kasia  2021   Poland
5            3   1000       10  Kasia  2021   Poland
  • Related