How to create a multi index dataframe from a nested dictionary?-CodePudding

I have a nested dictionary, whose first level keys are [0, 1, 2...] and the corresponding values of each key are of the form:

{
    "geometry": {
        "type": "Point",
        "coordinates": [75.4516454, 27.2520587]
    },
    "type": "Feature",
    "properties": {
        "state": "Rajasthan",
        "code": "BDHL",
        "name": "Badhal",
        "zone": "NWR",
        "address": "Kishangarh Renwal, Rajasthan"
    }
}

I want to make a pandas dataframe of the form:

        Geometry           Type                    Properties
   Type      Coordinates           State     Code    Name    Zone    Address
0  Point     [..., ...]   Features Rajasthan BDHL    ...     ...     ...
1
2

I am not able to understand the examples over the net about multi indexing/nested dataframe/pivoting. None of them seem to take the first level keys as the primary index in the required dataframe.

How do I get from the data I have, to making it into this formatted dataframe?

CodePudding user response：

I would suggest to create columns as "geometry_type", "geometry_coord", etc.. in order to differentiate theses columns from the column that you would name "type". In other words, using your first key as a prefix, and the subkey as the name, and hence creating a new name. And after, just parse and fill your Dataframe like that:

import json
j = json.loads("your_json.json")

df = pd.DataFrame(columns=["geometry_type", "geometry_coord", ... ])

for k, v in j.items():
    if k == "geometry":
        df = df.append({
            "geometry_type": v.get("type"),
            "geometry_coord": v.get("coordinates")
        }, ignore_index=True)
    ...

Your output could then looks like this :

    geometry_type               geometry_coord    ...
0   [75.4516454, 27.2520587]    NaN               ...

PS : And if you really want to go for your initial option, you could check here : Giving a column multiple indexes/headers

CodePudding user response：

I suppose you have a list of nested dictionaries.

Use json_normalize to read json data and split current column index into 2 part using str.partition:

import pandas as pd
import json

data = json.load(open('data.json'))
df = pd.json_normalize(data)
df.columns = df.columns.str.partition('.', expand=True).droplevel(level=1)

Output:

>>> df.columns
MultiIndex([(      'type',            ''),
            (  'geometry',        'type'),
            (  'geometry', 'coordinates'),
            ('properties',       'state'),
            ('properties',        'code'),
            ('properties',        'name'),
            ('properties',        'zone'),
            ('properties',     'address')],
           )

>>> df
      type geometry                           properties                     
               type               coordinates      state  code    name zone                        address   
0  Feature    Point  [75.4516454, 27.2520587]  Rajasthan  BDHL  Badhal  NWR   Kishangarh Renwal, Rajasthan

CodePudding user response：

You can use pd.json_normalize() to normalize the nested dictionary into a dataframe df.

Then, split the column names with dots into multi-index with Index.str.split on df.columns with parameter expand=True, as follows:

Step 1: Normalize nested dict into a dataframe

j = {
    "geometry": {
        "type": "Point",
        "coordinates": [75.4516454, 27.2520587]
    },
    "type": "Feature",
    "properties": {
        "state": "Rajasthan",
        "code": "BDHL",
        "name": "Badhal",
        "zone": "NWR",
        "address": "Kishangarh Renwal, Rajasthan"
    }
} 

df = pd.json_normalize(j)

Step 1 Result:

print(df)

      type geometry.type      geometry.coordinates properties.state properties.code properties.name properties.zone            properties.address
0  Feature         Point  [75.4516454, 27.2520587]        Rajasthan            BDHL          Badhal             NWR  Kishangarh Renwal, Rajasthan

Step 2: Create Multi-index column labels

df.columns = df.columns.str.split('.', expand=True)

Step 2 (Final) Result:

print(df)

      type geometry                           properties                                                 
       NaN     type               coordinates      state  code    name zone                       address
0  Feature    Point  [75.4516454, 27.2520587]  Rajasthan  BDHL  Badhal  NWR  Kishangarh Renwal, Rajasthan