Using pandas.json_normalize to "unfold" a dictionary of a list of dictionaries


I am new to Python (and coding in general) so I'll do my best to explain the challenge I'm trying to work through.

I'm working with a large dataset which was exported as a CSV from a database. However, there is one column within this CSV export that contains a nested list of dictionaries (as best as I can tell). I've looked around extensively online for a solution, including on Stackoverflow, but haven't quite gotten a full solution. I think I understand conceptually what I'm trying to accomplish, but not clear as to the best method or data prepping process to use.

Here is an example of the data (pared down to just the two columns I'm interested in):

    {
       "app_ID": {
          "0": "1abe23574",
          "1": "4gbn21096"
       },
       "locations": {
          "0": "[ {\"loc_id\": \"abc1\", \"lat\": \"12.3456\", \"long\": \"101.9876\"}, {\"loc_id\": \"abc2\", \"lat\": \"45.7890\", \"long\": \"102.6543\"} ]",
          "1": "[ ]"
       }
    }

Basically, each app_ID can have multiple locations tied to a single ID, or the list can be empty, as seen above. Following some guides I found online, I have attempted to use pandas' json_normalize() function to "unfold" the list of dictionaries into their own rows in a pandas DataFrame.

I'd like to end up with something like this:

loc_id    lat      long       app_ID
abc1      12.3456  101.9876   1abe23574
abc2      45.7890  102.6543   1abe23574

etc...

I am learning about how to use the different functions of json_normalize, like "record_path" and "meta", but haven't been able to get it to work yet.

I have tried loading the json file into a Jupyter Notebook using:

import json
import pandas as pd

with open('location_json.json', 'r') as f:
    data = json.loads(f.read())
df = pd.json_normalize(data, record_path=['locations'])

but it only creates a DataFrame with a single row and many columns, where I'd like multiple rows generated from the innermost dictionaries that tie back to the app_ID and loc_id fields.
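(That single-row result is what json_normalize does by default: without a usable record_path, it treats the whole top-level dictionary as one record and turns nested keys into dotted column names. A minimal illustration with made-up placeholder data:)

```python
import pandas as pd

# A single dict is one record: nested keys become dotted column names,
# so the result is one row with many columns.
one_row = pd.json_normalize({"locations": {"0": "[...]", "1": "[ ]"}})
print(list(one_row.columns))  # ['locations.0', 'locations.1']
print(len(one_row))           # 1
```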

Attempt at a solution:

I was able to get close to the dataframe format I wanted using:

import json
import pandas as pd

with open('location_json.json', 'r') as f:
    data = json.loads(f.read())
df = pd.json_normalize(data['locations']['0'])

but that would then require some kind of iteration through the list in order to create a DataFrame, and I'd lose the connection to the app_ID fields (as best as I understand how json_normalize works).
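For what it's worth, json_normalize can produce the table I'm after once the data is reshaped into a list of records and the locations strings are parsed with json.loads — a sketch, assuming the strings are valid JSON:

```python
import json
import pandas as pd

# sample data from the question, with the IDs quoted as strings
data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": '[{"loc_id": "abc1", "lat": "12.3456", "long": "101.9876"},'
             ' {"loc_id": "abc2", "lat": "45.7890", "long": "102.6543"}]',
        "1": "[ ]",
    },
}

# Reshape into the list-of-records layout json_normalize expects:
# one dict per app, with the locations string parsed into a real list.
records = [
    {"app_ID": app_id, "locations": json.loads(data["locations"][key])}
    for key, app_id in data["app_ID"].items()
]

# record_path unfolds each location into its own row; meta carries app_ID along.
# Caveat: apps with an empty locations list contribute no rows here.
df = pd.json_normalize(records, record_path="locations", meta="app_ID")
# columns: loc_id, lat, long, app_ID — one row per location
```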

Am I on the right track trying to use json_normalize, or should I start over again and try a different route? Any advice or guidance would be greatly appreciated.

CodePudding user response:

I hesitate to suggest the convtools library to a beginner, because it is almost like another Python layered on top of Python: it lets you dynamically define data conversions (generating Python code under the hood).

But anyway, here is the code, if I understood the input data correctly:

import json
from convtools import conversion as c

data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": """[ {"loc_id" : "abc1",  "lat" : "12.3456",  "long" : "101.9876" },
              {"loc_id" : "abc2",  "lat" : "45.7890",  "long" : "102.6543"} ]""",
        "1": "[ ]",
    },
}

# define it once and use multiple times
converter = (
    c.join(
        # converts "app_ID" data to iterable of dicts
        (
            c.item("app_ID")
            .call_method("items")
            .iter({"id": c.item(0), "app_id": c.item(1)})
        ),
        # converts "locations" data to iterable of dicts,
        # where each id like "0" is zipped to each location.
        # the result is iterable of dicts like {"id": "0", "loc": {"loc_id": ... }}
        (
            c.item("locations")
            .call_method("items")
            .iter(
                c.zip(id=c.repeat(c.item(0)), loc=c.item(1).pipe(json.loads))
            )
            .flatten()
        ),
        # join on "id"
        c.LEFT.item("id") == c.RIGHT.item("id"),
        how="full",
    )
    # process results, where 0 index is LEFT item, 1 index is the RIGHT one
    .iter(
        {
            "loc_id": c.item(1, "loc", "loc_id", default=None),
            "lat": c.item(1, "loc", "lat", default=None),
            "long": c.item(1, "loc", "long", default=None),
            "app_id": c.item(0, "app_id"),
        }
    )
    .as_type(list)
    .gen_converter()
)
result = converter(data)

assert result == [
    {'loc_id': 'abc1', 'lat': '12.3456', 'long': '101.9876', 'app_id': '1abe23574'},
    {'loc_id': 'abc2', 'lat': '45.7890', 'long': '102.6543', 'app_id': '1abe23574'},
    {'loc_id': None, 'lat': None, 'long': None, 'app_id': '4gbn21096'}
]
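If a pandas DataFrame is still the end goal, the list of dicts the converter returns can be handed straight to the DataFrame constructor (shown here with the result hard-coded for illustration):

```python
import pandas as pd

# the list of dicts produced by the converter above
result = [
    {"loc_id": "abc1", "lat": "12.3456", "long": "101.9876", "app_id": "1abe23574"},
    {"loc_id": "abc2", "lat": "45.7890", "long": "102.6543", "app_id": "1abe23574"},
    {"loc_id": None, "lat": None, "long": None, "app_id": "4gbn21096"},
]

# pd.DataFrame builds one row per dict; the app with no locations
# keeps its row, with missing values in the location columns
df = pd.DataFrame(result)
```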

