(Pandas/Dataframe) pandas.json_normalize on nested JSON data without uniform record_path

I'm attempting to convert a large JSON file to a CSV, but the field I need to sort on in the spreadsheet ends up in a single cell whenever I convert to CSV or normalize the JSON. The main thing I need is for the hits list of dictionaries not to land in one cell when I convert to CSV. (The structure is a dictionary of dictionaries, each of which contains a list of dictionaries.)

Here's an example of what the JSON would look like: https://pastebin.com/VA5mfhfB
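
The shape is roughly this (an abridged sketch of the structure described above; apart from hits, the key and field names are illustrative, not taken from the linked file):

{
    "123456789": {                  # boss_id keys differ per entry
        "field_a": "...",
        "hits": [                   # the list of dictionaries that needs its own rows
            {"field_b": 1, "field_c": "..."},
            {"field_b": 2, "field_c": "..."}
        ]
    },
    "987654321": {"...": "..."}
}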

Here's how I've tried doing it (which at least gives some output):

df = pd.json_normalize(boss_dictionary)
df.to_csv(r'data.csv', index=None)

I've tried passing a record_path parameter, but because the boss_id keys (the long strings of numbers at the top level) aren't uniform, I can't figure out how to normalize the hits list of dictionaries.
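
For example, something along these lines (a reconstruction of the kind of call that fails, not the exact code) raises a KeyError, because the top level is keyed by the varying boss IDs rather than by a fixed field name:

df = pd.json_normalize(boss_dictionary, record_path=["hits"])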

Another thing that I've tried:

df = pd.read_json('data.json')
df.to_csv(r'data.csv', index=None)

This gets close to what I need, but not quite: the hits list still ends up in a single cell instead of being normalized out.

What I've tried to fix it:

I've tried normalizing from the dictionary itself and from the file read back with read_json. I've read the json_normalize documentation, but no combination of the meta or record_path parameters gave me a result that didn't raise an exception.

CodePudding user response:

Use json_normalize in a list comprehension over the top-level keys, then merge the two frames and explode the remaining list column.

from ast import literal_eval

import pandas as pd

# Parse the saved sample; for a file that is actually valid JSON,
# json.load would be the more direct loader.
data = literal_eval(open("/path/to/file/data.txt").read())

# Flatten each boss's top-level fields; concat with keys= keeps the boss IDs,
# which then become an "id" column (reset_index(names=...) needs pandas >= 1.5).
df_main = (
    pd
    .concat([pd.json_normalize(data=data[x]) for x in data], keys=data.keys())
    .droplevel(level=1)
    .reset_index(names="id")
)
# Keep only the last segment of the dotted column names, e.g. "a.b.c" -> "c".
df_main.columns = df_main.columns.str.split(".").str[-1]

# Flatten the nested "hits" records for each boss, again keyed by the boss ID.
df_hits = (
    pd
    .concat([pd.json_normalize(data=data[x], record_path=["hits"]) for x in data], keys=data.keys())
    .droplevel(level=1)
    .reset_index(names="id")
)
df_hits.columns = df_hits.columns.str.split(".").str[-1]

# Merge the per-boss fields onto their hits rows (the default merge keys are the
# shared columns, chiefly the boss "id"), drop the raw "hits" list column, and
# give each hp_list value its own row.
df_final = pd.merge(left=df_main, right=df_hits).drop(columns="hits")
df_final = df_final.explode("hp_list").reset_index(drop=True)
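
From there, the flattened frame can be written out just as in the question's own snippets:

df_final.to_csv(r'data.csv', index=None)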