Home > Software design >  Normalization and flattening of JSON column in a mixed type dataframe
Normalization and flattening of JSON column in a mixed type dataframe

Time:03-22

There dataframe below has columns with mixed types. Column of interest for expansion is "Info". Each row value in this column is a JSON object.

data = {'Code':['001', '002', '003', '004'],
        'Info':['{"id":001,"x_cord":[1,1,1,1],"x_y_cord":[4.703978,-39.601876],"neutral":1,"code_h":"S38A46","group":null}','{"id":002,"x_cord":[2,1,3,1],"x_y_cord":[1.703978,-38.601876],"neutral":2,"code_h":"S17A46","group":"New"}','{"id":003,"x_cord":[1,1,4,1],"x_y_cord":[112.703978,-9.601876],"neutral":4,"code_h":"S12A46","group":"Old"}','{"id":004,"x_cord":[2,1,7,1],"x_y_cord":[6.703978,-56.601876],"neutral":1,"code_h":"S12A46","group":null}'],        
        'Region':['US','Pacific','Africa','Asia']}
df = pd.DataFrame(data)

I would like to have the headers expanded i.e. have "Info.id","info.x_y_cord","info.neutral" etc as individual columns with corresponding values under them across the dataset. I've tried normalizing them via pd.json_normalize(df["Info"]) iteration but nothing seems to change. Do I need to convert the column to another type first? Can someone point me to the right direction?

The output should be something like this:

data1 = {'Code':['001', '002', '003', '004'],
        'Info.id':['001','002','003','004'],
        'Info.x_cord':['[1,1,1,1]','[2,1,3,1]','[1,1,4,1]','[2,1,7,1]'],
        'Info.x_y_cord':['[4.703978,-39.601876]','[1.703978,-38.601876]','[112.703978,-9.601876]','[6.703978,-56.601876]'],
        'Info.neutral':[1,2,4,1],
        'Info.code_h':['S38A46','S17A46','S12A46','S12A46'],
        'Info.group':[np.NaN,"New","Old",np.NaN],
        'Region':['US','Pacific','Africa','Asia']}       
     
df_final = pd.DataFrame(data1)

CodePudding user response:

First of all, your JSON strings seem to be not valid because of the ID value. 001 is not processed correctly so you'll need to pass the "id" value as a string instead. Here's one way to do that:

def id_as_string(matchObj):
    # Adds " around the ID value
    return f"\"id\":\"{matchObj.group(1)}\","

df["Info"] = df["Info"].str.replace("\"id\":(\d*),", repl=id_to_string, regex=True))

Once you've done that, you can use pd.json_normalize on your "Info" column after you've loaded the values from the JSON strings using json.loads:

import json

json_part_df = pd.json_normalize(df["Info"].map(json.loads))

After that, just rename the columns and use pd.concat to form the output dataframe:

# Rename columns
json_part_df.columns = [f"Info.{column}" for column in json_part_df.columns]

# Use pd.concat to create output
df = pd.concat([df[["Code", "Region"]], json_part_df], axis=1)
  • Related