Normalization and flattening of JSON column in a mixed type dataframe-CodePudding

There dataframe below has columns with mixed types. Column of interest for expansion is "Info". Each row value in this column is a JSON object.

data = {'Code':['001', '002', '003', '004'],
        'Info':['{"id":001,"x_cord":[1,1,1,1],"x_y_cord":[4.703978,-39.601876],"neutral":1,"code_h":"S38A46","group":null}','{"id":002,"x_cord":[2,1,3,1],"x_y_cord":[1.703978,-38.601876],"neutral":2,"code_h":"S17A46","group":"New"}','{"id":003,"x_cord":[1,1,4,1],"x_y_cord":[112.703978,-9.601876],"neutral":4,"code_h":"S12A46","group":"Old"}','{"id":004,"x_cord":[2,1,7,1],"x_y_cord":[6.703978,-56.601876],"neutral":1,"code_h":"S12A46","group":null}'],        
        'Region':['US','Pacific','Africa','Asia']}
df = pd.DataFrame(data)

I would like to have the headers expanded i.e. have "Info.id","info.x_y_cord","info.neutral" etc as individual columns with corresponding values under them across the dataset. I've tried normalizing them via pd.json_normalize(df["Info"]) iteration but nothing seems to change. Do I need to convert the column to another type first? Can someone point me to the right direction?

The output should be something like this:

data1 = {'Code':['001', '002', '003', '004'],
        'Info.id':['001','002','003','004'],
        'Info.x_cord':['[1,1,1,1]','[2,1,3,1]','[1,1,4,1]','[2,1,7,1]'],
        'Info.x_y_cord':['[4.703978,-39.601876]','[1.703978,-38.601876]','[112.703978,-9.601876]','[6.703978,-56.601876]'],
        'Info.neutral':[1,2,4,1],
        'Info.code_h':['S38A46','S17A46','S12A46','S12A46'],
        'Info.group':[np.NaN,"New","Old",np.NaN],
        'Region':['US','Pacific','Africa','Asia']}       
     
df_final = pd.DataFrame(data1)

CodePudding user response：

First of all, your JSON strings seem to be not valid because of the ID value. 001 is not processed correctly so you'll need to pass the "id" value as a string instead. Here's one way to do that:

def id_as_string(matchObj):
    # Adds " around the ID value
    return f"\"id\":\"{matchObj.group(1)}\","

df["Info"] = df["Info"].str.replace("\"id\":(\d*),", repl=id_to_string, regex=True))

Once you've done that, you can use pd.json_normalize on your "Info" column after you've loaded the values from the JSON strings using json.loads:

import json

json_part_df = pd.json_normalize(df["Info"].map(json.loads))

After that, just rename the columns and use pd.concat to form the output dataframe:

# Rename columns
json_part_df.columns = [f"Info.{column}" for column in json_part_df.columns]

# Use pd.concat to create output
df = pd.concat([df[["Code", "Region"]], json_part_df], axis=1)