There dataframe below has columns with mixed types. Column of interest for expansion is "Info". Each row value in this column is a JSON object.
data = {'Code':['001', '002', '003', '004'],
'Info':['{"id":001,"x_cord":[1,1,1,1],"x_y_cord":[4.703978,-39.601876],"neutral":1,"code_h":"S38A46","group":null}','{"id":002,"x_cord":[2,1,3,1],"x_y_cord":[1.703978,-38.601876],"neutral":2,"code_h":"S17A46","group":"New"}','{"id":003,"x_cord":[1,1,4,1],"x_y_cord":[112.703978,-9.601876],"neutral":4,"code_h":"S12A46","group":"Old"}','{"id":004,"x_cord":[2,1,7,1],"x_y_cord":[6.703978,-56.601876],"neutral":1,"code_h":"S12A46","group":null}'],
'Region':['US','Pacific','Africa','Asia']}
df = pd.DataFrame(data)
I would like to have the headers expanded i.e. have "Info.id","info.x_y_cord","info.neutral" etc as individual columns with corresponding values under them across the dataset. I've tried normalizing them via pd.json_normalize(df["Info"]) iteration but nothing seems to change. Do I need to convert the column to another type first? Can someone point me to the right direction?
The output should be something like this:
data1 = {'Code':['001', '002', '003', '004'],
'Info.id':['001','002','003','004'],
'Info.x_cord':['[1,1,1,1]','[2,1,3,1]','[1,1,4,1]','[2,1,7,1]'],
'Info.x_y_cord':['[4.703978,-39.601876]','[1.703978,-38.601876]','[112.703978,-9.601876]','[6.703978,-56.601876]'],
'Info.neutral':[1,2,4,1],
'Info.code_h':['S38A46','S17A46','S12A46','S12A46'],
'Info.group':[np.NaN,"New","Old",np.NaN],
'Region':['US','Pacific','Africa','Asia']}
df_final = pd.DataFrame(data1)
CodePudding user response:
First of all, your JSON strings seem to be not valid because of the ID value. 001
is not processed correctly so you'll need to pass the "id" value as a string instead. Here's one way to do that:
def id_as_string(matchObj):
# Adds " around the ID value
return f"\"id\":\"{matchObj.group(1)}\","
df["Info"] = df["Info"].str.replace("\"id\":(\d*),", repl=id_to_string, regex=True))
Once you've done that, you can use pd.json_normalize
on your "Info" column after you've loaded the values from the JSON strings using json.loads
:
import json
json_part_df = pd.json_normalize(df["Info"].map(json.loads))
After that, just rename the columns and use pd.concat
to form the output dataframe:
# Rename columns
json_part_df.columns = [f"Info.{column}" for column in json_part_df.columns]
# Use pd.concat to create output
df = pd.concat([df[["Code", "Region"]], json_part_df], axis=1)