How to normalize a column that contains JSON in a pandas DataFrame and get a complete DataFrame

I have a pandas DataFrame in which one column, Json_result, contains JSON data:

Student_Id	V_Id	Json_result
32101	35	[{"q_id":"8007","q_text":"வேறுபட்ட பொம்மை எது?","q_img":"","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"105","MC":"1","LO Text":"ஓவியம் மற்றும் படங்களின் வெளிப்படையான மற்றும் மறைமுகமான கூறுகளை நுட்பமாக உற்று நோக்குதல்.","notes":"","isAnswered":"1","correctAnswerId":["1"],"isAnswerCorrect":"1","answer":"1"},{"q_id":"8008","q_text":"","q_img":"8008_Set_3.png","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"106","MC":"1","LO Text":"கதை, சூழல், நிகழ்வைத் தொடர்ச்சியான படங்கள் மற்றும் இவற்றில் இடம் பெறும் செயல்பாடுகள் பற்றி பேசுதல்.","notes":"(படம் பார்த்துக் கதையை மிகச் சரியாகக் கூறினால் 'சிறப்பு', சரியாகக் கூறினால் 'அருமை', கதையைக் கூறவில்லை என்றால் 'சிந்திக்க' என்பதைத் தன்னார்வலர் தேர்ந்தெடுக்கவும்.)","isAnswered":"1","correctAnswerId":["1","2"],"isAnswerCorrect":"","answer":"3"},{"q_id":"8009","q_text":"","q_img":"8009_Set_3.png","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"109","MC":"1","LO Text":"அச்சடிக்கப்பட்ட குறிப்பிட்ட எழுத்தை அடையாளம் காணுதல்.","notes":"","isAnswered":"1","correctAnswerId":["1"],"isAnswerCorrect":"1","answer":"1"}]
32102	35	[{"q_id":"8007","q_text":"வேறுபட்ட பொம்மை எது?","q_img":"","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"105","MC":"1","LO Text":"ஓவியம் மற்றும் படங்களின் வெளிப்படையான மற்றும் மறைமுகமான கூறுகளை நுட்பமாக உற்று நோக்குதல்.","notes":"","isAnswered":"1","correctAnswerId":["1"],"isAnswerCorrect":"1","answer":"1"},{"q_id":"8008","q_text":"","q_img":"8008_Set_3.png","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"106","MC":"1","LO Text":"கதை, சூழல், நிகழ்வைத் தொடர்ச்சியான படங்கள் மற்றும் இவற்றில் இடம் பெறும் செயல்பாடுகள் பற்றி பேசுதல்.","notes":"(படம் பார்த்துக் கதையை மிகச் சரியாகக் கூறினால் 'சிறப்பு', சரியாகக் கூறினால் 'அருமை', கதையைக் கூறவில்லை என்றால் 'சிந்திக்க' என்பதைத் தன்னார்வலர் தேர்ந்தெடுக்கவும்.)","isAnswered":"1","correctAnswerId":["1","2"],"isAnswerCorrect":"","answer":"3"},{"q_id":"8009","q_text":"","q_img":"8009_Set_3.png","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"109","MC":"1","LO Text":"அச்சடிக்கப்பட்ட குறிப்பிட்ட எழுத்தை அடையாளம் காணுதல்.","notes":"","isAnswered":"1","correctAnswerId":["1"],"isAnswerCorrect":"1","answer":"1"}]

I would like to normalize the JSON content in the Json_result column so that each JSON attribute becomes its own column in the DataFrame. The DataFrame has more than 40k rows.

The JSON in a single row has the following form:

[
  {
    "q_id": "8007",
    "q_text": "வேறுபட்ட பொம்மை எது?",
    "q_img": "",
    "subject": "Tamil",
    "q_medium": "Tamil",
    "Skill": "0",
    "Class": "1std",
    "LO ID": "105",
    "MC": "1",
    "LO Text": "ஓவியம் மற்றும் படங்களின் வெளிப்படையான மற்றும் மறைமுகமான கூறுகளை நுட்பமாக உற்று நோக்குதல்.",
    "notes": "",
    "isAnswered": "1",
    "correctAnswerId": [
      "1"
    ],
    "isAnswerCorrect": "1",
    "answer": "1"
  },
  {
    "q_id": "8008",
    "q_text": "",
    "q_img": "8008_Set_3.png",
    "subject": "Tamil",
    "q_medium": "Tamil",
    "Skill": "0",
    "Class": "1std",
    "LO ID": "106",
    "MC": "1",
    "LO Text": "கதை, சூழல், நிகழ்வைத் தொடர்ச்சியான படங்கள் மற்றும் இவற்றில் இடம் பெறும் செயல்பாடுகள் பற்றி பேசுதல்.",
    "notes": "(படம் பார்த்துக் கதையை மிகச் சரியாகக் கூறினால் 'சிறப்பு', சரியாகக் கூறினால் 'அருமை', கதையைக் கூறவில்லை என்றால் 'சிந்திக்க' என்பதைத் தன்னார்வலர் தேர்ந்தெடுக்கவும்.)",
    "isAnswered": "1",
    "correctAnswerId": [
      "1",
      "2"
    ],
    "isAnswerCorrect": "",
    "answer": "3"
  },
  {
    "q_id": "8009",
    "q_text": "",
    "q_img": "8009_Set_3.png",
    "subject": "Tamil",
    "q_medium": "Tamil",
    "Skill": "0",
    "Class": "1std",
    "LO ID": "109",
    "MC": "1",
    "LO Text": "அச்சடிக்கப்பட்ட குறிப்பிட்ட எழுத்தை அடையாளம் காணுதல்.",
    "notes": "",
    "isAnswered": "1",
    "correctAnswerId": [
      "1"
    ],
    "isAnswerCorrect": "1",
    "answer": "1"
  }
]

I want to link each student to the q_id entries in their JSON and get output like this:

Student_Id	V_Id	q_id	subject	q_medium	Class	LO_ID	isAnswered	correctAnswerId	isAnswerCorrect	answer
32101	35	8007	Tamil	Tamil	1std	105	1	1	1	1
32101	35	8008	Tamil	Tamil	1std	106	1	[1,2]	-	3
32101	35	8009	Tamil	Tamil	1std	109	1	1	1	1
32102	35	8007	Tamil	Tamil	1std	105	1	1	1	1
32102	35	8008	Tamil	Tamil	1std	106	1	[1,2]	-	3
32102	35	8009	Tamil	Tamil	1std	109	1	1	1	1

I want to build this kind of DataFrame for all 40k student rows. How can I do this in Python?

User response:

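You can start by parsing each JSON string with json.loads, calling df.explode() to get one row per question, and then looping over the keys with .apply(lambda) to pull each value in Json_result into its own column, as shown in the example below: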

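import json

# Parse each JSON string into a list of dicts
df['Json_result'] = df['Json_result'].apply(json.loads)
# One row per question; reset the index because explode() repeats it
df = df.explode('Json_result').reset_index(drop=True)
# Take the key names from the first record (assumes every record has the same keys)
keys = df['Json_result'].iloc[0].keys()
for column in keys:  # create one new column per key
    df[column] = df['Json_result'].apply(lambda x: x.get(column))
df = df.drop(columns='Json_result')  # the raw dicts are no longer needed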
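
As an alternative to one .apply() per key, the rows can be flattened in a single pass by merging the student columns into every question dict and building the DataFrame once. This is only a sketch: the sample data below is a hypothetical, trimmed-down version of the data in the question, and the rename of "LO ID" and the "-" placeholder are assumptions taken from the requested output.

```python
import json
import pandas as pd

# Hypothetical sample with the question's layout; fields are trimmed for brevity.
payload = json.dumps([
    {"q_id": "8007", "subject": "Tamil", "Class": "1std", "LO ID": "105",
     "isAnswered": "1", "correctAnswerId": ["1"], "isAnswerCorrect": "1", "answer": "1"},
    {"q_id": "8008", "subject": "Tamil", "Class": "1std", "LO ID": "106",
     "isAnswered": "1", "correctAnswerId": ["1", "2"], "isAnswerCorrect": "", "answer": "3"},
])
df = pd.DataFrame({"Student_Id": [32101, 32102], "V_Id": [35, 35],
                   "Json_result": [payload, payload]})

# One output row per (student, question): parse each JSON string once and
# merge the student's identifiers into every question dict.
rows = [
    {"Student_Id": sid, "V_Id": vid, **question}
    for sid, vid, raw in zip(df["Student_Id"], df["V_Id"], df["Json_result"])
    for question in json.loads(raw)
]
flat = pd.DataFrame(rows).rename(columns={"LO ID": "LO_ID"})

# Match the requested output: "-" for a blank isAnswerCorrect.
flat["isAnswerCorrect"] = flat["isAnswerCorrect"].replace("", "-")

print(flat)
```

Here correctAnswerId stays a Python list in each cell. If the records do not all share the same keys, the missing values appear as NaN instead of being dropped.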