I have a dataframe in which is one of the column is a json object as shown below
customer_id | date | json_object
--------------------------------------------------------------------------
A101 | 2022-06-21 | {'name':['james'],'age':[55], 'hobby':['pubg']}
A102 | 2022-06-22 | {'name':['tarzan'],'status':[]}
The content of the jason object is not uniform. In the above example, the json object in the first row as 'hobby' which is not present in the json object of the second row. Similary in the 2nd row, the attribute status is empty i.e. []
Question: How can I flatten this dataframe in Python to create a new dataframe where each row corresponds to one json object only as shown below
customer_id | date | attribute
---------------------------------------------
A101 | 2022-06-21 | 'name': 'james'
A101 | 2022-06-21 | 'age': 55
A101 | 2022-06-21 | 'hobby': 'pubg'
A102 | 2022-06-22 | 'name': 'tarzan'
A102 | 2022-06-22 | 'status':
CodePudding user response:
Assuming each value of json_object
is a dict
, you could also use the following approach:
df = pd.DataFrame(
data = {
"customer_id": ["A101", "A102"],
"date": ["2022-06-21", "2022-06-22"],
"json_object": [{'name': 'james','age':55, 'hobby':'pubg'}, {'name': 'tarzan','status':'single'}]
}
)
df["json_object"] = df["json_object"].map(lambda x: [[i, x[i]] for i in x])
df = df.explode(column="json_object")
df.json_object = df.json_object.str[0].astype(str) ": " df.json_object.str[1].astype(str)
df
------------------------------------------
customer_id date json_object
0 A101 2022-06-21 name: james
0 A101 2022-06-21 age: 55
0 A101 2022-06-21 hobby: pubg
1 A102 2022-06-22 name: tarzan
1 A102 2022-06-22 status: single
------------------------------------------
EDIT
Since you changed your data frame to
df = pd.DataFrame(
data = {
"customer_id": ["A101", "A102"],
"date": ["2022-06-21", "2022-06-22"],
"json_object": [{'name': ['james'],'age':[55], 'hobby':['pubg']}, {'name': ['tarzan'],'status':['single']}]
}
)
my code must be adjusted as follows:
df = pd.DataFrame(
data = {
"customer_id": ["A101", "A102"],
"date": ["2022-06-21", "2022-06-22"],
"json_object": [{'name': ['james'],'age':[55], 'hobby':['pubg']}, {'name': ['tarzan'],'status':['single']}]
}
)
df["json_object"] = df["json_object"].map(lambda x: [[i, x[i][0]] for i in x])
df = df.explode(column="json_object")
df.json_object = df.json_object.str[0].astype(str) ": " df.json_object.str[1].astype(str)
df
If empty lists are included then simply add an if-else
condition inside the lambda
function. Note, I have also renamed the columns in the next code extraction.
df = pd.DataFrame(
data = {
"customer_id": ["A101", "A102"],
"date": ["2022-06-21", "2022-06-22"],
"json_object": [{'name': ['james'],'age':[55], 'hobby':['pubg']}, {'name': ['tarzan'],'status':[]}]
}
)
df["json_object"] = df["json_object"].map(lambda x: [[i, x[i][0]] if x[i] else [i, ""] for i in x])
df = df.rename(columns={"json_object": "attribute"}).explode(column="attribute")
df.attribute = df.attribute.str[0].astype(str) ": " df.attribute.str[1].astype(str)
CodePudding user response:
I hope I understood you right:
from ast import literal_eval
df["json_object"] = df["json_object"].apply(lambda x: literal_eval(x).items())
df = df.explode("json_object")
After this the df
will be:
customer_id date json_object
0 A101 2022-06-21 (name, james)
0 A101 2022-06-21 (age, 55)
0 A101 2022-06-21 (hobby, pubg)
1 A102 2022-06-22 (name, tarzan)
1 A102 2022-06-22 (status, single)
Then:
df["attribute"] = df["json_object"].apply(lambda x: "{}: {}".format(*x))
df = df.drop(columns="json_object")
print(df)
Prints:
customer_id date attribute
0 A101 2022-06-21 name: james
0 A101 2022-06-21 age: 55
0 A101 2022-06-21 hobby: pubg
1 A102 2022-06-22 name: tarzan
1 A102 2022-06-22 status: single
CodePudding user response:
This is not the data structure you asked for and it depends on your subsequent steps, but it is generally a good idea to put one value into one cell and don't mix different data in a single column. So with the data
data = {"customer_id": "A101 A102".split(),
"date": "2022-06-21 2022-06-22".split(),
"json_object": [{'name':['james'], 'age':[55], 'hobby':['pubg']}, {'name':['tarzan'],'status':[]}]}
df = pd.DataFrame(pd.DataFrame(data))
and the df
customer_id date json_object
0 A101 2022-06-21 {'name': ['james'], 'age': [55], 'hobby': ['pu...
1 A102 2022-06-22 {'name': ['tarzan'], 'status': []}
you could also do something like
pd.concat([df, pd.DataFrame.from_records(df.json_object)], axis=1).drop("json_object", axis=1)
which will give you the data in separate columns, (almost) ready for further processing.
customer_id date name age hobby status
0 A101 2022-06-21 [james] [55] [pubg] NaN
1 A102 2022-06-22 [tarzan] NaN NaN []