How to flatten a data frame in Python in which one column contains a json object?-CodePudding

I have a dataframe in which is one of the column is a json object as shown below

customer_id |    date    |             json_object
--------------------------------------------------------------------------
A101        | 2022-06-21 | {'name':['james'],'age':[55], 'hobby':['pubg']}
A102        | 2022-06-22 | {'name':['tarzan'],'status':[]}

The content of the jason object is not uniform. In the above example, the json object in the first row as 'hobby' which is not present in the json object of the second row. Similary in the 2nd row, the attribute status is empty i.e. []

Question: How can I flatten this dataframe in Python to create a new dataframe where each row corresponds to one json object only as shown below

customer_id |    date    | attribute
---------------------------------------------
A101        | 2022-06-21 | 'name': 'james'
A101        | 2022-06-21 | 'age': 55
A101        | 2022-06-21 | 'hobby': 'pubg'
A102        | 2022-06-22 | 'name': 'tarzan'
A102        | 2022-06-22 | 'status':

CodePudding user response：

Assuming each value of json_object is a dict, you could also use the following approach:

df = pd.DataFrame(
    data = {
        "customer_id": ["A101", "A102"],
        "date": ["2022-06-21", "2022-06-22"],
        "json_object": [{'name': 'james','age':55, 'hobby':'pubg'}, {'name': 'tarzan','status':'single'}]
    }
)
df["json_object"] = df["json_object"].map(lambda x: [[i, x[i]] for i in x])
df = df.explode(column="json_object")
df.json_object = df.json_object.str[0].astype(str)   ": "   df.json_object.str[1].astype(str) 
df

------------------------------------------
    customer_id  date        json_object
0   A101         2022-06-21  name: james
0   A101         2022-06-21  age: 55
0   A101         2022-06-21  hobby: pubg
1   A102         2022-06-22  name: tarzan
1   A102         2022-06-22  status: single
------------------------------------------

EDIT

Since you changed your data frame to

df = pd.DataFrame(
    data = {
        "customer_id": ["A101", "A102"],
        "date": ["2022-06-21", "2022-06-22"],
        "json_object": [{'name': ['james'],'age':[55], 'hobby':['pubg']}, {'name': ['tarzan'],'status':['single']}]
    }
)

my code must be adjusted as follows:

df = pd.DataFrame(
    data = {
        "customer_id": ["A101", "A102"],
        "date": ["2022-06-21", "2022-06-22"],
        "json_object": [{'name': ['james'],'age':[55], 'hobby':['pubg']}, {'name': ['tarzan'],'status':['single']}]
    }
)
df["json_object"] = df["json_object"].map(lambda x: [[i, x[i][0]] for i in x])
df = df.explode(column="json_object")
df.json_object = df.json_object.str[0].astype(str)   ": "   df.json_object.str[1].astype(str) 
df

If empty lists are included then simply add an if-else condition inside the lambda function. Note, I have also renamed the columns in the next code extraction.

df = pd.DataFrame(
    data = {
        "customer_id": ["A101", "A102"],
        "date": ["2022-06-21", "2022-06-22"],
        "json_object": [{'name': ['james'],'age':[55], 'hobby':['pubg']}, {'name': ['tarzan'],'status':[]}]
    }
)
df["json_object"] = df["json_object"].map(lambda x: [[i, x[i][0]] if x[i] else [i, ""] for i in x])
df = df.rename(columns={"json_object": "attribute"}).explode(column="attribute")
df.attribute = df.attribute.str[0].astype(str)   ": "   df.attribute.str[1].astype(str)

CodePudding user response：

I hope I understood you right:

from ast import literal_eval

df["json_object"] = df["json_object"].apply(lambda x: literal_eval(x).items())
df = df.explode("json_object")

After this the df will be:

  customer_id        date       json_object
0        A101  2022-06-21     (name, james)
0        A101  2022-06-21         (age, 55)
0        A101  2022-06-21     (hobby, pubg)
1        A102  2022-06-22    (name, tarzan)
1        A102  2022-06-22  (status, single)

Then:

df["attribute"] = df["json_object"].apply(lambda x: "{}: {}".format(*x))
df = df.drop(columns="json_object")

print(df)

Prints:

  customer_id        date       attribute
0        A101  2022-06-21     name: james
0        A101  2022-06-21         age: 55
0        A101  2022-06-21     hobby: pubg
1        A102  2022-06-22    name: tarzan
1        A102  2022-06-22  status: single

CodePudding user response：

This is not the data structure you asked for and it depends on your subsequent steps, but it is generally a good idea to put one value into one cell and don't mix different data in a single column. So with the data

data = {"customer_id": "A101 A102".split(), 
        "date": "2022-06-21 2022-06-22".split(), 
        "json_object": [{'name':['james'], 'age':[55], 'hobby':['pubg']}, {'name':['tarzan'],'status':[]}]}

df = pd.DataFrame(pd.DataFrame(data))

and the df

  customer_id        date                                        json_object
0        A101  2022-06-21  {'name': ['james'], 'age': [55], 'hobby': ['pu...
1        A102  2022-06-22                 {'name': ['tarzan'], 'status': []}

you could also do something like

pd.concat([df, pd.DataFrame.from_records(df.json_object)], axis=1).drop("json_object", axis=1)

which will give you the data in separate columns, (almost) ready for further processing.

  customer_id        date      name   age   hobby status
0        A101  2022-06-21   [james]  [55]  [pubg]    NaN
1        A102  2022-06-22  [tarzan]   NaN     NaN     []