I am new to Python. I have a dataframe obtained from a SQL query result:
| UserId | UserName | Reason_details |
|---|---|---|
| 851 | Bob | [{"reasonId":264, "reasonDescription":"prohibited", "reasonCodes":[1, 2]}, {"reasonId":267, "reasonDescription":"Expired", "reasonCodes":[25]}] |
| 852 | Jack | [{"reasonId":273, "reasonDescription":"Restricted", "reasonCodes":[29]}] |
I want to modify this dataframe by flattening the Reason_details column, with each reason and each reason code in its own row:
| UserId | UserName | Reason_id | Reason_description | Reason_codes |
|---|---|---|---|---|
| 851 | Bob | 264 | prohibited | 1 |
| 851 | Bob | 264 | prohibited | 2 |
| 851 | Bob | 267 | Expired | 25 |
| 852 | Jack | 273 | Restricted | 29 |
I flattened this data using good old for loops: iterating over each row of the source dataframe, reading the value of each key in the Reason_details column with json.loads, and then creating the final dataframe.
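Roughly what I do today (a minimal sketch with only the three relevant columns; my real code also carries the other 60 columns through):

import json
import pandas as pd

rows = []
for _, row in df.iterrows():
    # Reason_details comes back from SQL as a JSON string
    for reason in json.loads(row['Reason_details']):
        for code in reason['reasonCodes']:
            rows.append([row['UserId'], row['UserName'],
                         reason['reasonId'], reason['reasonDescription'], code])

final_df = pd.DataFrame(rows, columns=['UserId', 'UserName', 'Reason_id',
                                       'Reason_description', 'Reason_codes'])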
But I feel there has to be a better way of doing this using dataframe and JSON functions in Python.
PS: In my actual dataset there are 63 columns and 8 million rows, of which only the Reason_details column holds JSON. My existing approach is therefore very inefficient: it iterates over all rows and all columns, converting them into a 2D list first and building the final dataframe from that.
CodePudding user response:
Can you try this:

import pandas as pd

df = df.explode('Reason_details')  # one row per reason dict
df = (df.join(df['Reason_details'].apply(pd.Series))  # dict keys become columns
        .drop('Reason_details', axis=1)
        .explode('reasonCodes')    # one row per reason code
        .drop_duplicates())        # remove duplicates the duplicated-index join creates
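Note: this assumes Reason_details already holds Python lists of dicts. If it comes out of the SQL result as a raw JSON string (as in the question, where json.loads was needed), parse it first:

import json
df['Reason_details'] = df['Reason_details'].apply(json.loads)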
CodePudding user response:
Here is a slightly different manner:
df[['UserId', 'UserName']].merge(
    df['Reason_details']
      .explode()                # convert the list into rows
      .apply(pd.Series)         # create dict keys as columns
      .explode('reasonCodes'),  # convert the reason codes into rows
    left_index=True,            # merge back with the original DF
    right_index=True)
UserId UserName reasonId reasonDescription reasonCodes
0 851 Bob 264 prohibited 1
0 851 Bob 264 prohibited 2
0 851 Bob 267 Expired 25
1 852 Jack 273 Restricted 29
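With 8 million rows, .apply(pd.Series) can be slow because it builds a new Series for every row. A sketch of a usually faster variant (assuming Reason_details already holds lists of dicts) normalizes the exploded column in one call with pd.json_normalize and joins it back:

import pandas as pd

# one row per reason dict; ignore_index gives both frames a clean RangeIndex
exploded = df.explode('Reason_details', ignore_index=True)

# normalize all reason dicts at once instead of row-wise apply(pd.Series)
reasons = pd.json_normalize(exploded['Reason_details'].tolist())

out = (exploded.drop(columns='Reason_details')
               .join(reasons)   # RangeIndexes align one-to-one
               .explode('reasonCodes', ignore_index=True))

This also carries all 63 columns through without listing them, since everything except Reason_details survives the explode and join unchanged.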