Home > Back-end >  I need to transform the jsonline to df
I need to transform the jsonline to df

Time:03-02

Below is the example of my jsonline data

{"allocation":{"duration":1,"rewards":[{"expiryDate":"2021-01-01 23:59:59","rewardSubType":"Boxes"}]},"MODULE":"a","APPLICATION":"g","GUID2":"gh-gh-gh","METHOD":"Expensive","timestamp":"2021-01-01 07:00:00.497"}
{"allocation":{"duration":2,"rewards":[{"expiryDate":"2021-01-02 23:59:59","rewardSubType":"Keys"}]},"MODULE":"a","APPLICATION":"b","GUID":"gh-gh-gh","METHOD":"Free","timestamp":"2021-01-02 07:00:00.497"}

I have run the following code

import pandas as pd
import json
from pandas import json_normalize

with open('C:\\Vodacom\\KPI Dashboard\\ussd_vservices_audit.jsonl') as f:
    lines = f.read().splitlines()

df_inter = pd.DataFrame(lines)
df_inter.columns = ['json_element']
df_inter['json_element'].apply(json.loads)
df_final = pd.json_normalize(df_inter['json_element'].apply(json.loads))
df_final

That results in

result

However, I need to further split up the newly created allocation.rewards (old "rewards") column into separate columns too. It now has square brackets and single quotes too so I have no idea how to do this.

Please assist Thanks!

Amazing answer thank you - exactly what I needed! But now I have found when running against my full dataset, I have some rows of data that don't have a rewards section/column as below:

{"offer":null,"MODULE":"purchase","APPLICATION":"new","GUID":"cf-83-11-001","METHOD":"Upsize","timestamp":"2021-02-01 06:00:01.158"}

This causes me to then have the error TypeError: object of type 'float' has no len()

CodePudding user response:

You can continue your script with below code and get the desired results:

for index, row in df_final.iterrows():
  value = row["allocation.rewards"]
  if value == value and len(value) == 1:
    for key, keyValue in value[0].items():
      df_final.loc[index, key] = keyValue
df_final.drop(columns=["allocation.rewards"], inplace=True)
df_final

Output

MODULE APPLICATION GUID2 METHOD timestamp allocation.duration GUID offer expiryDate rewardSubType
0 a g gh-gh-gh Expensive 2021-01-01 07:00:00.497 1 nan nan 2021-01-01 23:59:59 Boxes
1 a b nan Free 2021-01-02 07:00:00.497 2 gh-gh-gh nan 2021-01-02 23:59:59 Keys
2 purchase new nan Upsize 2021-02-01 06:00:01.158 nan cf-83-11-001 nan nan nan

CodePudding user response:

If your rewards value only contains one element in list, you can explode allocation.rewards column in df_final

df_final['allocation.rewards'].apply(lambda row: pd.Series(row[0]))

            expiryDate rewardSubType
0  2021-01-01 23:59:59         Boxes
1  2021-01-02 23:59:59          Keys

Then concat it to df_final and drop the allocation.rewards column

df_ = pd.concat([df_final, df_final['allocation.rewards'].apply(lambda row: pd.Series(row[0]))], axis=1)
df_ = df_.drop('allocation.rewards', 1)

  MODULE APPLICATION     GUID2     METHOD                timestamp  allocation.duration      GUID           expiryDate rewardSubType
0      a           g  gh-gh-gh  Expensive  2021-01-01 07:00:00.497                    1       NaN  2021-01-01 23:59:59         Boxes
1      a           b       NaN       Free  2021-01-02 07:00:00.497                    2  gh-gh-gh  2021-01-02 23:59:59          Keys
  • Related