Home > Back-end >  How to merge multiple JSON objects into a Python DataFrame based on timestamp?
How to merge multiple JSON objects into a Python DataFrame based on timestamp?

Time:06-02

I have a graphql query that returns a string of JSON-formatted data, with 3 separate JSON objects inside. It looks like this:

{
  "data": {
    "readingsList": [
      {
        "value": 137,
        "millis": 1651449224000
      },
      {
        "value": 141,
        "millis": 1651448924000
      }
    ],
    "messagesList": [
      {
        "value": 138,
        "dateMillis": 1651445346000,
        "text": "foo",
        "type": "bar",
        "field1": False
      }
    ]
    "userList": [
      {
        "userTimezone": "America/Los_Angeles"
      }
    ]
  }
}

What I'm trying to do is

  1. Merge the first two objects (readingsList and messagesList) based on the time (millis and dateMillis) into a dataframe
  2. Convert that time into a UTC datetime value (e.g. 1651449224000 becomes 2022-05-01 18:53:44)
  3. Convert the UTC datetime value into local time for the user based on the users Timezone from userList

Desired output:

df.head(3)

    datetime             value   text   type   field1   ...
    2022-05-01 18:53:44  137     NA     NA     NA
    2022-05-01 18:48:44  141     NA     NA     NA
    2022-05-01 17:49:06  138     foo    bar    False

I can do steps 2 and 3 but I don't know how to do step 1.

If I convert the string using json.loads() and pd.read_json() I get the following output:

import json
import pandas as pd

json_str = load_data_gql(...)
j = json.loads(json_str)
df = pd.read_json(j)

df.head()

                  data
    groupsList    [{'userTimezone': 'America/Los_Angeles'}]
    messagesList  [{'value': 138, 'dateMillis': 1651445346000, ...
    readingsList  [{'value': 137, 'millis': 1651449224000}, {'value'...

I now suspect that the answer has to somehow do with json_normalize() but I'm having difficulty applying what I read in that documentation to navigate my JSON objects properly.

Any advice or help would be greatly appreciated, thank you so much in advance.

CodePudding user response:

Proposed Solution:

Merging the dataframes in this case can be done with pandas.concat([df_1,df_2])

Here's the code I used:

import json
import pandas as pd

json_obj = json.load(open('json_str_file.json', 'r')) # if reading from file
# json_obj = json.loads(json_str) # if reading from a string

# create two separate frames from each nested dictionary object
df_1 = pd.DataFrame.from_dict(json_obj['data']['messagesList'])
df_2 = pd.DataFrame.from_dict(json_obj['data']['readingsList'])

# set the index to the column you want to merge them on
df_1.set_index('dateMillis', inplace=True)
df_2.set_index('millis', inplace=True)

# use pd.concat to stack the dataframes together
df_merged = pd.concat([df_1,df_2])

# fix field1 to be a boolean field
df_merged['field1'] = df_merged['field1'].astype(bool)

# confirm the result matches the target
print(df_merged)

Output

               value text type  field1
1651445346000    138  foo  bar   False
1651449224000    137  NaN  NaN    True
1651448924000    141  NaN  NaN    True

From here you should be able to do steps 2 and 3 from your post.

Issues with the JSON

The example you gave had some formatting issues that might cause some confusion. messagesList and readingsList needed to be separated by a ',' for me. Also json.load() didn't like the value of False in my example.

Here is the reformatted JSON

{
  "data": {
    "readingsList": [
      {
        "value": 137,
        "millis": 1651449224000
      },
      {
        "value": 141,
        "millis": 1651448924000
      }
    ],
    "messagesList": [
      {
        "value": 138,
        "dateMillis": 1651445346000,
        "text": "foo",
        "type": "bar",
        "field1": 0
      }
    ],
    "userList": [
      {
        "userTimezone": "America/Los_Angeles"
      }
    ]
  }
}

Potential Confusion:

  • JSON string could be formatted poorly
  • json.loads() returns an object of type dict with nested elements.
  • pd.read_json() expects a object of type str
  • using pd.DataFrame.from_dict() works with dict objects and allows you to address the nested components like this: j['data']['messagesList']
  • Related