Read json with multiple levels into a DataFrame [python]


I have json files in this generic format:

{"attribute1": "test1",
 "attribute2": "test2",
 "data": {
      "0": 
         {"metadata": {
             "timestamp": "2022-08-14"},
         "detections": {
             "0": {"dim1": 40, "dim2": 30},
             "1": {"dim1": 50, "dim2": 20}}},
      "1": 
         {"metadata": {
             "timestamp": "2022-08-15"},
         "detections": {
             "0": {"dim1": 30, "dim2": 10},
             "1": {"dim1": 100, "dim2": 80}}}}}

These json files contain collections of measurements from a 3D camera. The upper levels under the key data correspond to frames; each frame has its own metadata and can contain multiple detection objects, each with its own dimensions (represented here by dim1 and dim2). I want to convert this type of json file to a pandas DataFrame in the following format:

timestamp dim1 dim2
2022-08-14 40 30
2022-08-14 50 20
2022-08-15 30 10
2022-08-15 100 80

So, any fields in metadata (here I only added timestamp, but there could be several) must be repeated for each entry under the detections key.

I can convert this type of json to a pandas DataFrame, but it requires multiple steps and for loops within a single file, concatenating everything at the end. I have also tried pd.json_normalize, playing with the arguments record_path, meta and max_level, but so far I have not been able to convert this type of json to a DataFrame in a few steps. Is there a clean way to do this?

CodePudding user response:

I think a good solution could be:

import pandas as pd

# d is the parsed JSON (e.g. d = json.load(f))
data = [dict(d1, **{'detections': list(d1['detections'].values())}) 
        for d1 in d['data'].values()]
# equivalently:
#data = list(map(lambda d1: dict(d1, 
#                **{'detections': list(d1['detections'].values())}),
#               d['data'].values()))

print(data)
df = \
pd.json_normalize(data, 'detections', [['metadata', 'timestamp']])\
.rename({'metadata.timestamp': 'timestamp'}, axis=1)
print(df)

#[{'metadata': {'timestamp': '2022-08-14'}, 'detections': [{'dim1': 40, 'dim2': 30}, {'dim1': 50, 'dim2': 20}]}, {'metadata': {'timestamp': '2022-08-15'}, 'detections': [{'dim1': 30, 'dim2': 10}, {'dim1': 100, 'dim2': 80}]}]
#   dim1  dim2   timestamp
#0    40    30  2022-08-14
#1    50    20  2022-08-14
#2    30    10  2022-08-15
#3   100    80  2022-08-15
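Since the question mentions that metadata can hold several fields, the same idea extends to carrying over *all* metadata keys without naming them one by one: keep the whole metadata dict as meta, then normalize it into columns in a second pass. A minimal sketch, where frame_id is a hypothetical extra metadata field added for illustration:

```python
import pandas as pd

# Sample input in the question's format; "frame_id" is a hypothetical
# extra metadata field to show that every metadata key is carried over.
d = {"data": {
    "0": {"metadata": {"timestamp": "2022-08-14", "frame_id": "a"},
          "detections": {"0": {"dim1": 40, "dim2": 30},
                         "1": {"dim1": 50, "dim2": 20}}},
    "1": {"metadata": {"timestamp": "2022-08-15", "frame_id": "b"},
          "detections": {"0": {"dim1": 30, "dim2": 10},
                         "1": {"dim1": 100, "dim2": 80}}}}}

# Turn each frame's "detections" dict into a list so json_normalize
# can use it as the record path; keep the metadata dict whole as meta.
records = [{"metadata": f["metadata"],
            "detections": list(f["detections"].values())}
           for f in d["data"].values()]
df = pd.json_normalize(records, "detections", ["metadata"])

# Expand the metadata dicts into columns and drop the helper column.
df = df.join(pd.json_normalize(df.pop("metadata").tolist()))
```

This way new metadata fields appear as columns automatically, at the cost of a second json_normalize pass.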

CodePudding user response:

Use a nested dictionary comprehension to flatten the values and merge the subdictionaries, then pass the result to the DataFrame constructor:

import pandas as pd

# note: this name shadows the stdlib json module
json = {"attribute1": "test1",
 "attribute2": "test2",
 "data": {
      "0": 
         {"metadata": {
             "timestamp": "2022-08-14"},
         "detections": {
             "0": {"dim1": 40, "dim2": 30},
             "1": {"dim1": 50, "dim2": 20}}},
      "1": 
         {"metadata": {
             "timestamp": "2022-08-15"},
         "detections": {
             "0": {"dim1": 30, "dim2": 10},
             "1": {"dim1": 100, "dim2": 80}}}}}

L = [{**x['metadata'], **y} for x in json['data'].values() 
                            for y in x['detections'].values()]

df = pd.DataFrame(L)
print(df)
    timestamp  dim1  dim2
0  2022-08-14    40    30
1  2022-08-14    50    20
2  2022-08-15    30    10
3  2022-08-15   100    80
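In practice the input comes from files rather than an inline dict, so the same comprehension would typically follow a json.load. A minimal sketch, using json.loads on a string to stand in for reading a file (with a real file it would be json.load(open(path))):

```python
import json
import pandas as pd

# Stand-in for one of the files; real code would use json.load on a
# file object instead of json.loads on a string.
raw = '''{"attribute1": "test1", "attribute2": "test2",
          "data": {"0": {"metadata": {"timestamp": "2022-08-14"},
                         "detections": {"0": {"dim1": 40, "dim2": 30}}}}}'''
parsed = json.loads(raw)

# Merge each frame's metadata into every one of its detections.
rows = [{**frame["metadata"], **det}
        for frame in parsed["data"].values()
        for det in frame["detections"].values()]
df = pd.DataFrame(rows)
```

Because the comprehension spreads the whole metadata dict with **, any number of metadata fields is repeated per detection with no extra code.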