I have json files in this generic format:
{"attribute1": "test1",
"attribute2": "test2",
"data": {
"0":
{"metadata": {
"timestamp": "2022-08-14"},
"detections": {
"0": {"dim1": 40, "dim2": 30},
"1": {"dim1": 50, "dim2": 20}}},
"1":
{"metadata": {
"timestamp": "2022-08-15"},
"detections": {
"0": {"dim1": 30, "dim2": 10},
"1": {"dim1": 100, "dim2": 80}}}}}
These json files refer to collections of measurements taken with a 3D camera. The upper-level keys under data correspond to frames; each frame has its own metadata and can contain multiple detections objects, each with its own dimensions (represented here by dim1 and dim2). I want to convert this type of json file to a pandas DataFrame in the following format:
timestamp | dim1 | dim2
--- | --- | ---
2022-08-14 | 40 | 30
2022-08-14 | 50 | 20
2022-08-15 | 30 | 10
2022-08-15 | 100 | 80
So, any fields in metadata (here I only added timestamp, but there could be several) must be repeated for each entry under the detections key.
I can convert this type of json to a pandas DataFrame, but it requires multiple steps and for loops within a single file to concatenate everything at the end. I have also tried pd.json_normalize, playing with its record_path, meta and max_level arguments, but so far I have not managed to convert this type of json to a DataFrame in just a few steps. Is there a clean way to do this?
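For context, the multi-step loop version I am trying to replace looks roughly like this (function and variable names are mine, purely for illustration):

```python
import pandas as pd

def frames_to_df(parsed):
    """Flatten the nested frame/detection structure into one DataFrame."""
    parts = []
    for frame in parsed["data"].values():
        # one DataFrame per frame, one row per detection
        part = pd.DataFrame(list(frame["detections"].values()))
        # repeat every metadata field on each detection row
        for key, value in frame["metadata"].items():
            part[key] = value
        parts.append(part)
    return pd.concat(parts, ignore_index=True)
```

It works, but it is exactly the loop-and-concatenate pattern I would like to avoid.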
CodePudding user response:
I think a good solution could be:
import pandas as pd

# `d` is the parsed json (e.g. the result of json.load on the file)
data = [dict(d1, **{'detections': list(d1['detections'].values())})
        for d1 in d['data'].values()]
# equivalently, with map:
# data = list(map(lambda d1: dict(d1,
#                                 **{'detections': list(d1['detections'].values())}),
#                 d['data'].values()))
print(data)

df = (pd.json_normalize(data, 'detections', [['metadata', 'timestamp']])
        .rename({'metadata.timestamp': 'timestamp'}, axis=1))
print(df)
#[{'metadata': {'timestamp': '2022-08-14'}, 'detections': [{'dim1': 40, 'dim2': 30}, {'dim1': 50, 'dim2': 20}]}, {'metadata': {'timestamp': '2022-08-15'}, 'detections': [{'dim1': 30, 'dim2': 10}, {'dim1': 100, 'dim2': 80}]}]
# dim1 dim2 timestamp
#0 40 30 2022-08-14
#1 50 20 2022-08-14
#2 30 10 2022-08-15
#3 100 80 2022-08-15
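The snippet assumes d already holds the parsed file. A minimal way to get there is json.load; the sample data and temporary file below are only there so the snippet runs on its own:

```python
import json
import tempfile

# Throwaway sample, written to disk only to make this self-contained;
# in practice you would already have the json file.
sample = {"data": {"0": {"metadata": {"timestamp": "2022-08-14"},
                         "detections": {"0": {"dim1": 40, "dim2": 30}}}}}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump(sample, fh)
    path = fh.name

# `d` is the dict the comprehension above iterates over
with open(path) as fh:
    d = json.load(fh)
```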
CodePudding user response:
Use a nested dictionary comprehension to flatten the values and merge the subdictionaries, then pass the result to the DataFrame constructor:
json = {"attribute1": "test1",
"attribute2": "test2",
"data": {
"0":
{"metadata": {
"timestamp": "2022-08-14"},
"detections": {
"0": {"dim1": 40, "dim2": 30},
"1": {"dim1": 50, "dim2": 20}}},
"1":
{"metadata": {
"timestamp": "2022-08-15"},
"detections": {
"0": {"dim1": 30, "dim2": 10},
"1": {"dim1": 100, "dim2": 80}}}}}
L = [{**x['metadata'], **y} for x in json['data'].values()
for y in x['detections'].values()]
df = pd.DataFrame(L)
print (df)
timestamp dim1 dim2
0 2022-08-14 40 30
1 2022-08-14 50 20
2 2022-08-15 30 10
3 2022-08-15 100 80
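As the question notes, metadata may carry more fields than timestamp; the same comprehension repeats all of them per detection with no changes. A small sketch with a made-up camera_id field:

```python
import pandas as pd

# same shape as the question's files, plus an extra (invented) metadata field
parsed = {"data": {
    "0": {"metadata": {"timestamp": "2022-08-14", "camera_id": "A"},
          "detections": {"0": {"dim1": 40, "dim2": 30},
                         "1": {"dim1": 50, "dim2": 20}}}}}

# every metadata field is merged into every detection row
rows = [{**frame['metadata'], **det}
        for frame in parsed['data'].values()
        for det in frame['detections'].values()]
df = pd.DataFrame(rows)
```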