Json to CSV file conversion using python-CodePudding

I have a sample JSON file like this

data = {
    "type": "video",
    "videoID": "vid001",
    "links": [
        {"type": "video", "videoID": "vid002", "links": []},
        {"type": "video",
         "videoID": "vid003",
         "links": [
             {"type": "video", "videoID": "vid004"},
             {"type": "video", "videoID": "vid005"},
         ]
         },
        {"type": "video", "videoID": "vid006"},
        {"type": "video",
         "videoID": "vid007",
         "links": [
             {"type": "video", "videoID": "vid008", "links": [
                 {"type": "video",
                  "videoID": "vid009",
                  "links": [{"type": "video", "videoID": "vid010"}]
                  }
             ]}
         ]},
    ]
}

i need to take only specific key and values from the json file and convert it into a CSV file

Code: Ref:Extracting Specific Keys/Values From A Messed-Up JSON File (Python)

def extract(data, keys):
    out = []
    queue = [data]
    while len(queue) > 0:
        current = queue.pop(0)
        if type(current) == dict:
            for key in keys:
                if key in current:
                    out.append({key: current[key]})

            for val in current.values():
                if type(val) in [list, dict]:
                    queue.append(val)
        elif type(current) == list:
            queue.extend(current)
    return out

x = extract(data, ["videoID","type"])
print(pd.DataFrame.from_dict(x))

When I'm passing 2 values through the extract(), getting NaN in between result

videoID   type
0   vid001    NaN
1      NaN  video
2   vid002    NaN
3      NaN  video
4   vid003    NaN
5      NaN  video
6   vid006    NaN
7      NaN  video
8   vid007    NaN
9      NaN  video
10  vid004    NaN
11     NaN  video
12  vid005    NaN
13     NaN  video
14  vid008    NaN
15     NaN  video
16  vid009    NaN
17     NaN  video
18  vid010    NaN
19     NaN  video

I need to get an output like the below

    videoID   type
0   vid001    video
1   vid002    video
2   vid003    video
3   vid004    video
etc...

and convert it into a csv file, can somebody help me to solve this issue

CodePudding user response：

Try recursion:

def get_type_videoid(o):
    if isinstance(o, dict):
        if "type" in o and "videoID" in o:
            yield o["videoID"], o["type"]
        for v in o.values():
            yield from get_type_videoid(v)
    elif isinstance(o, list):
        for v in o:
            yield from get_type_videoid(v)


df = pd.DataFrame(get_type_videoid(data), columns=["videoID", "type"])
print(df)

Prints:

  videoID   type
0  vid001  video
1  vid002  video
2  vid003  video
3  vid004  video
4  vid005  video
5  vid006  video
6  vid007  video
7  vid008  video
8  vid009  video
9  vid010  video

CodePudding user response：

Your method seems ok to me. You are just making a mistake in the for key in keysloop: what you are currently doing is you are creating a single dict ({key: current[key]}) for every element. So at the end you will have a out list that is a list of dicts that are not related, in which each videoID is in a dict and type is in a different dict. Like this:

[{'videoID': 'vid001'}, {'type': 'video'}, {'videoID': 'vid002'}, {'type': 'video'}, {'videoID': 'vid003'}, {'type': 'video'}, {'videoID': 'vid006'}, {'type': 'video'}, {'videoID': 'vid007'}, {'type': 'video'}, {'videoID': 'vid004'}, {'type': 'video'}, {'videoID': 'vid005'}, {'type': 'video'}, {'videoID': 'vid008'}, {'type': 'video'}, {'videoID': 'vid009'}, {'type': 'video'}, {'videoID': 'vid010'}, {'type': 'video'}]

What you would want instead is:

[{'videoID': 'vid001', 'type': 'video'}, {'videoID': 'vid002', 'type': 'video'}, {'videoID': 'vid003', 'type': 'video'}, {'videoID': 'vid006', 'type': 'video'}, {'videoID': 'vid007', 'type': 'video'}, {'videoID': 'vid004', 'type': 'video'}, {'videoID': 'vid005', 'type': 'video'}, {'videoID': 'vid008', 'type': 'video'}, {'videoID': 'vid009', 'type': 'video'}, {'videoID': 'vid010', 'type': 'video'}]

where each videoID is related to its type.

To do so you just have to create a dict when looping on keys, add each element to the dict and then append the dict to out list at the end of the loop on keys.

What I would do is this:

data_couple = {}
for key in keys:
    if key in current:
        data_couple[key] = current[key]
        # out.append({key: current[key]})
out.append(data_couple)``

so the whole extract function would become:

def extract(data, keys):
    out = []
    queue = [data]
    while len(queue) > 0:
        current = queue.pop(0)
        if type(current) == dict:
            data_couple = {}
            for key in keys:
                if key in current:
                    data_couple[key] = current[key]
            out.append(data_couple)
            for val in current.values():
                if type(val) in [list, dict]:
                    queue.append(val)
        elif type(current) == list:
            queue.extend(current)
    return out

Finally, to write a dict in a csv file I would simply use csv DictWriter:

import csv

def writeToCsv(dictionary, col_name):
    with open("file.csv", "w") as f:
        wr = csv.DictWriter(f, fieldnames=col_name)
        wr.writeheader()
        for elem in dictionary:
            wr.writerow(elem)
    
writeToCsv(your_dictionary, ["videoID", "type"])

This would create a CSV with videoID and type columns Alternatively, as suggested by Ivan Calderon answer here you could also use pandas to_csv method in one line:

you should change your dictionary before:

x = extract(data, ["videoID","type"])
d = {k: [v] for k, v in x.items()}
pd.DataFrame.from_dict(data=d, orient='columns').to_csv('dict_file.csv')

but I am not sure about this...