python dictionary to json-CodePudding

the output of file comes as dictionary, with 5 columns. Due to the 5th column the first 4 are duplicated. My goals is to output it as a json, without duplicates in the following format.

Sample input:

test_dict = [
{'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"123"},
{'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"345"}
]

Previously there were no duplicates so it was easy to transform to json as below:

result = defaultdict(set)
for i in test_dict:
    id = i.get('ID')
    if id:
        result[i].add(i.get('ID_A'))
        result[i].add(i.get('ID_B'))
        result[i].add(i.get('ID_C'))


output = []
for id, details in result.items():
    output.append(
        {
            "ID": id,
            "otherDetails": {
                "IDs": [
                    {"id": ref} for ref in details
                ]
            },
        }
    )

How could I add INVOICE to this without duplicating the rows? The output would look like this:

[{'ID': '"A"',
  'OtherDetails': {'IDs': [{'id': 'A1'},
                           {'id': 'A2'},
                           {'id': 'A3'}],
                  {'INVOICE': [{'id':'123'},
                               {'id':'345'}]}}}]

Thanks! (python 3.9)

CodePudding user response：

Basically, you can just do the same as for the IDs, using a second defaultdict (or similar) for the invoice IDs. Afterwards, use a nested list/dict comprehension to build the final result.

id_to_ids = defaultdict(set)
id_to_inv = defaultdict(set)
for d in test_dict:
    id_to_ids[d["ID"]] |= {d[k] for k in ["ID_A", "ID_B", "ID_C"]}
    id_to_inv[d["ID"]] |= {d["INVOICE"]}

result = [{
    'ID': k, 
    'OtherDetails': {
        'IDs': [{'id': v} for v in id_to_ids[k]],
        'INVOICE': [{'id': v} for v in id_to_inv[k]]
    }} for k in id_to_ids]

Note, though, that using this format, you will lose the information which of the "other" IDs was which, and with that invoice ID those were associated.

CodePudding user response：

You were pretty close. I would make the intermediate dictionary a little bit more straight forward. And have it just be a diction with id, and two lists.

When walking the original data, you just need to append INVOICE if there is already an entry for the ID. Then when you create the "json" format (a list of dictionary for each ID), all you have to do is use the lists you already generate. Here is the structure I propose.

from collections import defaultdict

test_dict = [
{'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"123"},
{'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"345"}
]


result = {}
for i in test_dict:
    id = i.get('ID')
    if not id:
        continue
    if id in result:
        # just add INVOICE
        result[id]['INVOICE'].append(i.get('INVOICE'))
    else:
        # ID not in result dictionary, so populate it
        result[id] = {'IDs': [ i.get('ID_A'), i.get('ID_B'), i.get('ID_C')],
                      'INVOICE' : [i.get('INVOICE')]
                      }
 
output = []
for id, details in result.items():
    output.append(
        {
            "ID": id,
            "otherDetails": {
                "IDs": details['IDs'],
                'INVOICE': details['INVOICE']
                }
        }
    )

The trick for duplicate id's is handled by the if id in result where it only appends the invoice to the list of invoices. I will also add since we are using a lot of dict.get() calls rather than simple dict[], we are potentially adding a bunch of None's into these lists.

CodePudding user response：

The like the answer from @tobias_k, but it does not handle duplicate values for any of the ID_* or invoice columns. His answer is the most simple if order and repetition are not important. Checkout this if they are important.

import pandas as pd


def create_item(df: pd.DataFrame):
    output = list()
    groups = df.groupby(["ID", "ID_A", "ID_B", "ID_C"])[["INVOICE"]]
    for group, gdf in groups:
        row = dict()
        row["ID"] = group[0]
        row["OtherDetails"] = dict()
        row["OtherDetails"]["IDS"] = [{"id": x} for x in group[1:]]
        row["OtherDetails"]["INVOICE"] = [{"id": x} for x in gdf["INVOICE"]]
        output.append(row)

    return output


test_dict = [
    {"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
    {"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "345"},
    {"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
    {"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "345"},
    {"ID": "B", "ID_A": "A1", "ID_B": "A1", "ID_C": "A1", "INVOICE": "456"},
    {"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
]

test_df = pd.DataFrame(test_dict)

create_item(test_df)

Which will return

[{'ID': 'A',
  'OtherDetails': {'IDS': [{'id': 'A1'}, {'id': 'A2'}, {'id': 'A3'}],
   'INVOICE': [{'id': '123'}, {'id': '345'}]}},
 {'ID': 'B',
  'OtherDetails': {'IDS': [{'id': 'A1'}, {'id': 'A2'}, {'id': 'A3'}],
   'INVOICE': [{'id': '123'}, {'id': '345'}, {'id': '123'}]}}]