the output of file comes as dictionary, with 5 columns. Due to the 5th column the first 4 are duplicated. My goals is to output it as a json, without duplicates in the following format.
Sample input:
test_dict = [
{'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"123"},
{'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"345"}
]
Previously there were no duplicates so it was easy to transform to json as below:
result = defaultdict(set)
for i in test_dict:
id = i.get('ID')
if id:
result[i].add(i.get('ID_A'))
result[i].add(i.get('ID_B'))
result[i].add(i.get('ID_C'))
output = []
for id, details in result.items():
output.append(
{
"ID": id,
"otherDetails": {
"IDs": [
{"id": ref} for ref in details
]
},
}
)
How could I add INVOICE to this without duplicating the rows? The output would look like this:
[{'ID': '"A"',
'OtherDetails': {'IDs': [{'id': 'A1'},
{'id': 'A2'},
{'id': 'A3'}],
{'INVOICE': [{'id':'123'},
{'id':'345'}]}}}]
Thanks! (python 3.9)
CodePudding user response:
Basically, you can just do the same as for the IDs, using a second defaultdict
(or similar) for the invoice IDs. Afterwards, use a nested list/dict comprehension to build the final result.
id_to_ids = defaultdict(set)
id_to_inv = defaultdict(set)
for d in test_dict:
id_to_ids[d["ID"]] |= {d[k] for k in ["ID_A", "ID_B", "ID_C"]}
id_to_inv[d["ID"]] |= {d["INVOICE"]}
result = [{
'ID': k,
'OtherDetails': {
'IDs': [{'id': v} for v in id_to_ids[k]],
'INVOICE': [{'id': v} for v in id_to_inv[k]]
}} for k in id_to_ids]
Note, though, that using this format, you will lose the information which of the "other" IDs was which, and with that invoice ID those were associated.
CodePudding user response:
You were pretty close. I would make the intermediate dictionary a little bit more straight forward. And have it just be a diction with id, and two lists.
When walking the original data, you just need to append INVOICE if there is already an entry for the ID. Then when you create the "json" format (a list of dictionary for each ID), all you have to do is use the lists you already generate. Here is the structure I propose.
from collections import defaultdict
test_dict = [
{'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"123"},
{'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"345"}
]
result = {}
for i in test_dict:
id = i.get('ID')
if not id:
continue
if id in result:
# just add INVOICE
result[id]['INVOICE'].append(i.get('INVOICE'))
else:
# ID not in result dictionary, so populate it
result[id] = {'IDs': [ i.get('ID_A'), i.get('ID_B'), i.get('ID_C')],
'INVOICE' : [i.get('INVOICE')]
}
output = []
for id, details in result.items():
output.append(
{
"ID": id,
"otherDetails": {
"IDs": details['IDs'],
'INVOICE': details['INVOICE']
}
}
)
The trick for duplicate id's is handled by the if id in result
where it only appends the invoice to the list of invoices. I will also add since we are using a lot of dict.get() calls rather than simple dict[], we are potentially adding a bunch of None's into these lists.
CodePudding user response:
The like the answer from @tobias_k, but it does not handle duplicate values for any of the ID_* or invoice columns. His answer is the most simple if order and repetition are not important. Checkout this if they are important.
import pandas as pd
def create_item(df: pd.DataFrame):
output = list()
groups = df.groupby(["ID", "ID_A", "ID_B", "ID_C"])[["INVOICE"]]
for group, gdf in groups:
row = dict()
row["ID"] = group[0]
row["OtherDetails"] = dict()
row["OtherDetails"]["IDS"] = [{"id": x} for x in group[1:]]
row["OtherDetails"]["INVOICE"] = [{"id": x} for x in gdf["INVOICE"]]
output.append(row)
return output
test_dict = [
{"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
{"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "345"},
{"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
{"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "345"},
{"ID": "B", "ID_A": "A1", "ID_B": "A1", "ID_C": "A1", "INVOICE": "456"},
{"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
]
test_df = pd.DataFrame(test_dict)
create_item(test_df)
Which will return
[{'ID': 'A',
'OtherDetails': {'IDS': [{'id': 'A1'}, {'id': 'A2'}, {'id': 'A3'}],
'INVOICE': [{'id': '123'}, {'id': '345'}]}},
{'ID': 'B',
'OtherDetails': {'IDS': [{'id': 'A1'}, {'id': 'A2'}, {'id': 'A3'}],
'INVOICE': [{'id': '123'}, {'id': '345'}, {'id': '123'}]}}]