I have a set of 3 JSON files with an identical layout. The number will be much higher once I push the code into production; I only use 3 to keep the workflow fast.
The JSON files are structured like this:
{
    "results": [
        {
            "engagement": {
                "id": 2342,
                "portalId": 23423,
                "active": true,
                "createdAt": 1661855667536,
                "lastUpdated": 1661935264761,
                "modifiedBy": 3453
            },
            "associations": {
                "contactIds": [
                    00000
                ],
                "companyIds": [],
                "dealIds": []
            },
            "attachments": [],
            "scheduledTasks": [],
            "metadata": {
                "status": "COMPLETED",
                "forObjectType": "CONTACT",
                "subject": "DEMO"
            }
        },
    ],
    "hasMore": true,
    "offset": 520,
    "total": 10523
}
There could be up to 250 records in the "results" array, each starting with "engagement".
I am trying to merge all 3 JSON files with Python, keeping only the data in "results" and dropping the rest.
So far I can either concatenate all 3 JSONs, but then they are still separated under their own "results" keys, or the last JSON overwrites the previously written file, so I am no further.
The expected result would look like this:
[
    {
        "engagement": {
            "id": 2342,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                00000
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
],
[
    {
        "engagement": {
            "id": 2342,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                00000
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
],
[
    {
        "engagement": {
            "id": 2342,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                00000
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
],
Any help will be much appreciated.
CodePudding user response:
This is relatively simple, but I would restructure the resulting JSON a bit, as the current structure does not make much sense.
The code below simply loads each file and appends the first list element from its "results" dict to the final_result list. You end up with one list where each element contains the part you needed from the original JSON files. It then saves that list to a new file.
import json

filelist = ["file1.json", "file2.json", "file3.json"]
final_result = []

for filename in filelist:
    with open(filename) as infile:
        newdata = json.load(infile)
        # grab the first list element from the results dict
        newdata = newdata["results"][0]
        final_result.append(newdata)

with open("result.json", "w") as outfile:
    json.dump(final_result, outfile, indent=4)
result.json
[
    {
        "engagement": {
            "id": 1,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                21345
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
    {
        "engagement": {
            "id": 2,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                21345
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
    {
        "engagement": {
            "id": 3,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                21345
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    }
]
CodePudding user response:
After help from @Edo Aske I found a solution to the problem. The key change is extending final_result with the whole "results" list instead of appending only its first element, so every record in each file is kept. The final code looks like this:
import json
import os

path = '/content/extracted_data/'
json_files = [jfile for jfile in os.listdir(path) if jfile.endswith('.json')]
final_result = []

for filename in json_files:
    with open(path + filename) as infile:
        newdata = json.load(infile)
        # grab the whole list of records under "results"
        newdata = newdata["results"]
        final_result.extend(newdata)

with open("result.json", "w") as outfile:
    json.dump(final_result, outfile, indent=4)
The result is that all the records from the JSON files end up as separate dicts in a single list, and from there we can easily put them in a DataFrame using pd.json_normalize.
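For completeness, a minimal sketch of that last step, assuming the result.json produced by the snippet above:

import json

import pandas as pd

# load the merged records written by the snippet above
with open("result.json") as f:
    records = json.load(f)

# flatten the nested dicts into columns such as
# "engagement.id", "metadata.status", "associations.contactIds"
df = pd.json_normalize(records)
print(df.head())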
Thanks for all your help guys!