I have a JSON file containing daily COVID case data for all 13 Canadian provinces and territories, and I'm trying to restructure it so I can load it into MongoDB and get 13 collections of province documents, each with code and name keys at the root and an array of nested stats. My problem is that the JSON file does not group all the stats belonging to each province; in other words, each province appears several times [fig 1] instead of once with all of its stats grouped [fig 2] (this last result is the one I need).
Sample code:
a) This is how my JSON currently looks (this is only a sample of the full file, but it should be enough to illustrate the structure):
[
  {
    "code": "AB",
    "province": "Alberta",
    "stats": [
      {
        "cases": 0,
        "cumulative_cases": 0,
        "date_report": "25-01-2020"
      }
    ]
  },
  {
    "code": "AB",
    "province": "Alberta",
    "stats": [
      {
        "cases": 0,
        "cumulative_cases": 0,
        "date_report": "26-01-2020"
      }
    ]
  },
  {
    "code": "AB",
    "province": "Alberta",
    "stats": [
      {
        "cases": 0,
        "cumulative_cases": 0,
        "date_report": "27-01-2020"
      }
    ]
  }
]
b) This is the result that I need (again, this is only a sample; if I can get a hint on how to transform it, I think I can apply the same approach to the whole file):
[
  {
    "code": "AB",
    "province": "Alberta",
    "stats": [
      {
        "cases": 0,
        "cumulative_cases": 0,
        "date_report": "25-01-2020"
      },
      {
        "cases": 0,
        "cumulative_cases": 0,
        "date_report": "26-01-2020"
      },
      {
        "cases": 0,
        "cumulative_cases": 0,
        "date_report": "27-01-2020"
      }
    ]
  }
]
My code:
#Import modules json and pandas
import json as js
import pandas as pd

#Read the json and convert it to dict (raw string so the backslashes are not treated as escapes)
with open(r'C:\project_files\covid_rawdata.json') as f:
    cases_dict = js.load(f)
cases_json = js.dumps(cases_dict, indent=2)

#Filter all provinces to remove "Repatriated" values
filterDict = [x for x in cases_dict if x['province'] != 'Repatriated']

#Loop over the records to add the province code and fix the province name
#(the original JSON had wrong province names and did not include the province code)
for c in filterDict:
    #Move the flat stat fields into a one-element "stats" list
    c['stats'] = [{"cases": c['cases'], "cumulative_cases": c['cumulative_cases'], "date_report": c['date_report']}]
    del c['date_report'], c['cumulative_cases'], c['cases']
    if c['province'] == 'BC':
        c['code'] = 'BC'
        c['province'] = 'British Columbia'
    if c['province'] == 'NL':
        c['code'] = 'NL'
        c['province'] = 'Newfoundland and Labrador'
    if c['province'] == 'NWT':
        c['code'] = 'NWT'
        c['province'] = 'Northwest Territories'
    if c['province'] == 'PEI':
        c['province'] = 'Prince Edward Island'
        c['code'] = 'PEI'
    if c['province'] == 'Manitoba':
        c['code'] = 'MB'
    if c['province'] == 'Ontario':
        c['code'] = 'ON'
    if c['province'] == 'New Brunswick':
        c['code'] = 'NB'
    if c['province'] == 'Nova Scotia':
        c['code'] = 'NS'
    if c['province'] == 'Nunavut':
        c['code'] = 'NU'
    if c['province'] == 'Quebec':
        c['code'] = 'QB'
    if c['province'] == 'Yukon':
        c['code'] = 'YT'
    if c['province'] == 'Saskatchewan':
        c['code'] = 'SK'
    if c['province'] == 'Alberta':
        c['code'] = 'AB'

#print(filterDict)
#Write the transformed records to a new JSON file
with open('filterDict3.json', 'w') as f:
    js.dump(filterDict, f, indent=2, sort_keys=True)
Additional information: I'm working in Visual Studio 2019 with Python 3.9
Thank you very much for taking the time to review my question, I really appreciate your support. N.
CodePudding user response:
You can use pandas: load your dictionary/JSON as a DataFrame, group by code/province, and aggregate with sum (summing lists concatenates them, so the stats lists of each province are merged):
import pandas as pd

data = [
    {
        "code": "AB",
        "province": "Alberta",
        "stats": [
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "25-01-2020"
            }
        ]
    },
    {
        "code": "AB",
        "province": "Alberta",
        "stats": [
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "26-01-2020"
            }
        ]
    },
    {
        "code": "AB",
        "province": "Alberta",
        "stats": [
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "27-01-2020"
            }
        ]
    }
]

df = pd.DataFrame(data)  # or use pd.read_json(file)
print(df.groupby(['code', 'province'], as_index=False).sum()
        .to_json(indent=4, orient='records'))
Output:
[
    {
        "code":"AB",
        "province":"Alberta",
        "stats":[
            {
                "cases":0,
                "cumulative_cases":0,
                "date_report":"25-01-2020"
            },
            {
                "cases":0,
                "cumulative_cases":0,
                "date_report":"26-01-2020"
            },
            {
                "cases":0,
                "cumulative_cases":0,
                "date_report":"27-01-2020"
            }
        ]
    }
]
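If you would rather not depend on pandas for this step, the same grouping can be sketched with a plain dictionary. This is only a minimal sketch: the function name `group_stats` and the `sample` list are made up for illustration, and it assumes every record has `code`, `province` and `stats` keys, as in the sample above.

```python
from collections import defaultdict


def group_stats(records):
    """Merge records that share (code, province) into one record each.

    Hypothetical helper: assumes every record has "code", "province"
    and a "stats" list, as in the sample data.
    """
    grouped = defaultdict(list)
    for record in records:
        key = (record["code"], record["province"])
        # extend() appends every stat entry to that province's list
        grouped[key].extend(record["stats"])
    # dicts keep insertion order, so provinces come out in first-seen order
    return [
        {"code": code, "province": province, "stats": stats}
        for (code, province), stats in grouped.items()
    ]


# Illustrative sample only, not the real data set
sample = [
    {"code": "AB", "province": "Alberta",
     "stats": [{"cases": 0, "cumulative_cases": 0, "date_report": "25-01-2020"}]},
    {"code": "AB", "province": "Alberta",
     "stats": [{"cases": 0, "cumulative_cases": 0, "date_report": "26-01-2020"}]},
    {"code": "BC", "province": "British Columbia",
     "stats": [{"cases": 0, "cumulative_cases": 0, "date_report": "25-01-2020"}]},
]

merged = group_stats(sample)
```

The resulting `merged` list has one dictionary per province and can be written out with `json.dump` in the same way as in the question.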
CodePudding user response:
This JSON can be transformed using the jmespath library; see its documentation for the expression syntax. Sample data:
data = [
    {
        "code": "AB",
        "province": "Alberta",
        "stats": [
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "25-01-2020"
            }
        ]
    },
    {
        "code": "AB",
        "province": "Alberta",
        "stats": [
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "26-01-2020"
            }
        ]
    },
    {
        "code": "AB",
        "province": "Alberta",
        "stats": [
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "27-01-2020"
            }
        ]
    },
    {
        "code": "BC",
        "province": "British Columbia",
        "stats": [
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "25-11-2022"
            }
        ]
    },
    {
        "code": "BC",
        "province": "British Columbia",
        "stats": [
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "22-11-2022"
            }
        ]
    },
    {
        "code": "BC",
        "province": "British Columbia",
        "stats": [
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "27-11-2022"
            }
        ]
    }
]
Sample code to transform the JSON with a jmespath expression:
import jmespath

# list every code to iterate over; each code is used to collect that province's stats
codes = ['BC', 'AB']
final_list = []
for code in codes:
    # build one expression per code: take the first code/province match and flatten all matching stats lists
    expr = '{code: [?code==`CODE_VALUE`].code|[0], province: [?code==`CODE_VALUE`].province|[0], stats: [?code==`CODE_VALUE`].stats[]}'.replace("CODE_VALUE", code)
    result = jmespath.search(expr, data)
    final_list.append(result)
print("final_result: {}".format(final_list))
Output (final_list):
[
    {
        "code": "BC",
        "province": "British Columbia",
        "stats": [
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "25-11-2022"
            },
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "22-11-2022"
            },
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "27-11-2022"
            }
        ]
    },
    {
        "code": "AB",
        "province": "Alberta",
        "stats": [
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "25-01-2020"
            },
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "26-01-2020"
            },
            {
                "cases": 0,
                "cumulative_cases": 0,
                "date_report": "27-01-2020"
            }
        ]
    }
]
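The hard-coded codes list has to be kept in sync with the data by hand. A small sketch of deriving it from the records instead, assuming every record carries a `code` key; the helper name `distinct_codes` and the `sample` list are hypothetical and only for illustration:

```python
def distinct_codes(records):
    """Return each province code once, in order of first appearance.

    Hypothetical helper: assumes every record has a "code" key.
    """
    # dict.fromkeys drops duplicates while preserving insertion order
    return list(dict.fromkeys(record["code"] for record in records))


# Illustrative sample only; the real records also carry province and stats
sample = [{"code": "AB"}, {"code": "AB"}, {"code": "BC"}, {"code": "AB"}]

codes = distinct_codes(sample)
print(codes)  # ['AB', 'BC']
```

The resulting list can then be used in place of the literal `codes = ['BC', 'AB']` in the loop above.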