How do I assess a JSON's content layout and convert the nested JSON into a dataframe?-CodePudding

I have looked at related threads here and here but am still having issue when using the methods shown.

If I use the content method

import json
import requests
import pandas as pd

url1 = "https://chronicdata.cdc.gov/views/iw6q-r3ja/rows.json"

content = requests.get(url1)
d = content.json()
print(d.keys())

I get

dict_keys(['meta', 'data'])

but using

pd.DataFrame(d['data'])

or using

url1 = "https://chronicdata.cdc.gov/views/iw6q-r3ja/rows.json"

response = urlopen(url1)
data = json.loads(response.read())
print(type(data))
print(data.keys())
df = pd.DataFrame(data['data'])

I get strange fields, no headers, and so on. I have not used JSONs much and I've honestly found trying to understand what function parameters/arguments I need to use a bit daunting, hence my question here. I'd really like to grasp how I should handle this in the future in order to proceed efficiently. When I paste/open the complete download link in Firefox I get a nice markdown configuration but am unclear if anything here gives me insight as to additional arguments to more easily unpack the JSON.

In some cases for live or regularly updated data JSONs will be the only readily-available format, so I'd like to get a grasp of them. Secondly, I'd like to be more efficient in using them because this has been quite the headache when I can simply use

url2 = "https://data.cdc.gov/api/views/iw6q-r3ja/rows.csv"
df = pd.read_csv(url2)

And have the frame with headers without a lot of head scratching.

How do I best assess a JSON once I've pulled it? Is there a way I can format my code so that headers are properly read and placed and columns appear as they should if the data is later updated on the live url?

CodePudding user response：

To get dataframe with columns and expanded GeoLocation column you can use this example:

import requests
import pandas as pd

url = "https://chronicdata.cdc.gov/views/iw6q-r3ja/rows.json"
data = requests.get(url).json()
columns = [c["name"] for c in data["meta"]["view"]["columns"]]
df = pd.DataFrame(data["data"], columns=columns)

# transform GeoLocation column
geo_columns = [
    "human_address",
    "latitude",
    "longitude",
    "machine_address",
    "needs_recoding",
]
m = df["GeoLocation"].isna()

df.loc[~m, "GeoLocation"] = df.loc[~m, "GeoLocation"].apply(
    lambda x: dict(zip(geo_columns, x))
)
df = pd.concat(
    [df, df.pop("GeoLocation").apply(pd.Series, dtype=object)], axis=1
)

# print some sample info:
print(df.head(10).to_markdown(index=False))

Prints:

sid	id	created_at	updated_at	meta	Year	LocationAbbr	LocationDesc	DataSource	PriorityArea1	PriorityArea2	PriorityArea3	PriorityArea4	Category	Topic	Indicator	Data_Value_Type	Data_Value_Unit	Data_Value	Data_Value_Alt	LowConfidenceLimit	HighConfidenceLimit	Break_Out_Category	Break_Out	CategoryId	TopicId	IndicatorID	Data_Value_TypeID	BreakOutCategoryId	BreakOutId	LocationID	States	Counties	human_address	latitude	longitude	machine_address	needs_recoding
row-n4wg.7kem-zdzf	00000000-0000-0000-F48B-EF7D8D2EEBDC	1560945568	1560945580	{ }	2006	US	United States	Medicare	None	None	None	None	Cardiovascular Diseases	Heart Failure	Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 )	Crude	Percent (%)	5.2	5.2	5.1	5.3	Race	Other	C1	T5	MD501	Crude	BOC04	RAC07	59			nan	nan	nan	nan	nan
row-4ps6~tir7~eak6	00000000-0000-0000-312E-7E157399CC1E	1560945568	1560945579	{ }	2005	US	United States	Medicare	None	None	None	None	Cardiovascular Diseases	Heart Failure	Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 )	Crude	Percent (%)	5.1	5.1	5.1	5.2	Race	Other	C1	T5	MD501	Crude	BOC04	RAC07	59			nan	nan	nan	nan	nan
row-5jvf.9pgz~ef7b	00000000-0000-0000-BE0E-5E5C8178E0D9	1560945568	1560945579	{ }	2007	US	United States	Medicare	None	None	None	None	Cardiovascular Diseases	Coronary Heart Disease	Prevalence of coronary heart disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 )	Crude	Percent (%)	8.5	8.5	8.4	8.5	Age	65	C1	T4	MD301	Crude	BOC03	AGE06	59			nan	nan	nan	nan	nan
row-bs36_6isx_ap5s	00000000-0000-0000-BA79-10F0450F4A5B	1560945568	1560945579	{ }	2008	US	United States	Medicare	None	None	None	None	Cardiovascular Diseases	Coronary Heart Disease	Prevalence of coronary heart disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 )	Crude	Percent (%)	5.1	5.1	5.1	5.1	Gender	Female	C1	T4	MD301	Crude	BOC02	GEN02	59			nan	nan	nan	nan	nan
row-nbji-7v9f_cqx7	00000000-0000-0000-DB13-EB09C35AA1A9	1560945568	1560945580	{ }	2004	US	United States	Medicare	None	None	None	None	Cardiovascular Diseases	Heart Failure	Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 )	Crude	Percent (%)	6.1	6.1	6.1	6.1	Overall	Overall	C1	T5	MD501	Crude	BOC01	OVR01	59			nan	nan	nan	nan	nan
row-xp3h.yc48-3kaj	00000000-0000-0000-81C5-78665D906D5B	1560945568	1560945581	{ }	2012	US	United States	Medicare	None	None	None	None	Cardiovascular Diseases	Heart Failure	Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 )	Crude	Percent (%)	4.6	4.6	4.6	4.7	Race	Other	C1	T5	MD501	Crude	BOC04	RAC07	59			nan	nan	nan	nan	nan
row-t8wh~9eb6~fwv3	00000000-0000-0000-143E-DFACD217B586	1560945568	1560945584	{ }	2005	NY	New York	Medicare	Million Hearts	None	None	None	Cardiovascular Diseases	Stroke	Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 )	Crude	Percent (%)	3	3	2.9	3	Age	65	C1	T6	MD601	Crude	BOC03	AGE06	36	47	2092	nan	42.827	-75.544	nan	0
row-b9ax-3swz_dtrq	00000000-0000-0000-6F25-A8B180F4A613	1560945568	1560945584	{ }	2008	CT	Connecticut	Medicare	Million Hearts	None	None	None	Cardiovascular Diseases	Stroke	Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 )	Crude	Percent (%)	3.6	3.6	3.5	3.7	Age	75	C1	T6	MD601	Crude	BOC03	AGE08	09	24	1043	nan	41.5627	-72.6498	nan	0
row-7yp3_x937_urxu	00000000-0000-0000-9924-6526B6266DAE	1560945568	1560945584	{ }	2008	ME	Maine	Medicare	Million Hearts	None	None	None	Cardiovascular Diseases	Stroke	Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 )	Crude	Percent (%)	6	6	3.7	8.3	Race	Other	C1	T6	MD601	Crude	BOC04	RAC07	23	49	1725	nan	45.2542	-68.985	nan	0
row-ajps-vayv-qtww	00000000-0000-0000-8381-8F41271D61D5	1560945568	1560945585	{ }	2010	OH	Ohio	Medicare	Million Hearts	None	None	None	Cardiovascular Diseases	Stroke	Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 )	Crude	Percent (%)	3.7	3.7	3.7	3.8	Age	75	C1	T6	MD601	Crude	BOC03	AGE08	39	38	587	nan	40.0602	-82.4043	nan	0