I have looked at related threads here and here but am still having issue when using the methods shown.
If I use the content method
import json
import requests
import pandas as pd
url1 = "https://chronicdata.cdc.gov/views/iw6q-r3ja/rows.json"
content = requests.get(url1)
d = content.json()
print(d.keys())
I get
dict_keys(['meta', 'data'])
but using
pd.DataFrame(d['data'])
or using
url1 = "https://chronicdata.cdc.gov/views/iw6q-r3ja/rows.json"
response = urlopen(url1)
data = json.loads(response.read())
print(type(data))
print(data.keys())
df = pd.DataFrame(data['data'])
I get strange fields, no headers, and so on. I have not used JSONs much and I've honestly found trying to understand what function parameters/arguments I need to use a bit daunting, hence my question here. I'd really like to grasp how I should handle this in the future in order to proceed efficiently. When I paste/open the complete download link in Firefox I get a nice markdown configuration but am unclear if anything here gives me insight as to additional arguments to more easily unpack the JSON.
In some cases for live or regularly updated data JSONs will be the only readily-available format, so I'd like to get a grasp of them. Secondly, I'd like to be more efficient in using them because this has been quite the headache when I can simply use
url2 = "https://data.cdc.gov/api/views/iw6q-r3ja/rows.csv"
df = pd.read_csv(url2)
And have the frame with headers without a lot of head scratching.
How do I best assess a JSON once I've pulled it? Is there a way I can format my code so that headers are properly read and placed and columns appear as they should if the data is later updated on the live url?
CodePudding user response:
To get dataframe with columns and expanded GeoLocation column you can use this example:
import requests
import pandas as pd
url = "https://chronicdata.cdc.gov/views/iw6q-r3ja/rows.json"
data = requests.get(url).json()
columns = [c["name"] for c in data["meta"]["view"]["columns"]]
df = pd.DataFrame(data["data"], columns=columns)
# transform GeoLocation column
geo_columns = [
"human_address",
"latitude",
"longitude",
"machine_address",
"needs_recoding",
]
m = df["GeoLocation"].isna()
df.loc[~m, "GeoLocation"] = df.loc[~m, "GeoLocation"].apply(
lambda x: dict(zip(geo_columns, x))
)
df = pd.concat(
[df, df.pop("GeoLocation").apply(pd.Series, dtype=object)], axis=1
)
# print some sample info:
print(df.head(10).to_markdown(index=False))
Prints:
sid | id | position | created_at | created_meta | updated_at | updated_meta | meta | Year | LocationAbbr | LocationDesc | DataSource | PriorityArea1 | PriorityArea2 | PriorityArea3 | PriorityArea4 | Category | Topic | Indicator | Data_Value_Type | Data_Value_Unit | Data_Value | Data_Value_Alt | Data_Value_Footnote_Symbol | Data_Value_Footnote | LowConfidenceLimit | HighConfidenceLimit | Break_Out_Category | Break_Out | CategoryId | TopicId | IndicatorID | Data_Value_TypeID | BreakOutCategoryId | BreakOutId | LocationID | States | Counties | human_address | latitude | longitude | machine_address | needs_recoding |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
row-n4wg.7kem-zdzf | 00000000-0000-0000-F48B-EF7D8D2EEBDC | 0 | 1560945568 | 1560945580 | { } | 2006 | US | United States | Medicare | None | None | None | None | Cardiovascular Diseases | Heart Failure | Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) | Crude | Percent (%) | 5.2 | 5.2 | 5.1 | 5.3 | Race | Other | C1 | T5 | MD501 | Crude | BOC04 | RAC07 | 59 | nan | nan | nan | nan | nan | ||||||
row-4ps6~tir7~eak6 | 00000000-0000-0000-312E-7E157399CC1E | 0 | 1560945568 | 1560945579 | { } | 2005 | US | United States | Medicare | None | None | None | None | Cardiovascular Diseases | Heart Failure | Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) | Crude | Percent (%) | 5.1 | 5.1 | 5.1 | 5.2 | Race | Other | C1 | T5 | MD501 | Crude | BOC04 | RAC07 | 59 | nan | nan | nan | nan | nan | ||||||
row-5jvf.9pgz~ef7b | 00000000-0000-0000-BE0E-5E5C8178E0D9 | 0 | 1560945568 | 1560945579 | { } | 2007 | US | United States | Medicare | None | None | None | None | Cardiovascular Diseases | Coronary Heart Disease | Prevalence of coronary heart disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) | Crude | Percent (%) | 8.5 | 8.5 | 8.4 | 8.5 | Age | 65 | C1 | T4 | MD301 | Crude | BOC03 | AGE06 | 59 | nan | nan | nan | nan | nan | ||||||
row-bs36_6isx_ap5s | 00000000-0000-0000-BA79-10F0450F4A5B | 0 | 1560945568 | 1560945579 | { } | 2008 | US | United States | Medicare | None | None | None | None | Cardiovascular Diseases | Coronary Heart Disease | Prevalence of coronary heart disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) | Crude | Percent (%) | 5.1 | 5.1 | 5.1 | 5.1 | Gender | Female | C1 | T4 | MD301 | Crude | BOC02 | GEN02 | 59 | nan | nan | nan | nan | nan | ||||||
row-nbji-7v9f_cqx7 | 00000000-0000-0000-DB13-EB09C35AA1A9 | 0 | 1560945568 | 1560945580 | { } | 2004 | US | United States | Medicare | None | None | None | None | Cardiovascular Diseases | Heart Failure | Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) | Crude | Percent (%) | 6.1 | 6.1 | 6.1 | 6.1 | Overall | Overall | C1 | T5 | MD501 | Crude | BOC01 | OVR01 | 59 | nan | nan | nan | nan | nan | ||||||
row-xp3h.yc48-3kaj | 00000000-0000-0000-81C5-78665D906D5B | 0 | 1560945568 | 1560945581 | { } | 2012 | US | United States | Medicare | None | None | None | None | Cardiovascular Diseases | Heart Failure | Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) | Crude | Percent (%) | 4.6 | 4.6 | 4.6 | 4.7 | Race | Other | C1 | T5 | MD501 | Crude | BOC04 | RAC07 | 59 | nan | nan | nan | nan | nan | ||||||
row-t8wh~9eb6~fwv3 | 00000000-0000-0000-143E-DFACD217B586 | 0 | 1560945568 | 1560945584 | { } | 2005 | NY | New York | Medicare | Million Hearts | None | None | None | Cardiovascular Diseases | Stroke | Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) | Crude | Percent (%) | 3 | 3 | 2.9 | 3 | Age | 65 | C1 | T6 | MD601 | Crude | BOC03 | AGE06 | 36 | 47 | 2092 | nan | 42.827 | -75.544 | nan | 0 | ||||
row-b9ax-3swz_dtrq | 00000000-0000-0000-6F25-A8B180F4A613 | 0 | 1560945568 | 1560945584 | { } | 2008 | CT | Connecticut | Medicare | Million Hearts | None | None | None | Cardiovascular Diseases | Stroke | Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) | Crude | Percent (%) | 3.6 | 3.6 | 3.5 | 3.7 | Age | 75 | C1 | T6 | MD601 | Crude | BOC03 | AGE08 | 09 | 24 | 1043 | nan | 41.5627 | -72.6498 | nan | 0 | ||||
row-7yp3_x937_urxu | 00000000-0000-0000-9924-6526B6266DAE | 0 | 1560945568 | 1560945584 | { } | 2008 | ME | Maine | Medicare | Million Hearts | None | None | None | Cardiovascular Diseases | Stroke | Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) | Crude | Percent (%) | 6 | 6 | 3.7 | 8.3 | Race | Other | C1 | T6 | MD601 | Crude | BOC04 | RAC07 | 23 | 49 | 1725 | nan | 45.2542 | -68.985 | nan | 0 | ||||
row-ajps-vayv-qtww | 00000000-0000-0000-8381-8F41271D61D5 | 0 | 1560945568 | 1560945585 | { } | 2010 | OH | Ohio | Medicare | Million Hearts | None | None | None | Cardiovascular Diseases | Stroke | Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) | Crude | Percent (%) | 3.7 | 3.7 | 3.7 | 3.8 | Age | 75 | C1 | T6 | MD601 | Crude | BOC03 | AGE08 | 39 | 38 | 587 | nan | 40.0602 | -82.4043 | nan | 0 |