Home > Blockchain >  How do I assess a JSON's content layout and convert the nested JSON into a dataframe?
How do I assess a JSON's content layout and convert the nested JSON into a dataframe?

Time:08-27

I have looked at related threads here and here but am still having issue when using the methods shown.

If I use the content method

import json
import requests
import pandas as pd

url1 = "https://chronicdata.cdc.gov/views/iw6q-r3ja/rows.json"

content = requests.get(url1)
d = content.json()
print(d.keys())

I get

dict_keys(['meta', 'data'])

but using

pd.DataFrame(d['data'])

or using

url1 = "https://chronicdata.cdc.gov/views/iw6q-r3ja/rows.json"

response = urlopen(url1)
data = json.loads(response.read())
print(type(data))
print(data.keys())
df = pd.DataFrame(data['data'])

I get strange fields, no headers, and so on. I have not used JSONs much and I've honestly found trying to understand what function parameters/arguments I need to use a bit daunting, hence my question here. I'd really like to grasp how I should handle this in the future in order to proceed efficiently. When I paste/open the complete download link in Firefox I get a nice markdown configuration but am unclear if anything here gives me insight as to additional arguments to more easily unpack the JSON.

In some cases for live or regularly updated data JSONs will be the only readily-available format, so I'd like to get a grasp of them. Secondly, I'd like to be more efficient in using them because this has been quite the headache when I can simply use

url2 = "https://data.cdc.gov/api/views/iw6q-r3ja/rows.csv"
df = pd.read_csv(url2)

And have the frame with headers without a lot of head scratching.

How do I best assess a JSON once I've pulled it? Is there a way I can format my code so that headers are properly read and placed and columns appear as they should if the data is later updated on the live url?

CodePudding user response:

To get dataframe with columns and expanded GeoLocation column you can use this example:

import requests
import pandas as pd

url = "https://chronicdata.cdc.gov/views/iw6q-r3ja/rows.json"
data = requests.get(url).json()
columns = [c["name"] for c in data["meta"]["view"]["columns"]]
df = pd.DataFrame(data["data"], columns=columns)

# transform GeoLocation column
geo_columns = [
    "human_address",
    "latitude",
    "longitude",
    "machine_address",
    "needs_recoding",
]
m = df["GeoLocation"].isna()

df.loc[~m, "GeoLocation"] = df.loc[~m, "GeoLocation"].apply(
    lambda x: dict(zip(geo_columns, x))
)
df = pd.concat(
    [df, df.pop("GeoLocation").apply(pd.Series, dtype=object)], axis=1
)

# print some sample info:
print(df.head(10).to_markdown(index=False))

Prints:

sid id position created_at created_meta updated_at updated_meta meta Year LocationAbbr LocationDesc DataSource PriorityArea1 PriorityArea2 PriorityArea3 PriorityArea4 Category Topic Indicator Data_Value_Type Data_Value_Unit Data_Value Data_Value_Alt Data_Value_Footnote_Symbol Data_Value_Footnote LowConfidenceLimit HighConfidenceLimit Break_Out_Category Break_Out CategoryId TopicId IndicatorID Data_Value_TypeID BreakOutCategoryId BreakOutId LocationID States Counties human_address latitude longitude machine_address needs_recoding
row-n4wg.7kem-zdzf 00000000-0000-0000-F48B-EF7D8D2EEBDC 0 1560945568 1560945580 { } 2006 US United States Medicare None None None None Cardiovascular Diseases Heart Failure Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) Crude Percent (%) 5.2 5.2 5.1 5.3 Race Other C1 T5 MD501 Crude BOC04 RAC07 59 nan nan nan nan nan
row-4ps6~tir7~eak6 00000000-0000-0000-312E-7E157399CC1E 0 1560945568 1560945579 { } 2005 US United States Medicare None None None None Cardiovascular Diseases Heart Failure Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) Crude Percent (%) 5.1 5.1 5.1 5.2 Race Other C1 T5 MD501 Crude BOC04 RAC07 59 nan nan nan nan nan
row-5jvf.9pgz~ef7b 00000000-0000-0000-BE0E-5E5C8178E0D9 0 1560945568 1560945579 { } 2007 US United States Medicare None None None None Cardiovascular Diseases Coronary Heart Disease Prevalence of coronary heart disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) Crude Percent (%) 8.5 8.5 8.4 8.5 Age 65 C1 T4 MD301 Crude BOC03 AGE06 59 nan nan nan nan nan
row-bs36_6isx_ap5s 00000000-0000-0000-BA79-10F0450F4A5B 0 1560945568 1560945579 { } 2008 US United States Medicare None None None None Cardiovascular Diseases Coronary Heart Disease Prevalence of coronary heart disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) Crude Percent (%) 5.1 5.1 5.1 5.1 Gender Female C1 T4 MD301 Crude BOC02 GEN02 59 nan nan nan nan nan
row-nbji-7v9f_cqx7 00000000-0000-0000-DB13-EB09C35AA1A9 0 1560945568 1560945580 { } 2004 US United States Medicare None None None None Cardiovascular Diseases Heart Failure Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) Crude Percent (%) 6.1 6.1 6.1 6.1 Overall Overall C1 T5 MD501 Crude BOC01 OVR01 59 nan nan nan nan nan
row-xp3h.yc48-3kaj 00000000-0000-0000-81C5-78665D906D5B 0 1560945568 1560945581 { } 2012 US United States Medicare None None None None Cardiovascular Diseases Heart Failure Prevalence of heart failure hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) Crude Percent (%) 4.6 4.6 4.6 4.7 Race Other C1 T5 MD501 Crude BOC04 RAC07 59 nan nan nan nan nan
row-t8wh~9eb6~fwv3 00000000-0000-0000-143E-DFACD217B586 0 1560945568 1560945584 { } 2005 NY New York Medicare Million Hearts None None None Cardiovascular Diseases Stroke Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) Crude Percent (%) 3 3 2.9 3 Age 65 C1 T6 MD601 Crude BOC03 AGE06 36 47 2092 nan 42.827 -75.544 nan 0
row-b9ax-3swz_dtrq 00000000-0000-0000-6F25-A8B180F4A613 0 1560945568 1560945584 { } 2008 CT Connecticut Medicare Million Hearts None None None Cardiovascular Diseases Stroke Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) Crude Percent (%) 3.6 3.6 3.5 3.7 Age 75 C1 T6 MD601 Crude BOC03 AGE08 09 24 1043 nan 41.5627 -72.6498 nan 0
row-7yp3_x937_urxu 00000000-0000-0000-9924-6526B6266DAE 0 1560945568 1560945584 { } 2008 ME Maine Medicare Million Hearts None None None Cardiovascular Diseases Stroke Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) Crude Percent (%) 6 6 3.7 8.3 Race Other C1 T6 MD601 Crude BOC04 RAC07 23 49 1725 nan 45.2542 -68.985 nan 0
row-ajps-vayv-qtww 00000000-0000-0000-8381-8F41271D61D5 0 1560945568 1560945585 { } 2010 OH Ohio Medicare Million Hearts None None None Cardiovascular Diseases Stroke Prevalence of cerebrovascular disease hospitalizations among all hospitalizations, US Medicare FFS beneficiaries (65 ) Crude Percent (%) 3.7 3.7 3.7 3.8 Age 75 C1 T6 MD601 Crude BOC03 AGE08 39 38 587 nan 40.0602 -82.4043 nan 0
  • Related