I am trying to read data from multiple URLs, convert each JSON dataset to a dataframe, and save each dataframe in tabular format, like CSV. I am testing this code.
import requests
url = 'https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json'
r = requests.get(url)
data = r.json()
url = 'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json'
r = requests.get(url)
data = r.json()
url = 'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json'
r = requests.get(url)
data = r.json()
url = 'https://www.kaleidahealth.org/general-information/330005_Kaleida-Health_StandardCharges.json'
r = requests.get(url)
data = r.json()
url = 'https://www.mskcc.org/teaser/standard-charges-nyc.json'
r = requests.get(url)
data = r.json()
That code seems to read each URL fine. I guess I'm stuck with how to standardize the process of converting multiple JSON data sources into dataframes, and save each dataframe as a CSV. I tested this code.
import pandas as pd
import requests
import json
url = 'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json'
r = requests.get(url)
data = r.json()
df = pd.json_normalize(data)
df.to_csv(r'C:\Users\\ryans\\Desktop\\northwell.csv')
url = 'https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json'
r = requests.get(url)
data = r.json()
df = pd.json_normalize(data)
df.to_csv(r'C:\Users\\ryans\\Desktop\\chsli.csv')
That seems to save data in two CSVs and each one has many, many of columns and just a few rows of data. I'm not sure why this happens. Somehow, it seems like pd.json_normalize is NOT converting the JSON into a tabular shape. Any thoughts?
Also, I'd like to parse the URL to include it in the name of the CSV that is saved. So, this 'https://www.northwell.edu/' becomes this 'C:\Users\ryans\Desktop\northwell.csv' and this 'https://www.chsli.org/' becomes this 'C:\Users\ryans\Desktop\chsli.csv'.
CodePudding user response:
For the JSON decoding :
The problem is that each url has its own data format
For example with "https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json"
→ the json data is inside the data field.
import requests
import json
import pandas as pd
url = 'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json'
r = requests.get(url)
data = r.json()
data = pd.DataFrame(data['data'], columns=data['columns'], index=data['index'])
For the URL parsing :
urls = ['https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json',
'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json',
'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json',
'https://www.kaleidahealth.org/general-information/330005_Kaleida-Health_StandardCharges.json',
'https://www.mskcc.org/teaser/standard-charges-nyc.json']
for u in urls:
print('C:\\Users\\ryans\\Desktop\\' u.split('.')[1] '.csv')
output :
C:\Users\ryans\Desktop\chsli.csv
C:\Users\ryans\Desktop\northwell.csv
C:\Users\ryans\Desktop\montefiorehealthsystem.csv
C:\Users\ryans\Desktop\kaleidahealth.csv
C:\Users\ryans\Desktop\mskcc.csv