How can we loop through URLs, all of which consist of JSON data, convert each to a dataframe, and sa-CodePudding

I am trying to read data from multiple URLs, convert each JSON dataset to a dataframe, and save each dataframe in tabular format, like CSV. I am testing this code.

import requests

url = 'https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json'
r = requests.get(url)
data = r.json()


url = 'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json'
r = requests.get(url)
data = r.json()


url = 'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json'
r = requests.get(url)
data = r.json()


url = 'https://www.kaleidahealth.org/general-information/330005_Kaleida-Health_StandardCharges.json'
r = requests.get(url)
data = r.json()


url = 'https://www.mskcc.org/teaser/standard-charges-nyc.json'
r = requests.get(url)
data = r.json()

That code seems to read each URL fine. I guess I'm stuck with how to standardize the process of converting multiple JSON data sources into dataframes, and save each dataframe as a CSV. I tested this code.

import pandas as pd 
import requests
import json

url = 'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json'
r = requests.get(url)
data = r.json()
df = pd.json_normalize(data)
df.to_csv(r'C:\Users\\ryans\\Desktop\\northwell.csv')


url = 'https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json'
r = requests.get(url)
data = r.json()
df = pd.json_normalize(data)
df.to_csv(r'C:\Users\\ryans\\Desktop\\chsli.csv')

That seems to save data in two CSVs and each one has many, many of columns and just a few rows of data. I'm not sure why this happens. Somehow, it seems like pd.json_normalize is NOT converting the JSON into a tabular shape. Any thoughts?
Also, I'd like to parse the URL to include it in the name of the CSV that is saved. So, this 'https://www.northwell.edu/' becomes this 'C:\Users\ryans\Desktop\northwell.csv' and this 'https://www.chsli.org/' becomes this 'C:\Users\ryans\Desktop\chsli.csv'.

CodePudding user response：

For the JSON decoding :

The problem is that each url has its own data format

For example with "https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json" → the json data is inside the data field.

import requests
import json
import pandas as pd

url = 'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json'
r = requests.get(url)
data = r.json()
data = pd.DataFrame(data['data'], columns=data['columns'], index=data['index'])

For the URL parsing :

urls = ['https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json',
 'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json',
 'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json',
 'https://www.kaleidahealth.org/general-information/330005_Kaleida-Health_StandardCharges.json',
 'https://www.mskcc.org/teaser/standard-charges-nyc.json']

for u in urls:
    print('C:\\Users\\ryans\\Desktop\\'   u.split('.')[1]   '.csv')

output :

C:\Users\ryans\Desktop\chsli.csv
C:\Users\ryans\Desktop\northwell.csv
C:\Users\ryans\Desktop\montefiorehealthsystem.csv
C:\Users\ryans\Desktop\kaleidahealth.csv
C:\Users\ryans\Desktop\mskcc.csv