Normalize array of json records into dataframe

I want to create a dataframe from the owid covid19 JSON data located here. The JSON has an array of daily records in the data column; this, along with the country index, is what I'm trying to turn into a dataframe.

{"AFG":{"continent":"Asia","location":"Afghanistan","population":39835428.0,"population_density":54.422,"median_age":18.6,"aged_65_older":2.581,"aged_70_older":1.337,"gdp_per_capita":1803.987,"cardiovasc_death_rate":597.029,"diabetes_prevalence":9.59,"handwashing_facilities":37.746,"hospital_beds_per_thousand":0.5,"life_expectancy":64.83,"human_development_index":0.511,"data":[{"date":"2020-02-24","total_cases":5.0,"new_cases":5.0,"total_cases_per_million":0.126,"new_cases_per_million":0.126,"stringency_index":8.33},{"date":"2020-02-25","total_cases":5.0,"new_cases":0.0,"total_cases_per_million":0.126,"new_cases_per_million":0.0,"stringency_index":8.33},

So far I've been loading the file straight into a dataframe

df = pd.read_json('owid-covid-data.json', orient='index')

and then normalizing the array as such

data = pd.concat([pd.json_normalize(records) for records in df['data']])

This works fine, apart from dropping the index and thus leaving no identifier to join the records back to the static per-country values.
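
To make the issue concrete, here is a minimal toy example (made-up records, not the real data) showing that the country code is gone after the concat:

import pandas as pd

# two countries, each with a 'data' column holding a list of daily records
toy = pd.DataFrame(
    {'data': [[{'date': '2020-02-24', 'new_cases': 5.0}],
              [{'date': '2020-03-09', 'new_cases': 2.0}]]},
    index=['AFG', 'ALB'])

flat = pd.concat([pd.json_normalize(records) for records in toy['data']])
print(flat.index)  # Index([0, 0]) -- no trace of 'AFG' or 'ALB'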

I also imagine there's a more efficient way of normalizing than what I've used.

Any help greatly appreciated!

CodePudding user response:

It's not the most efficient way, but it works:

new_df = pd.DataFrame()
for index, row in df.iterrows():
    # flatten this country's list of daily records into its own DataFrame
    tmp = pd.json_normalize(row['data'])
    # keep the country code so the records can be joined back to the static values
    tmp['country_code'] = index
    new_df = pd.concat([new_df, tmp])

EDIT:

I found a more efficient way, by normalizing all the JSONs at once:

country_codes = []
datas = []
for index, data in zip(df.index, df['data']):
    # collect every daily record, and repeat the country code once per record
    datas.extend(data)
    country_codes.extend(len(data) * [index])

new_df = pd.DataFrame(datas)
new_df['country_code'] = country_codes

This improved the timing from 9.38 s ± 856 ms per loop to 1.37 s ± 12 ms per loop.
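
If you then need the static per-country columns back as well, one possible follow-up (just a sketch, assuming the new_df and df from above) is to join on the country code:

# static columns are repeated once per daily record of that country
result = df.drop(columns='data').join(new_df.set_index('country_code'))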

CodePudding user response:

df = pd.read_json("https://covid.ourworldindata.org/data/owid-covid-data.json", orient='index')

# explode the record lists into new rows, convert to dict,
# use it to create a new DataFrame, and transpose it
data = pd.DataFrame(df['data'].explode().to_dict()).T

df = df.drop(columns='data').join(data)

Performance

Ignoring the data reading

>>> %%timeit
... data = pd.DataFrame(df['data'].explode().to_dict()).T
... df.drop(columns='data').join(data)

84.4 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

EDIT - DISCLAIMER

The above solution is actually wrong: during the to_dict conversion many records are lost, because the Series index contains lots of repeated country-code labels and the keys of a dictionary must be unique, so only the last record for each country survives.
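
A tiny illustration of the key collapse (made-up values, not the covid data):

import pandas as pd

s = pd.Series([{'date': '2020-02-24'}, {'date': '2020-02-25'}], index=['AFG', 'AFG'])
print(s.to_dict())  # {'AFG': {'date': '2020-02-25'}} -- the first record is silently dropped

To solve this, we first need to reset the index to make sure it is unique; only after the new DataFrame is created do we restore the original index: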

data = df['data'].explode()
data_df = pd.DataFrame(data.reset_index(drop=True).to_dict()).T
data_df.index = data.index

df = df.drop(columns='data').join(data_df)

This is much slower than the previous solution because there are actually 127314 records, while the previous solution only produced 233 rows (one per unique country code). Even if we ignore the join part, as Bruno's solution does, it's still much slower than Bruno's solution:

>>> %%timeit 
... data = df['data'].explode()
... new_df = pd.DataFrame(data.reset_index(drop=True).to_dict()).T
... new_df.index = data.index

17.6 s ± 972 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Bruno's solution 
>>> %%timeit
... country_codes = []
... datas = []
... for index, data in zip(df.index, df['data']):
...     datas.extend(data)
...     country_codes.extend(len(data)*[index])
...     
... new_df = pd.DataFrame(datas)
... new_df['country_code'] = country_codes

1.86 s ± 32.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

EDIT 2 - There is a better way...

I found a much better and simpler solution; I was definitely overcomplicating things. It's basically the same idea as Bruno's solution:

data = df['data'].explode()
data_df = pd.DataFrame(data.tolist(), index=data.index)

df = df.drop(columns='data').join(data_df)

It's as fast as Bruno's solution:

>>> %%timeit 
... data = df['data'].explode()
... pd.DataFrame(data.tolist(), index=data.index)

1.87 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
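
Putting the final approach together as a self-contained script (the URL is the one from the question; pulling the date into the index at the end is just an optional extra):

import pandas as pd

df = pd.read_json("https://covid.ourworldindata.org/data/owid-covid-data.json",
                  orient='index')

# one exploded row per daily record, still indexed by country code
daily = df['data'].explode()
daily_df = pd.DataFrame(daily.tolist(), index=daily.index)

# static per-country columns repeated alongside each daily record
result = df.drop(columns='data').join(daily_df)

# optional: a (country, date) MultiIndex
result = result.set_index('date', append=True)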