Read JSON files in pandas dataframe

I have a large pandas dataframe (17,000 rows) with a file path in each row pointing to a specific JSON file. For each row, I want to read the JSON file and extract its content into a new dataframe.

The dataframe looks something like this:

                                                       
0      /home/user/processed/config1.json
1      /home/user/processed/config2.json
2      /home/user/processed/config3.json
3      /home/user/processed/config4.json
4      /home/user/processed/config5.json
...                                                  ...
16995  /home/user/processed/config16995.json
16996  /home/user/processed/config16996.json
16997  /home/user/processed/config16997.json
16998  /home/user/processed/config16998.json
16999  /home/user/processed/config16999.json

What is the most efficient way to do this?

I believe a simple for-loop might be best suited here?

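import json
import pandas as pd

json_content = []

# iterate over the paths in the first column (iterating df itself yields column names)
for path in df.iloc[:, 0]:
  with open(path) as file:
    json_content.append(json.load(file))

result = pd.DataFrame(json_content)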

CodePudding user response：

A possible solution is the following:

# pip install pandas

import pandas as pd

# convert the column of paths to a list (: selects all rows, 0 the first column)
paths = df.iloc[:, 0].tolist()

all_dfs = []
for path in paths:
    # use a separate name so the original df is not overwritten
    file_df = pd.read_json(path, encoding='utf-8')
    all_dfs.append(file_df)

Each dataframe in all_dfs can be accessed individually by index, e.g. all_dfs[0], all_dfs[1], or in a loop.

If you wish, you can concatenate all_dfs into a single dataframe:

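# concat is a pandas function, not a dataframe method
# axis=1 places the dataframes side by side; use axis=0 to stack them as rows
dfs = pd.concat(all_dfs, axis=1)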
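
As a self-contained sketch of getting one row per file in a single dataframe: the example below writes three hypothetical config files to a temporary directory (the file names, keys, and the helper load_json are made up for illustration), parses each with json.load, and flattens the records with pd.json_normalize. It assumes every file holds a single JSON object; files with a different layout would need different handling.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data: three small config files in a temp directory.
tmp_dir = Path(tempfile.mkdtemp())
sample_paths = []
for i in range(3):
    path = tmp_dir / f"config{i}.json"
    path.write_text(json.dumps({"id": i, "params": {"rate": i * 0.5}}), encoding="utf-8")
    sample_paths.append(str(path))

# Dataframe with one path per row, mirroring the layout in the question.
paths_df = pd.DataFrame({"path": sample_paths})


def load_json(path):
    """Read one JSON file and return its parsed content."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Parse every file, then flatten nested keys into dotted column names.
records = [load_json(p) for p in paths_df["path"]]
combined = pd.json_normalize(records)
print(combined)
```

Here nested keys such as params.rate become their own columns, so each file ends up as one row of combined.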

CodePudding user response：

Generally, I'd try the iterrows() function as a first attempt. An implementation could look like this:

import json
import pandas as pd


json_content = []

# iterrows() yields (index, row) pairs; the path is the first value in the row
for _, row in df.iterrows():
    with open(row.iloc[0]) as file:
        json_content.append(json.load(file))

result = pd.Series(json_content)
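
To make clear what iterrows() hands back, here is a small sketch on a hypothetical two-row dataframe (the column name path and the file names are placeholders, and no files are opened). It shows that each item is an (index, Series) pair, and that iterating the column directly gives the same paths.

```python
import pandas as pd

# Hypothetical dataframe with a single column of paths.
demo = pd.DataFrame({"path": ["a.json", "b.json"]})

# iterrows() yields (index, row) tuples; each row is a pandas Series.
pairs = list(demo.iterrows())
first_index, first_row = pairs[0]
print(type(first_row).__name__, first_row.iloc[0])

# Pull the path out of every row, then compare with reading the column directly.
paths_from_rows = [row.iloc[0] for _, row in demo.iterrows()]
paths_from_column = demo["path"].tolist()
print(paths_from_rows == paths_from_column)
```

Since both approaches yield the same list, looping over the column is usually the simpler choice when only one column is needed.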