Speed up a for-loop converting many JSON files to a Pandas dataframe

I have the following function:

import json
from datetime import datetime
from os.path import exists
from pathlib import Path
from pprint import pprint

import pandas as pd
from tqdm import tqdm


def json_to_pickle(json_path=REVIEWS_JSON_DIR,
                   pickle_path=REVIEWS_PICKLE_DIR,
                   force_update=False):
    '''Create a pickled dataframe from selected JSON files.'''

    current_date = datetime.today().strftime("%Y%m%d")
    filename_safe_path = json_path.replace("/", "_")

    # Get a list of all JSON files from a given directory
    in_paths = Path(json_path).glob("**/*.json")
    in_paths_list = list(in_paths)  # Convert the generator to a simple list

    out_path = (f"{pickle_path}"
                f"/{current_date}"
                f"_{filename_safe_path}"
                f"_reviews.pkl")

    if not exists(out_path) or force_update:
        pprint("Creating a new pickle file from raw JSON files")
        df = pd.DataFrame()
        for path in tqdm(in_paths_list):
            with open(path, "r") as file:
                normalized_json_df = pd.json_normalize(json.load(file))
                df = pd.concat([df, normalized_json_df])
        df.to_pickle(out_path, compression="infer")
    else:
        pprint(f"Using existing pickle file ({out_path})")

    return out_path

Unless a matching pickle file already exists (and force_update is False), the function finds all JSON files in a given directory (including all subdirectories), normalizes each one, concatenates them into a single dataframe, and saves that dataframe to disk as a pickle. It takes 54 minutes to process 240,255 JSON files.

According to tqdm, the for-loop averages 73.51 iterations per second (running on an M1 MacBook Pro with 10 cores), but it seems to get slower over time, presumably as the dataframe grows larger: it starts at around 1,668.44 iterations per second.

All the JSON files have an identical structure, but a couple of fields may be missing. Each file is between 500 bytes and 2 KB in size. Here is the JSON spec from Google's documentation.
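
For context, pd.json_normalize turns each file into a one-row dataframe, and a missing field simply ends up as NaN after concatenation. A tiny illustration (the field names are made up, not taken from the actual spec):

import pandas as pd

file_a = {"reviewId": "1", "starRating": "FIVE", "comment": "Great"}
file_b = {"reviewId": "2", "starRating": "FOUR"}  # "comment" is missing

frames = [pd.json_normalize(file_a), pd.json_normalize(file_b)]
print(pd.concat(frames))  # the missing "comment" becomes NaN in the second row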

What can I do to speed up this for-loop?

Edit:

This is how I changed the for-loop following the selected answer:

data_frames = []
for path in tqdm(in_paths_list):
    with open(path, "r") as file:
        data_frames.append(pd.json_normalize(json.load(file)))
pd.concat(data_frames).to_pickle(out_path, compression="infer")

Now it finishes in 1 minute and 29 seconds. Quite the improvement!
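
Since each file can be parsed independently, I suppose the loading step could also be spread across the 10 cores. A rough, untested sketch of that idea using concurrent.futures (reusing in_paths_list and out_path from the function above):

import json
from concurrent.futures import ProcessPoolExecutor

import pandas as pd
from tqdm import tqdm

def load_json(path):
    '''Parse one JSON file into a single-row dataframe.'''
    with open(path, "r") as file:
        return pd.json_normalize(json.load(file))

# Parse the files in parallel worker processes, then concatenate once.
with ProcessPoolExecutor() as executor:
    data_frames = list(tqdm(executor.map(load_json, in_paths_list, chunksize=256),
                            total=len(in_paths_list)))
pd.concat(data_frames).to_pickle(out_path, compression="infer")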

CodePudding user response:

Instead of loading each file and appending it to an ever-bigger temporary dataframe, load each file into its own dataframe and concatenate them all in a single operation.

The current code loads the N dataframes that correspond to the files, but it also creates N-1 ever-bigger intermediate dataframes containing the exact same data.
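
Because every pd.concat call copies everything accumulated so far, the total work grows roughly quadratically with the number of files, which matches the slowdown described in the question. A small self-contained sketch of the difference (timings are machine-dependent):

import time

import pandas as pd

pieces = [pd.DataFrame({"a": range(10)}) for _ in range(5_000)]

# Pattern 1: grow the dataframe inside the loop (repeated copying).
start = time.perf_counter()
df = pd.DataFrame()
for piece in pieces:
    df = pd.concat([df, piece])
print(f"concat in loop: {time.perf_counter() - start:.2f}s")

# Pattern 2: collect the pieces in a list and concatenate once.
start = time.perf_counter()
df = pd.concat(pieces)
print(f"single concat:  {time.perf_counter() - start:.2f}s")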

I use this code to load all Excel files in a folder into a single dataframe:

root = r"C:\path\to\folder"
matches = Path(root).glob("22*.xls*")

# Wrap the generator in tqdm for a progress bar and skip Excel lock files ("~...").
files = tqdm(pd.read_excel(file, skipfooter=1)
             for file in matches
             if not file.name.startswith("~"))
df = pd.concat(files)

In the question's case you could use:

def load_json(path):
    '''Parse one JSON file into a single-row dataframe.'''
    with open(path, "r") as file:
        return pd.json_normalize(json.load(file))

in_paths = Path(json_path).glob("**/*.json")
files = tqdm(load_json(file) for file in in_paths)
df = pd.concat(files)