Concatenate multiple csv files from different folders into one csv file in python

Time:12-17

I am trying to concatenate multiple CSV files (about 30) into one file. The CSV files are located in different folders.

However, I have encountered an error while appending all files together: OSError: Initializing from file failed

Here is my code:

import os
import glob
import pandas as pd
 
path = 'xxx'
target_folders=['Apples', 'Oranges', 'Bananas','Raspberry','Strawberry', 'Blackberry','Gooseberry','Liche']
output ='yyy'
path_list = []
for idx in target_folders:
    lst_of_files = glob.glob(path + idx + '\\*.csv')
    latest_files = max(lst_of_files, key=os.path.getmtime)
    path_list.append(latest_files)
    df_list = [] 
    for file in path_list: 
        df = pd.read_csv(file) 
        df_list.append(df) 
    final_df = df.append(df for df in df_list) 
    combined_csv = pd.concat([pd.read_csv(f) for f in latest_files])

    combined_csv.to_csv(output + "combined_csv.csv", index=False)

    OSError                                   Traceback (most recent call last)
    <ipython-input-126-677d09511b64> in <module>
  1 df_list = []
  2 for file in latest_files:
  ----> 3     df = pd.read_csv(file)
  4     df_list.append(df)
  5 final_df = df.append(df for df in df_list)

    OSError: Initializing from file failed


    

CodePudding user response:

Try to simplify your code:

import pandas as pd
import pathlib

data_dir = 'xxx'
out_dir = 'yyy'

data = []
for filename in pathlib.Path(data_dir).glob('**/*.csv'):
    df = pd.read_csv(filename)
    data.append(df)

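# Concatenate the collected list of frames (data), not the last frame read (df)
df = pd.concat(data, ignore_index=True)
# Use the out_dir variable, not the literal string 'out_dir'
df.to_csv(pathlib.Path(out_dir) / 'combined_csv.csv', index=False)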
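
Note that the snippet above reads every CSV under data_dir, whereas the question keeps only the most recently modified CSV from each target folder. In the question's code, latest_files is a single path string (the result of max), so looping over it with "for f in latest_files" passes one character at a time to pd.read_csv, which may be what triggers the error. A minimal sketch of the per-folder variant follows; the function names latest_csv_per_folder and combine_csvs are made up for illustration, and it assumes all files share compatible columns.

```python
import os
from pathlib import Path

import pandas as pd


def latest_csv_per_folder(base_dir, folders):
    """Return the most recently modified CSV path from each folder."""
    latest = []
    for name in folders:
        candidates = list((Path(base_dir) / name).glob("*.csv"))
        if candidates:  # skip folders without CSV files so max() never sees an empty list
            latest.append(max(candidates, key=os.path.getmtime))
    return latest


def combine_csvs(paths):
    """Read each path (one whole path per read_csv call) and stack the rows."""
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

With the question's variables this could be called as combine_csvs(latest_csv_per_folder(path, target_folders)), and the result written once, after the loop, with to_csv(..., index=False).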

CodePudding user response:

Without seeing your CSV files it's hard to be sure, but I've come across this problem before with unusually formatted CSVs. The CSV parser may be having difficulty determining the structure of the files (separators, etc.).

Try df = pd.read_csv(file, engine = 'python')

From the docs: "The C engine is faster while the python engine is currently more feature-complete."

Try passing the engine = 'python' argument on reading a single CSV file and see if you get a successful read. That way you can narrow down the problem to either file reads or traversing the files.
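
One way to run that check across all the files is sketched below. The helper name check_reads is hypothetical, and it only reports which engine (if any) could read each path; it does not explain why a read failed beyond echoing the exception.

```python
import pandas as pd


def check_reads(paths):
    """Return {path: engine that worked, or an error message if both engines failed}."""
    results = {}
    for path in paths:
        for engine in ("c", "python"):
            try:
                pd.read_csv(path, engine=engine)
            except Exception as exc:
                # Remember the failure; it is overwritten if the next engine succeeds
                results[str(path)] = f"failed: {type(exc).__name__}: {exc}"
            else:
                results[str(path)] = engine
                break
    return results
```

If every real file reports "c" or "python", the reads themselves are fine and the problem is more likely in how the list of paths is built and traversed.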
