Home > Mobile >  Efficient way to concatenate excel files
Efficient way to concatenate excel files

Time:08-27

I have a directory with roughly 100 excel files each ~50MB in size. Each file has multiple worksheets with inconsistent names.

I would like to concatenate them together.

What I have tried so far in python:

  • concatenate and append via pandas (computer runs out swap memory [10G], also tried multiprocessing)
  • put the files in a local sqlite3 database (too many columns)

Thanks in advance Lukas

CodePudding user response:

Kourosh's answer is completely appropriate, I would only add for the first file in a directory argument header=True in the to_csv() function, and for each subsequent file we can stay with header=False.

CodePudding user response:

If all worksheets have the same columns:

import pandas as pd
import os
from pathlib import Path
    
    
directory = Path('src/data') #the directory you want to use'
    
for filename in os.listdir(directory):
    if filename.endswith(".xlsx"):
        pd.read_excel(directory / filename).to_csv('src/Final.csv', header=False, mode='a')

CodePudding user response:

If you want to have all data gathered together in one file you can use the following script:

  1. Put all excel workbooks (i.e. excel files) to be processed into a folder (see variable paths).

  2. Get the paths of all workbooks in that folder using glob.glob.

  3. Return all worksheets of each workbook with read_excel(path, sheet_name=None) and prepare them for merging.

  4. Merge all worksheets with pd.concat.

  5. Export the final output, e.g., with to_csv or to_excel (please note importantly that there is a row limit in Excel files).

    import pandas as pd
    import glob
    
    paths = glob.glob(r"C:\excelfiles\*.xlsx")
    path_save = r"finished.xlsx"
    
    df_lst = [pd.read_excel(path, sheet_name=None).values() for path in paths]
    df_lst = [y.transpose().reset_index().transpose() for x in df_lst for y in x]
    df_result = pd.concat(df_lst, ignore_index=True)
    df_result.to_excel(path_save, index=False, header=False)
    

Example with data
Assume some Excel workbooks with some worksheets.

Worksheet1:

a   b   c
foo cor wal
bar gra plu
baz ult xyz
qux ply thu

Worksheet2:

u   v   w   x   y   z
12  92  86  22      80
23  29      74      21
16  10  75  67  61  99

Worksheet3:

I   II  III IV
1   5   9   1
2   6   0   6
3   7       3
4   8   2   0

Final output (after executing this snippet, i.e. after to_excel):

a   b   c
foo cor wal
bar gra plu
baz ult xyz
qux ply thu
u   v   w   x   y   z
12  92  86  22      80
23  29      74      21
16  10  75  67  61  99
I   II  III IV
1   5   9   1
2   6   0   6
3   7       3
4   8   2   0
  • Related