I have a directory with roughly 100 excel files each ~50MB in size. Each file has multiple worksheets with inconsistent names.
I would like to concatenate them together.
What I have tried so far in python:
- concatenate and append via pandas (computer runs out swap memory [10G], also tried multiprocessing)
- put the files in a local sqlite3 database (too many columns)
Thanks in advance Lukas
CodePudding user response:
Kourosh's answer is completely appropriate, I would only add for the first file in a directory argument header=True
in the to_csv()
function, and for each subsequent file we can stay with header=False
.
CodePudding user response:
If all worksheets have the same columns:
import pandas as pd
import os
from pathlib import Path
directory = Path('src/data') #the directory you want to use'
for filename in os.listdir(directory):
if filename.endswith(".xlsx"):
pd.read_excel(directory / filename).to_csv('src/Final.csv', header=False, mode='a')
CodePudding user response:
If you want to have all data gathered together in one file you can use the following script:
Put all excel workbooks (i.e. excel files) to be processed into a folder (see variable
paths
).Get the paths of all workbooks in that folder using
glob.glob
.Return all worksheets of each workbook with
read_excel(path, sheet_name=None)
and prepare them for merging.Merge all worksheets with
pd.concat
.Export the final output, e.g., with
to_csv
orto_excel
(please note importantly that there is a row limit in Excel files).import pandas as pd import glob paths = glob.glob(r"C:\excelfiles\*.xlsx") path_save = r"finished.xlsx" df_lst = [pd.read_excel(path, sheet_name=None).values() for path in paths] df_lst = [y.transpose().reset_index().transpose() for x in df_lst for y in x] df_result = pd.concat(df_lst, ignore_index=True) df_result.to_excel(path_save, index=False, header=False)
Example with data
Assume some Excel workbooks with some worksheets.
Worksheet1:
a b c
foo cor wal
bar gra plu
baz ult xyz
qux ply thu
Worksheet2:
u v w x y z
12 92 86 22 80
23 29 74 21
16 10 75 67 61 99
Worksheet3:
I II III IV
1 5 9 1
2 6 0 6
3 7 3
4 8 2 0
Final output (after executing this snippet, i.e. after to_excel
):
a b c
foo cor wal
bar gra plu
baz ult xyz
qux ply thu
u v w x y z
12 92 86 22 80
23 29 74 21
16 10 75 67 61 99
I II III IV
1 5 9 1
2 6 0 6
3 7 3
4 8 2 0