I have a large number of .csv files in a folder. All the .csv files have the same column names. The code below merges all of them, but I need to merge the first 10 files into one DataFrame, then files 11 to 20 in the next step, and so on. Solutions 1 and 2 are suitable when the file names are numeric, but in my case the file names don't follow any pattern.
# Merge .csv files in one place
import glob
import os
import pandas as pd
path = r'D:\Course\Research\Data\2017-21'
print(path)
all_files = glob.glob(os.path.join(path, "*.csv"))
df_from_each_file = (pd.read_csv(f, encoding='utf8', on_bad_lines='skip') for f in all_files)  # error_bad_lines was removed in pandas 2.0
merged_df = pd.concat(df_from_each_file)
CodePudding user response:
Here's a suggestion that uses islice() from the standard library module itertools to fetch chunks of up to 10 files:
from pathlib import Path
from itertools import islice
import pandas as pd
csv_files = Path(r"D:\Course\Research\Data\2017-21").glob("*.csv")
while True:
    files = list(islice(csv_files, 10))
    if not files:
        break
    dfs = (pd.read_csv(file) for file in files)
    merged_df = pd.concat(dfs, ignore_index=True)
    # Do whatever you want to do with merged_df
    print(merged_df)
(I'm also using the standard library module pathlib because it's more convenient.)
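To see why the while loop above terminates correctly, note that islice() remembers the underlying iterator's position, so each call continues where the previous one stopped and the final call yields a short (or empty) chunk. A minimal sketch, using dummy file names instead of a real directory:

```python
from itertools import islice

# A generator standing in for Path.glob()'s result (13 dummy names)
names = (f"report_{c}.csv" for c in "abcdefghijklm")

chunks = []
while True:
    chunk = list(islice(names, 5))  # take up to 5 items per pass
    if not chunk:                   # empty chunk means the iterator is exhausted
        break
    chunks.append(chunk)

print([len(c) for c in chunks])  # → [5, 5, 3]
```

The last chunk is simply shorter, so no special-casing is needed for a file count that isn't a multiple of the chunk size.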
CodePudding user response:
Further to my comment above, here is a simpler solution.
- All required CSV files are collected by glob. In its current state, the list is not sorted, but it can be sorted according to your requirements.
- The list of files is iterated over in 10-file chunks.
- Each chunk is read and concatenated into the merged DataFrame: dfm
- Do whatever you like with the DataFrame.
- The to_csv example uses a random 4-byte hex string to ensure uniqueness* over the output files.
*Note: This is not guaranteed uniqueness, but it will suffice with the 50 sample data files I was using.
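Since glob does not guarantee any particular order, sorting the list first makes the chunk contents deterministic. A minimal sketch of the sort-then-chunk step, using hypothetical names in place of the glob result:

```python
# Hypothetical file names standing in for glob's (unordered) result
files = ["file3.csv", "file1.csv", "file4.csv", "file2.csv", "file5.csv"]

files = sorted(files)  # sort however your requirements dictate

# Slice into fixed-size chunks with a stepped range (chunk size 2 here)
chunks = [files[i:i + 2] for i in range(0, len(files), 2)]
print(chunks)
# → [['file1.csv', 'file2.csv'], ['file3.csv', 'file4.csv'], ['file5.csv']]
```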
Sample code:
import os
import pandas as pd
from glob import glob
dfm = pd.DataFrame()
files = glob(os.path.join('./csv2df', 'file*.csv'))  # 50 CSV files
for i in range(0, len(files), 10):
    dfm = pd.concat(pd.read_csv(f) for f in files[i:i + 10])
    # Do whatever you want with the merged DataFrame.
    print(dfm.head(10), dfm.shape)
    print('\n')
    # Write to CSV?
    dfm.to_csv(f'./csv2df/merged_{os.urandom(4).hex()}.csv', index=False)
Output:
The following is a sample output from the print statements:
col1 col2 col3 col4
0 file49 file49 file49 file49
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4
4 data4.1 data4.2 data4.3 data4.4
5 data5.1 data5.2 data5.3 data5.4
0 file30 file30 file30 file30
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4 (60, 4)
...
col1 col2 col3 col4
0 file14 file14 file14 file14
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4
4 data4.1 data4.2 data4.3 data4.4
5 data5.1 data5.2 data5.3 data5.4
0 file42 file42 file42 file42
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4 (60, 4)
CSV file list:
merged_5314ad49.csv
merged_5499929e.csv
merged_5f4e306a.csv
merged_74746bd8.csv
merged_b9def1d6.csv