I have a large number of .csv files in a folder. All the .csv files have the same column names. The code below merges all of them, but I need to merge the first 10 files into one DataFrame, then files 11 to 20 in the next step, and so on. Solutions 1 and 2 are suitable when the file names are numeric, but in my case the file names don't follow any pattern.
# Merge .csv files in one place
import glob
import os
import pandas as pd
path = r'D:\Course\Research\Data\2017-21'
print(path)
all_files = glob.glob(os.path.join(path, "*.csv"))
df_from_each_file = (pd.read_csv(f, encoding='utf8', on_bad_lines='skip') for f in all_files)  # error_bad_lines was removed in pandas 2.0
merged_df = pd.concat(df_from_each_file)
CodePudding user response:
Here's a suggestion that uses islice() from the standard library module itertools to fetch chunks of up to 10 files:
from pathlib import Path
from itertools import islice
import pandas as pd
csv_files = Path(r"D:\Course\Research\Data\2017-21").glob("*.csv")
while True:
    files = list(islice(csv_files, 10))
    if not files:
        break
    dfs = (pd.read_csv(file) for file in files)
    merged_df = pd.concat(dfs, ignore_index=True)
    # Do whatever you want to do with merged_df
    print(merged_df)
(I'm also using the standard library module pathlib because it's more convenient.)
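To see why the while loop above terminates correctly, note that islice() remembers the underlying iterator's position, so each call continues where the previous one stopped and the final call yields a short (or empty) chunk. A minimal sketch, using dummy file names instead of a real directory:

```python
from itertools import islice

# A generator standing in for Path.glob()'s result (13 dummy names)
names = (f"report_{c}.csv" for c in "abcdefghijklm")

chunks = []
while True:
    chunk = list(islice(names, 5))  # take up to 5 items per pass
    if not chunk:                   # empty chunk means the iterator is exhausted
        break
    chunks.append(chunk)

print([len(c) for c in chunks])  # → [5, 5, 3]
```

The last chunk is simply shorter, so no special-casing is needed for a file count that isn't a multiple of the chunk size.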
CodePudding user response:
Further to my comment above, here is a simpler solution.
- All required CSV files are collected by glob. In its current state, the list is not sorted, but it can be sorted according to your requirements.
- The list of files is iterated over in 10-file chunks.
- Each chunk is read and concatenated into the merged DataFrame: dfm
- Do whatever you like with the DataFrame.
- The to_csv example uses a random 4-byte hex string to ensure uniqueness* over the output files.
*Note: This is not guaranteed uniqueness, but it will suffice with the 50 sample data files I was using.
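Since glob does not guarantee any particular order, sorting the list first makes the chunk contents deterministic. A minimal sketch of the sort-then-chunk step, using hypothetical names in place of the glob result:

```python
# Hypothetical file names standing in for glob's (unordered) result
files = ["file3.csv", "file1.csv", "file4.csv", "file2.csv", "file5.csv"]

files = sorted(files)  # sort however your requirements dictate

# Slice into fixed-size chunks with a stepped range (chunk size 2 here)
chunks = [files[i:i + 2] for i in range(0, len(files), 2)]
print(chunks)
# → [['file1.csv', 'file2.csv'], ['file3.csv', 'file4.csv'], ['file5.csv']]
```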
Sample code:
import os
import pandas as pd
from glob import glob
dfm = pd.DataFrame()
files = glob(os.path.join('./csv2df', 'file*.csv'))  # 50 CSV files
for i in range(0, len(files), 10):
    dfm = pd.concat(pd.read_csv(f) for f in files[i:i + 10])
    # Do whatever you want with the merged DataFrame.
    print(dfm.head(10), dfm.shape)
    print('\n')
    # Write to CSV?
    dfm.to_csv(f'./csv2df/merged_{os.urandom(4).hex()}.csv', index=False)
Output:
The following is a sample output from the print statements:
col1 col2 col3 col4
0 file49 file49 file49 file49
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4
4 data4.1 data4.2 data4.3 data4.4
5 data5.1 data5.2 data5.3 data5.4
0 file30 file30 file30 file30
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4 (60, 4)
...
col1 col2 col3 col4
0 file14 file14 file14 file14
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4
4 data4.1 data4.2 data4.3 data4.4
5 data5.1 data5.2 data5.3 data5.4
0 file42 file42 file42 file42
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4 (60, 4)
CSV file list:
merged_5314ad49.csv
merged_5499929e.csv
merged_5f4e306a.csv
merged_74746bd8.csv
merged_b9def1d6.csv