Python: How to read globbed CSV files in batches, and store in a DataFrame

I have a large number of .csv files in a folder. All of the .csv files have the same column names. The code below merges all of the .csv files, but I need to merge the first 10 .csv files into one DataFrame, then files 11 to 20 in the next step, and so on. Solutions 1 and 2 are suitable when the file names are numeric, but in my case the file names do not follow any pattern.

# Merge .csv files in one place
import glob
import os
import pandas as pd  

path = r'D:\Course\Research\Data\2017-21'  
print(path)
all_files = glob.glob(os.path.join(path, "*.csv"))
# on_bad_lines="skip" replaces error_bad_lines=False, which is deprecated since pandas 1.3
df_from_each_file = (pd.read_csv(f, encoding="utf8", on_bad_lines="skip") for f in all_files)
merged_df = pd.concat(df_from_each_file)

CodePudding user response:

Here's a suggestion that uses islice() from the standard-library module itertools to fetch chunks of up to 10 files:

from pathlib import Path
from itertools import islice
import pandas as pd

csv_files = Path(r"D:\Course\Research\Data\2017-21").glob("*.csv")
while True:
    files = list(islice(csv_files, 10))
    if not files:
        break
    dfs = (pd.read_csv(file) for file in files)
    merged_df = pd.concat(dfs, ignore_index=True)
    # Do whatever you want to do with merged_df
    print(merged_df)

(I'm also using the standard library module pathlib because it's more convenient.)
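If you also want a deterministic processing order (the question notes the file names follow no pattern), you can sort the paths before chunking. Here is a minimal sketch, assuming sorting by modification time is acceptable; sorted() returns a list, so it is wrapped in iter() to keep the islice() chunking above working. If your files need the options from the question, pass encoding="utf8" and on_bad_lines="skip" to pd.read_csv inside the loop as well.

from pathlib import Path

# Assumption: process files in modification-time order; swap the key to taste.
paths = sorted(Path(r"D:\Course\Research\Data\2017-21").glob("*.csv"),
               key=lambda p: p.stat().st_mtime)
csv_files = iter(paths)  # islice() consumes this iterator chunk by chunk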

CodePudding user response:

Further to my comment above, here is a simpler solution.

  • All required CSV files are collected by glob. In its current state the list is not sorted, but it can be sorted to suit your requirements
  • The list of files is iterated over in 10-file chunks
  • Each chunk is read and concatenated into the merged DataFrame, dfm
  • Do whatever you like with the DataFrame
  • The to_csv example uses a random 4-byte hex string to ensure uniqueness* across the output files

*Note: Uniqueness is not guaranteed, but it suffices for the 50 sample data files I was using. (A guaranteed-unique naming sketch appears at the end of this answer.)

Sample code:

import os
import pandas as pd
from glob import glob

files = glob(os.path.join('./csv2df', 'file*.csv'))  # 50 CSV files

for i in range(0, len(files), 10):
    # Read and merge the next chunk of (up to) 10 files.
    dfm = pd.concat(pd.read_csv(f) for f in files[i:i+10])
    # Do whatever you want with the merged DataFrame.
    print(dfm.head(10), dfm.shape)
    print('\n')
    # Write to CSV?
    dfm.to_csv(f'./csv2df/merged_{os.urandom(4).hex()}.csv', index=False)

Output:

The following is a sample output from the print statements:

      col1     col2     col3     col4
0   file49   file49   file49   file49
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4
4  data4.1  data4.2  data4.3  data4.4
5  data5.1  data5.2  data5.3  data5.4
0   file30   file30   file30   file30
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4 (60, 4)
    
...
      col1     col2     col3     col4
0   file14   file14   file14   file14
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4
4  data4.1  data4.2  data4.3  data4.4
5  data5.1  data5.2  data5.3  data5.4
0   file42   file42   file42   file42
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4 (60, 4)

CSV file list:

merged_5314ad49.csv
merged_5499929e.csv
merged_5f4e306a.csv
merged_74746bd8.csv
merged_b9def1d6.csv
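
As the footnote above says, the random hex suffix is not guaranteed to be unique. A minimal sketch of an alternative that is, numbering the output files by chunk (the variable name chunk_no is illustrative):

for chunk_no, i in enumerate(range(0, len(files), 10)):
    dfm = pd.concat(pd.read_csv(f) for f in files[i:i+10])
    # The zero-padded chunk number guarantees a unique, ordered file name.
    dfm.to_csv(f'./csv2df/merged_{chunk_no:03d}.csv', index=False)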