Merge CSV files if they have the same column headers, otherwise split them

I have a folder with several CSV files (5k). To work with them, it would be ideal if they all had the same variable names and number of columns, but this is not the case.

As a first step in cleaning them, I would like to create subfolders based on their columns. For example, if two or more CSV files have the same columns and variable names, create a subfolder containing them.

So far I have found how to combine all the files, but I don't know where to add the condition that sorts files with matching columns into subfolders.

import glob
import pandas as pd

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]    

col_combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

CodePudding user response：

To merge all CSV files in a folder that share the same header, the following approach could be used:

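import csv
import glob

csv_files = {}      # (header as tuple) : csv.writer()
open_files = []     # output file handles, closed once all inputs are processed
header_type_count = 1

for filename in glob.glob('*.csv'):
    with open(filename, newline='') as f_input:
        csv_input = csv.reader(f_input)
        header = tuple(next(csv_input))

        try:
            # Known header type: append the data rows to its output file
            csv_files[header].writerows(csv_input)
        except KeyError:
            # New header type: start a new output file for it
            f_output = open(f'header_v{header_type_count:02}.csv', 'w', newline='')
            open_files.append(f_output)
            header_type_count += 1
            csv_output = csv.writer(f_output)
            csv_files[header] = csv_output
            csv_output.writerow(header)
            csv_output.writerows(csv_input)

for f_output in open_files:
    f_output.close()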

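This works by keeping track of every distinct header seen so far, which allows files to be concatenated on the fly. For each new header type found, it opens a new output CSV file (e.g. header_v01.csv). Note that the output files are written to the same folder as the inputs, so move or delete them before running the script again, otherwise glob would pick them up as inputs.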

csv_files maps header types to open csv.writer() objects to allow extra rows to be written.
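
One thing to keep in mind is that a tuple key is an exact match: the same columns in a different order, or with different capitalisation, count as a different header type. If that is too strict when you only want to group files, the key could be normalised first. A minimal sketch, where header_key and its ignore_order / ignore_case flags are hypothetical names (not part of the code above), and stripping surrounding whitespace is an assumption about what you want:

```python
def header_key(header, ignore_order=False, ignore_case=False):
    """Return a hashable key for a CSV header row.

    Hypothetical helper: by default it only strips surrounding
    whitespace, so column order and capitalisation still matter.
    """
    names = [name.strip() for name in header]
    if ignore_case:
        names = [name.lower() for name in names]
    if ignore_order:
        names = sorted(names)
    return tuple(names)


# Same columns, different order: distinct keys by default
print(header_key(['id', 'name']) == header_key(['name', 'id']))
# Treated as the same group once order is ignored
print(header_key(['id', 'name'], ignore_order=True)
      == header_key(['name', 'id'], ignore_order=True))
```

A relaxed key like this is only safe for grouping. If files are merged with ignore_order=True, the data rows would also need their columns reordered, otherwise values end up under the wrong header.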

Because rows are streamed from each input file straight into the matching output file, this approach avoids needing to hold all the data in memory at the same time.
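
If you would rather have the subfolders described in the question than merged files, the same header-as-key idea could be used to move each file instead of copying its rows. This is only a sketch under some assumptions: sort_into_subfolders and the header_vNN folder names are made up for illustration, every header type gets a folder (even if only one file has it), and only the first row of each file is read.

```python
import csv
import shutil
from pathlib import Path


def sort_into_subfolders(folder, prefix="header_v"):
    """Move each CSV in `folder` into a subfolder shared by files with the same header."""
    folder = Path(folder)
    subfolders = {}  # (header as tuple) : destination folder

    # sorted() builds the full list first, so moving files does not disturb the loop
    for path in sorted(folder.glob("*.csv")):
        with open(path, newline="") as f_input:
            # Only the first row is needed; an empty file gives an empty header
            header = tuple(next(csv.reader(f_input), ()))

        if header not in subfolders:
            dest = folder / f"{prefix}{len(subfolders) + 1:02}"
            dest.mkdir(exist_ok=True)
            subfolders[header] = dest

        # The input file is closed by now, so it is safe to move it
        shutil.move(str(path), str(subfolders[header] / path.name))

    return subfolders
```

Calling sort_into_subfolders('.') would process the current folder. It might be worth trying it on a copy of the data first, since the files are moved rather than copied.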