Home > Software engineering >  Concatenation of csv files based on filename
Concatenation of csv files based on filename

Time:03-11

I am new to python and would like to get some ideas on concatenating the csv files based on same filename(eg: 1646768170). I am writing a code and quite unsure on how to proceed.

The folder contains these csv files:

a1_1646768170.csv
a2_1646768171.csv
a3_1646768171.csv
a4_1646768171.csv
a5_1646768172.csv
a6_1646768172.csv
a7_1646768173.csv
a8_1646768174.csv
a9_1646768174.csv
a10_1646768174.csv
a11_1646768175.csv
a12_1646768175.csv
a13_1646768176.csv
a1_1646768170.csv

Basically what I am trying to do is to concatenate(pd.concat)those csv files together which have the same filename. For eg a2_1646768171.csv, a3_1646768171.csv, a4_1646768171.csv should be concatenated as they have common name(1646768171). a7_1646768173.csv should be kept as it is.

output should look something like this

a1_1646768170.csv,
xxx_1646768171.csv,
xxx_1646768172.csv,
a7_1646768173.csv,
xxx_1646768174.csv,
xxx_1646768175.csv,
a13_1646768176.csv

Any help is really appreciated. Thanks.

CodePudding user response:

The following approach should work:

  1. Use glob.glob() to first iterate over all of the CSV files in your folder.
  2. Use split() to extract the timestamp part of each filename and use this to build a dictionary of timestamps to lists of files (using a defaultdict(list)). Note: this assumes all filenames have the same format.
  3. Iterate over the dictionary to return each timestamp and matching list of filenames.
  4. Use Pandas to load each matching CSV file into a list of dataframes.
  5. Concatenate all the matching dataframes
  6. Write them to a single CSV file using the timestamp as the filename.

For example:

from collections import defaultdict
import pandas as pd
import glob

by_timestamp = defaultdict(list)   # e.g. {'1646768175' : ['a.csv', 'b.csv']}

for filename in glob.glob('a*_*.csv'):
    timestamp = filename.split('_')[1].split('.')[0]
    by_timestamp[timestamp].append(filename)

for timestamp, filenames in by_timestamp.items():
    dfs = [pd.read_csv(fn) for fn in filenames]
    df = pd.concat(dfs)
    df.to_csv(f'{timestamp}.csv', index=False)

  • Related