Concatenation of csv files based on filename

I am new to Python and would like some ideas on concatenating CSV files whose filenames share the same numeric suffix (e.g. 1646768170). I have started writing the code but am unsure how to proceed.

The folder contains these csv files:

a1_1646768170.csv
a2_1646768171.csv
a3_1646768171.csv
a4_1646768171.csv
a5_1646768172.csv
a6_1646768172.csv
a7_1646768173.csv
a8_1646768174.csv
a9_1646768174.csv
a10_1646768174.csv
a11_1646768175.csv
a12_1646768175.csv
a13_1646768176.csv

Basically, what I am trying to do is concatenate (with pd.concat) the CSV files that share the same numeric suffix in their filename. For example, a2_1646768171.csv, a3_1646768171.csv and a4_1646768171.csv should be concatenated because they share the suffix 1646768171, while a7_1646768173.csv should be kept as it is.

The output should look something like this:

a1_1646768170.csv,
xxx_1646768171.csv,
xxx_1646768172.csv,
a7_1646768173.csv,
xxx_1646768174.csv,
xxx_1646768175.csv,
a13_1646768176.csv

Any help is really appreciated. Thanks.

CodePudding user response：

The following approach should work:

Use glob.glob() to iterate over all of the CSV files in your folder.
Use split() to extract the timestamp part of each filename and use it to build a dictionary mapping each timestamp to a list of files (a defaultdict(list) is convenient here). Note: this assumes all filenames have the same format.
Iterate over the dictionary to get each timestamp and its matching list of filenames.
Use Pandas to load each matching CSV file into a list of dataframes.
Concatenate all the matching dataframes.
Write the result to a single CSV file using the timestamp as the filename.
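
To make the grouping step concrete, here is a minimal sketch that runs it on a few of the filenames from the question. It uses a hard-coded list rather than glob.glob(), so no files need to exist; the shortened list is only an illustration.

```python
from collections import defaultdict

# A few filenames copied from the question (hard-coded for illustration)
filenames = [
    'a1_1646768170.csv',
    'a2_1646768171.csv',
    'a3_1646768171.csv',
    'a4_1646768171.csv',
    'a7_1646768173.csv',
]

by_timestamp = defaultdict(list)
for filename in filenames:
    # 'a2_1646768171.csv' -> '1646768171.csv' -> '1646768171'
    timestamp = filename.split('_')[1].split('.')[0]
    by_timestamp[timestamp].append(filename)

for timestamp, group in by_timestamp.items():
    print(timestamp, group)
# 1646768170 ['a1_1646768170.csv']
# 1646768171 ['a2_1646768171.csv', 'a3_1646768171.csv', 'a4_1646768171.csv']
# 1646768173 ['a7_1646768173.csv']
```

Each dictionary value is the list of files that would be loaded and concatenated together.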

For example:

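from collections import defaultdict
import glob

import pandas as pd

# Map each timestamp to the files that share it,
# e.g. {'1646768175': ['a11_1646768175.csv', 'a12_1646768175.csv']}
by_timestamp = defaultdict(list)

# sorted() gives a repeatable file order; glob.glob() alone returns files in arbitrary order
for filename in sorted(glob.glob('a*_*.csv')):
    # 'a11_1646768175.csv' -> '1646768175.csv' -> '1646768175'
    timestamp = filename.split('_')[1].split('.')[0]
    by_timestamp[timestamp].append(filename)

for timestamp, filenames in by_timestamp.items():
    # Load every file in the group, stack the rows, and write one CSV per timestamp
    dfs = [pd.read_csv(fn) for fn in filenames]
    df = pd.concat(dfs)
    df.to_csv(f'{timestamp}.csv', index=False)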
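
The code above writes every group to <timestamp>.csv, whereas the question asks for lone files such as a7_1646768173.csv to keep their names. One possible way to get closer to that is to decide the output name per group first. The sketch below does only that planning step, using the standard library; plan_outputs and the 'merged' prefix are hypothetical names chosen for illustration (the question's "xxx" is a placeholder), and the filename format is assumed to match the question's.

```python
from collections import defaultdict


def plan_outputs(filenames, merged_prefix='merged'):
    """Map each output filename to the input files that feed it.

    plan_outputs and merged_prefix are hypothetical names for this sketch.
    """
    by_timestamp = defaultdict(list)
    for filename in filenames:
        # 'a2_1646768171.csv' -> '1646768171.csv' -> '1646768171'
        timestamp = filename.split('_')[1].split('.')[0]
        by_timestamp[timestamp].append(filename)

    plan = {}
    for timestamp, group in by_timestamp.items():
        if len(group) == 1:
            # A lone file keeps its original name; there is nothing to concatenate
            plan[group[0]] = group
        else:
            plan[f'{merged_prefix}_{timestamp}.csv'] = group
    return plan


plan = plan_outputs([
    'a1_1646768170.csv',
    'a2_1646768171.csv',
    'a3_1646768171.csv',
    'a4_1646768171.csv',
    'a7_1646768173.csv',
])
for output_name, inputs in plan.items():
    print(output_name, '<-', inputs)
```

Entries with more than one input could then be loaded with pd.read_csv, combined with pd.concat and written with to_csv(output_name, index=False), as in the answer's loop; single-file entries can simply be left untouched.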