I am new to python and would like to get some ideas on concatenating the csv files based on same filename(eg: 1646768170). I am writing a code and quite unsure on how to proceed.
The folder contains these csv files:
a1_1646768170.csv
a2_1646768171.csv
a3_1646768171.csv
a4_1646768171.csv
a5_1646768172.csv
a6_1646768172.csv
a7_1646768173.csv
a8_1646768174.csv
a9_1646768174.csv
a10_1646768174.csv
a11_1646768175.csv
a12_1646768175.csv
a13_1646768176.csv
a1_1646768170.csv
Basically what I am trying to do is to concatenate(pd.concat)those csv files together which have the same filename. For eg a2_1646768171.csv, a3_1646768171.csv, a4_1646768171.csv should be concatenated as they have common name(1646768171). a7_1646768173.csv should be kept as it is.
output should look something like this
a1_1646768170.csv,
xxx_1646768171.csv,
xxx_1646768172.csv,
a7_1646768173.csv,
xxx_1646768174.csv,
xxx_1646768175.csv,
a13_1646768176.csv
Any help is really appreciated. Thanks.
CodePudding user response:
The following approach should work:
- Use
glob.glob()
to first iterate over all of the CSV files in your folder. - Use
split()
to extract the timestamp part of each filename and use this to build a dictionary of timestamps to lists of files (using adefaultdict(list)
). Note: this assumes all filenames have the same format. - Iterate over the dictionary to return each timestamp and matching list of filenames.
- Use Pandas to load each matching CSV file into a list of dataframes.
- Concatenate all the matching dataframes
- Write them to a single CSV file using the timestamp as the filename.
For example:
from collections import defaultdict
import pandas as pd
import glob
by_timestamp = defaultdict(list) # e.g. {'1646768175' : ['a.csv', 'b.csv']}
for filename in glob.glob('a*_*.csv'):
timestamp = filename.split('_')[1].split('.')[0]
by_timestamp[timestamp].append(filename)
for timestamp, filenames in by_timestamp.items():
dfs = [pd.read_csv(fn) for fn in filenames]
df = pd.concat(dfs)
df.to_csv(f'{timestamp}.csv', index=False)