Home > Enterprise >  Join large set of CSV files where the header is the timestamp for the file
Join large set of CSV files where the header is the timestamp for the file

Time:05-06

I have a large set of CSV files. Approx. 15 000 files. And would like to figure out how to join them together as one file for data processing.

Each file is in a simple pattern with timestamp that corresponds to a period of time that represent the data in the each CSV file.

Ex.

file1.csv

2021-07-23 08:00:00
Unit.Device.No03.ErrorCode;11122233
Unit.Device.No04.ErrorCode;0
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0

file2.csv

2021-07-23 08:15:00
Unit.Device.No03.ErrorCode;0
Unit.Device.No04.ErrorCode;44556666
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0

Each file starts with the timestamp. I would like to join all the files in a directory, and transpose the "Unit.Device" to columns. And then use the original header as a timestamp column. For each file add a new row with the corresponding "ErrorCode" to each column.

Like this:

Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode..
2021-07-23 08:00:00;11122233;0;0;0;0....
2021-07-23 08:15:00;0;44556666;0;0;0....

Any simple tools for this, or Python routines?

CodePudding user response:

Try the following approach:

from datetime import datetime
import csv
import glob

data = []
fieldnames = set()

for fn in glob.glob('file*.csv'):
    with open(fn) as f_input:
        csv_input = csv.reader(f_input, delimiter=';')
        timestamp = next(csv_input)[0]
        row = {'Timestamp' : timestamp}
        
        for device, error_code in csv_input:
            row[device] = error_code
            fieldnames.add(device)
            
        data.append(row)

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=['Timestamp', *sorted(fieldnames)], delimiter=';')
    csv_output.writeheader()
    csv_output.writerows(sorted(data, key=lambda x: datetime.strptime(x['Timestamp'], '%Y-%m-%d %H:%M:%S')))

This gives output.csv as:

Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode;Unit.Device.No11.ErrorCode
2021-07-23 08:00:00;11122233;0;0;0
2021-07-23 08:15:00;0;44556666;0;0

CodePudding user response:

Thanks for the reply on my first question here! I will also contribute with a solution for this problem.

I did some read up on Pandas after I found something similar to what I wanted to do. I found that the transform method was very easy to use, and put together this snippet of Python code instead.

import pandas as pd
import os

folder = 'in'

df_out = pd.DataFrame()

for filename in os.scandir(folder):
    if filename.is_file():
        print('Working on file'   filename.path)

        df = pd.read_csv(filename.path, encoding='utf-16', sep=';',header =[0])

        # Transpose data with timestamp header to columns
        df_tranposed = df.T
        df_out = df_out.append(df_tranposed)


df_out.to_csv('output.csv')
  • Related