Join large set of CSV files where the header is the timestamp for the file-CodePudding

I have a large set of CSV files. Approx. 15 000 files. And would like to figure out how to join them together as one file for data processing.

Each file is in a simple pattern with timestamp that corresponds to a period of time that represent the data in the each CSV file.

Ex.

file1.csv

2021-07-23 08:00:00
Unit.Device.No03.ErrorCode;11122233
Unit.Device.No04.ErrorCode;0
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0

file2.csv

2021-07-23 08:15:00
Unit.Device.No03.ErrorCode;0
Unit.Device.No04.ErrorCode;44556666
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0

Each file starts with the timestamp. I would like to join all the files in a directory, and transpose the "Unit.Device" to columns. And then use the original header as a timestamp column. For each file add a new row with the corresponding "ErrorCode" to each column.

Like this:

Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode..
2021-07-23 08:00:00;11122233;0;0;0;0....
2021-07-23 08:15:00;0;44556666;0;0;0....

Any simple tools for this, or Python routines?

CodePudding user response：

Try the following approach:

from datetime import datetime
import csv
import glob

data = []
fieldnames = set()

for fn in glob.glob('file*.csv'):
    with open(fn) as f_input:
        csv_input = csv.reader(f_input, delimiter=';')
        timestamp = next(csv_input)[0]
        row = {'Timestamp' : timestamp}
        
        for device, error_code in csv_input:
            row[device] = error_code
            fieldnames.add(device)
            
        data.append(row)

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=['Timestamp', *sorted(fieldnames)], delimiter=';')
    csv_output.writeheader()
    csv_output.writerows(sorted(data, key=lambda x: datetime.strptime(x['Timestamp'], '%Y-%m-%d %H:%M:%S')))

This gives output.csv as:

Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode;Unit.Device.No11.ErrorCode
2021-07-23 08:00:00;11122233;0;0;0
2021-07-23 08:15:00;0;44556666;0;0

CodePudding user response：

Thanks for the reply on my first question here! I will also contribute with a solution for this problem.

I did some read up on Pandas after I found something similar to what I wanted to do. I found that the transform method was very easy to use, and put together this snippet of Python code instead.

import pandas as pd
import os

folder = 'in'

df_out = pd.DataFrame()

for filename in os.scandir(folder):
    if filename.is_file():
        print('Working on file'   filename.path)

        df = pd.read_csv(filename.path, encoding='utf-16', sep=';',header =[0])

        # Transpose data with timestamp header to columns
        df_tranposed = df.T
        df_out = df_out.append(df_tranposed)


df_out.to_csv('output.csv')