Merge 2 or more csv files with time overlap data

Time:04-27

How do I merge 2 or more csv files whose time ranges overlap? For example,

data1 is

Time u v w
0.24001821                          0        0.009301949        0
0.6400364                           0        0.009311552        0
0.84005458                          0        0.0093211568        0
0.94034343                          0        0.0094739951        0

data2 is

Time u v w
0.74041502                          0        0.0095119512        0
0.84043291                          0        0.0095214359        0
0.94045075                          0        0.0095309047        0
1.2404686                           0        0.0095403752        0

What I want is:

Time u v w
0.24001821                          0        0.009301949        0
0.6400364                           0        0.009311552        0
0.74041502                          0        0.0095119512        0
0.84043291                          0        0.0095214359        0
0.94045075                          0        0.0095309047        0
1.2404686                           0        0.0095403752        0

So the last few rows of data from the 1st csv file are deleted and the 2nd csv file is merged in, so that the time sequence keeps increasing.

How can that be done? Thanks.

CodePudding user response:

If both files are already individually ordered by time, a simple loop that merges them is enough:

# cells are assumed to be comma-separated; change if required
# (None splits on any whitespace, as in the question's sample data)
delimiter = ','
# open both files and read all lines
with open('data1.csv', 'r') as f1:
    f1_lines = f1.readlines()
with open('data2.csv', 'r') as f2:
    f2_lines = f2.readlines()

# extract header
output_lines = [f1_lines[0]]

# start scanning from line 2 of both files (line 1 is the header)
f1_index = 1
f2_index = 1
while True:
    # all of data1 is processed, append the remaining lines from data2
    if f1_index >= len(f1_lines):
        output_lines += f2_lines[f2_index:]
        break
    # all of data2 is processed, append the remaining lines from data1
    if f2_index >= len(f2_lines):
        output_lines += f1_lines[f1_index:]
        break
    f1_line_time = float(f1_lines[f1_index].split(delimiter)[0]) # get the time cell of data1
    f2_line_time = float(f2_lines[f2_index].split(delimiter)[0]) # get the time cell of data2
    if f1_line_time < f2_line_time:
        output_lines.append(f1_lines[f1_index])
        f1_index += 1
    elif f1_line_time == f2_line_time:
        # if they are equal in time, keep the data1 line and skip the data2 line
        output_lines.append(f1_lines[f1_index])
        f1_index += 1
        f2_index += 1
    else:
        output_lines.append(f2_lines[f2_index])
        f2_index += 1
with open('out.csv', 'w') as f_output:
    f_output.write(''.join(output_lines))

CodePudding user response:

Another option:

import csv

delimiter = " "
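with open("data1.csv", "r", newline="") as fin1,\
     open("data2.csv", "r", newline="") as fin2,\
     open("data.csv", "w", newline="") as fout:
    # skipinitialspace tolerates runs of several spaces between columns
    reader1 = csv.reader(fin1, delimiter=delimiter, skipinitialspace=True)
    reader2 = csv.reader(fin2, delimiter=delimiter, skipinitialspace=True)
    writer = csv.writer(fout, delimiter=delimiter)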
    next(reader2)
    first_row = next(reader2)
    start2 = float(first_row[0])
    writer.writerow(next(reader1))
    for row in reader1:
        if float(row[0]) < start2:
            writer.writerow(row)
        else:
            break
    writer.writerow(first_row)
    writer.writerows(reader2)

This assumes that each file is already ordered by time:

  • First take the first data row of data2.csv and convert its first entry into a float start2.
  • With that in mind write all rows from data1.csv with a time less than start2 into the new file data.csv, and break out of the loop once the condition isn't met anymore.
  • Then write the already extracted first data row from data2.csv to data.csv, and afterwards write the rest of data2.csv to data.csv.

Given data1.csv:

Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.84005458 0 0.0093211568 0
0.94034343 0 0.0094739951 0

and data2.csv:

Time u v w
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0

the resulting data.csv is:

Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
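
Since the question asks about 2 or more files, here is a minimal sketch of how the same truncate-then-append idea might be extended to a list of files. The function name merge_overlapping and the file data3.csv are hypothetical, not part of the answer above. It assumes the files are passed in chronological order, that each file is sorted by time, and that everything fits in memory.

```python
import csv

def merge_overlapping(paths, out_path, delimiter=" "):
    """Concatenate time-ordered files, dropping the rows of each file
    that reach into the time range of the next one (hypothetical helper)."""
    tables = []
    for path in paths:
        with open(path, newline="") as fin:
            reader = csv.reader(fin, delimiter=delimiter, skipinitialspace=True)
            header = next(reader)
            rows = [row for row in reader if row]  # skip blank lines
            tables.append((header, rows))

    with open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter=delimiter)
        writer.writerow(tables[0][0])  # header of the first file
        for i, (_, rows) in enumerate(tables):
            # Cutoff is the first time stamp of the next file;
            # the last file (or one followed by an empty file) is kept whole.
            if i + 1 < len(tables) and tables[i + 1][1]:
                cutoff = float(tables[i + 1][1][0][0])
            else:
                cutoff = float("inf")
            writer.writerows(row for row in rows if float(row[0]) < cutoff)

# Example call (data3.csv is a made-up third file):
# merge_overlapping(["data1.csv", "data2.csv", "data3.csv"], "data.csv")
```

Each file is only compared with its immediate successor, so a file that overlaps one further down the list would need extra handling.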

CodePudding user response:

Python has an excellent built-in library function to help with this: heapq.merge().

Assuming your data is space delimited, you could use this as follows:

from heapq import merge
import csv

with open('data1.csv') as f_data1, open('data2.csv') as f_data2, open('output.csv', 'w', newline='') as f_output:
    csv_data1 = csv.reader(f_data1, delimiter=' ', skipinitialspace=True)
    csv_data2 = csv.reader(f_data2, delimiter=' ', skipinitialspace=True)
    csv_output = csv.writer(f_output, delimiter=' ')
    
    header1 = next(csv_data1)
    header2 = next(csv_data2)
    csv_output.writerow(header1)
    
    for row in merge(csv_data1, csv_data2, key=lambda x: float(x[0])):
        csv_output.writerow(row)

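This would produce the following CSV output. Note that the rows from the overlapping time range of both files are all kept and interleaved by time, rather than the tail of data1 being dropped: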

Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84005458 0 0.0093211568 0
0.84043291 0 0.0095214359 0
0.94034343 0 0.0094739951 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
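
heapq.merge() accepts any number of sorted inputs, so the same call should also cover the "2 or more" case. The following is a small in-memory sketch rather than a full file-based solution. The wrapper merge_by_time and the third input data3 are made up for illustration. The key argument requires Python 3.5 or later.

```python
from heapq import merge

def merge_by_time(*tables):
    """Interleave any number of time-sorted row lists by their first column."""
    # merge() is lazy and assumes every input is already sorted by the key
    return list(merge(*tables, key=lambda row: float(row[0])))

# Two rows each from the question's data1 and data2
data1 = [["0.6400364", "0", "0.009311552", "0"],
         ["0.84005458", "0", "0.0093211568", "0"]]
data2 = [["0.74041502", "0", "0.0095119512", "0"],
         ["0.84043291", "0", "0.0095214359", "0"]]
# Hypothetical third input, invented for this example
data3 = [["0.8", "0", "0.0", "0"]]

merged = merge_by_time(data1, data2, data3)
times = [row[0] for row in merged]
```

With files, the csv.reader objects can be passed in place of the lists once their header rows have been consumed with next().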