How do I merge 2 or more csv files with time overlap data? For e.g.,
data1 is
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.84005458 0 0.0093211568 0
0.94034343 0 0.0094739951 0
data2 is
Time u v w
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
What I want is:
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
So the last few rows of data from the 1st csv file is deleted and the 2nd csv file is merged so that the time sequence is increasing.
How can that be done? Thanks.
CodePudding user response:
If both files are individually ordered by time already. Using for loop is enough:
# csv cell should be separated by comma, change if required
dilimeter = ','
# open files and read lines
f1 = open('data1.csv', 'r')
f1_lines = f1.readlines()
f1.close()
f2 = open('data2.csv', 'r')
f2_lines = f2.readlines()
f2.close()
# extract header
output_lines = [f1_lines[0]]
# start scanning frome line 2 of both files (line 1 is header)
f1_index = 1
f2_index = 1
while True:
# all data1 are processed, append remaining lines from data2
if f1_index >= len(f1_lines):
output_lines = f2_lines[f2_index:]
break
# all data2 are processed, append remaining lines from data1
if f2_index >= len(f2_lines):
output_lines = f1_lines[f1_index:]
break
f1_line_time = float(f1_lines[f1_index].split(dilimeter)[0]) # get the time cell of data1
f2_line_time = float(f2_lines[f2_index].split(dilimeter)[0]) # get the time cell of data2
if f1_line_time < f2_line_time:
output_lines.append(f1_lines[f1_index])
f1_index = 1
elif f1_lines == f2_line_time:
# if they are equal in time, pick one
output_lines.append(f1_lines[f1_index])
f1_index = 1
f2_index = 1
else:
output_lines.append(f2_lines[f2_index])
f2_index = 1
f_output = open('out.csv', 'w')
f_output.write(''.join(output_lines))
f_output.close()
CodePudding user response:
Another option:
import csv
delimiter = " "
with open("data1.csv", "r") as fin1,\
open("data2.csv", "r") as fin2,\
open("data.csv", "w") as fout:
reader1 = csv.reader(fin1, delimiter=delimiter)
reader2 = csv.reader(fin2, delimiter=delimiter)
writer = csv.writer(fout, delimiter=delimiter)
next(reader2)
first_row = next(reader2)
start2 = float(first_row[0])
writer.writerow(next(reader1))
for row in reader1:
if float(row[0]) < start2:
writer.writerow(row)
else:
break
writer.writerow(first_row)
writer.writerows(reader2)
Assumption is that the files are already ordered individually:
- First take the first data row of
data2.csv
and convert its first entry into a floatstart2
. - With that in mind write all rows from
data1.csv
with a time less thanstart2
into the new filedata.csv
, and break out of the loop once the condition isn't met anymore. - Then write the already extracted first data row from
data2.csv
todata.csv
, and afterwards write the rest ofdata2.csv
todata.csv
.
Result for
data1.csv
:
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.84005458 0 0.0093211568 0
0.94034343 0 0.0094739951 0
data2.csv
:
Time u v w
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
is
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
CodePudding user response:
Python has an excellent built in library function to help with this called heapq.merge()
.
Assuming your data is space delimited, you could use this as follows:
from heapq import merge
import csv
with open('data1.csv') as f_data1, open('data2.csv') as f_data2, open('output.csv', 'w', newline='') as f_output:
csv_data1 = csv.reader(f_data1, delimiter=' ', skipinitialspace=True)
csv_data2 = csv.reader(f_data2, delimiter=' ', skipinitialspace=True)
csv_output = csv.writer(f_output, delimiter=' ')
header1 = next(csv_data1)
header2 = next(csv_data2)
csv_output.writerow(header1)
for row in merge(csv_data1, csv_data2, key=lambda x: float(x[0])):
csv_output.writerow(row)
This would produce a CSV output format as:
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84005458 0 0.0093211568 0
0.84043291 0 0.0095214359 0
0.94034343 0 0.0094739951 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0