So here's my problem:
I have two CSV files, each with around 500,000 lines.
File 1 looks like this:
ID|NAME|OTHER INFO
353253453|LAURENT|STUFF 1
563636345|MARK|OTHERS
786970908|GEORGES|THINGS
File 2 looks like this:
LOCATION;ID_PERSON;PHONE
CA;786970908;555555
NY;353253453;555666
So what I have to do is look for lines with the same ID, append the line from file 2 to the end of the corresponding line from file 1 in a new file, and if there's no corresponding ID, add empty columns, like this:
ID;NAME;OTHER INFO;LOCATION;ID_PERSON;PHONE
353253453;LAURENT;STUFF 1;NY;353253453;555666
563636345;MARK;OTHERS;;;
786970908;GEORGES;THINGS;CA;786970908;555555
File 1 is the primary one if that makes sense.
The thing is, I have found a solution, but it takes way too long, since for each line of file 1 I loop through all of file 2.
Here's my code:
input1 = open(filename1, 'r', errors='ignore')
input2 = open(filename2, 'r', errors='ignore')
output = open('result.csv', 'w', newline='')
file2 = input2.readlines()  # file 2 kept in memory as a list of lines
nbr_col_2 = 3  # number of columns in file 2
for line1 in input1:
    line_splitted = line1.rstrip('\n').split("|")
    id_1 = line_splitted[0]
    index = 0
    find = False
    for line2 in file2:
        file2_splitted = line2.split(";")
        if id_1 in file2_splitted[1]:
            output.write(";".join(line_splitted) + ";" + line2)
            find = True
            file2.remove(line2)
            break
        index += 1
    if index == len(file2) and not find:
        output.write(";".join(line_splitted))
        for j in range(nbr_col_2):
            output.write(";")
        output.write("\n")
So I was wondering if there is a faster way to do that, or if I just have to be patient, because right now, after 20 minutes, only 20,000 lines have been written...
CodePudding user response:
First read file2 line by line in order to build the lookup dict.
Then read file1 line by line, look the ID key up in the dict, build the output line, and write it to the output file.
This should be quite efficient: roughly O(n + m), one pass over each file with O(1) average-case dict lookups, instead of O(n * m) for the nested loops. Only the dict consumes a bit of memory; all the file processing is "stream-based".
with open("file2.txt") as f:
lookup = {}
f.readline() # skip header
while True:
line = f.readline().rstrip()
if not line:
break
fields = line.split(";")
lookup[fields[1]] = line
with open("file1.txt") as f, open("output.txt", "w") as out:
f.readline() # skip header
out.write("ID;NAME;OTHER INFO;LOCATION;ID_PERSON;PHONE\n")
while True:
line_in = f.readline().rstrip()
if not line_in:
break
fields = line_in.split("|")
line_out = ";".join(fields)
if found := lookup.get(fields[0]):
line_out = ";" found
else:
line_out = ";;;"
out.write(line_out "\n")
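One caveat with this sketch: rstrip() turns a line containing only a newline into an empty string, so the if not line: break test stops at the first blank line, even in the middle of the file. Iterating over the file object directly sidesteps that; a minimal variant of the second loop, reusing the same lookup dict (note that := needs Python 3.8+):

with open("file1.txt") as f, open("output.txt", "w") as out:
    next(f)  # skip the header line
    out.write("ID;NAME;OTHER INFO;LOCATION;ID_PERSON;PHONE\n")
    for raw in f:
        line_in = raw.rstrip("\n")
        if not line_in:
            continue  # skip stray blank lines instead of stopping
        fields = line_in.split("|")
        line_out = ";".join(fields)
        if found := lookup.get(fields[0]):
            line_out += ";" + found
        else:
            line_out += ";;;"
        out.write(line_out + "\n")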
CodePudding user response:
As Alex pointed out in his comment, you can merge both files using pandas.
import pandas as pd
# Load files
file_1 = pd.read_csv("file_1.csv", index_col=0, delimiter="|")
file_2 = pd.read_csv("file_2.csv", index_col=1, delimiter=";")
# Rename ID_PERSON as ID
file_2.index.name = "ID"
# Merge files
file_3 = file_1.merge(file_2, how="left", on="ID")
file_3.to_csv("file_3.csv")
Using your example files, file_3.csv looks like this:
ID,NAME,OTHER INFO,LOCATION,PHONE
353253453,LAURENT,STUFF 1,NY,555666.0
563636345,MARK,OTHERS,,
786970908,GEORGES,THINGS,CA,555555.0
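Note that to_csv writes commas by default, ID_PERSON disappears because it was absorbed into the index, and PHONE becomes a float because of the NaN in the unmatched row. If you want output closer to the semicolon-separated layout from the question, a sketch along the same lines (untested against the real files):

import pandas as pd

# Read both files as plain columns (no index) so ID_PERSON survives the merge
file_1 = pd.read_csv("file_1.csv", delimiter="|")
file_2 = pd.read_csv("file_2.csv", delimiter=";")

# Left join: keep every row of file_1, matching on the two ID columns
file_3 = file_1.merge(file_2, how="left", left_on="ID", right_on="ID_PERSON")

# Write with the semicolon delimiter used in the expected output
file_3.to_csv("file_3.csv", sep=";", index=False)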
Extra
By the way, if you are not familiar with pandas, this is a great introductory course: Learn Pandas Tutorials
CodePudding user response:
You can build an index to avoid iterating over file2 for every line of file1: create a dictionary from file2, keyed by ID, then retrieve each related item by looking up its key directly.
file1 = open(filename1, 'r', errors='ignore')
file2 = open(filename2, 'r', errors='ignore')
output = open('result.csv', 'w', newline='')

indexed_data = {}
for line2 in file2.readlines()[1:]:
    data2 = line2.rstrip('\n').split(";")
    indexed_data[data2[1]] = {
        'Location': data2[0],
        'Phone': data2[2],
    }

output.write('ID;NAME;OTHER INFO;LOCATION;ID_PERSON;PHONE\n')
for line1 in file1.readlines()[1:]:
    data1 = line1.rstrip('\n').split("|")
    if data1[0] in indexed_data:
        output.write(f'{data1[0]};{data1[1]};{data1[2]};{indexed_data[data1[0]]["Location"]};{data1[0]};{indexed_data[data1[0]]["Phone"]}\n')
    else:
        output.write(f'{data1[0]};{data1[1]};{data1[2]};;;\n')
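The same indexing idea can also be written with the standard csv module, which takes care of splitting, quoting, and (via with) closing the files. A rough sketch, assuming file 2 always has exactly the three columns shown in the question:

import csv

# Build the lookup from file 2: ID_PERSON -> (LOCATION, ID_PERSON, PHONE)
indexed_data = {}
with open(filename2, newline='', errors='ignore') as f2:
    reader = csv.reader(f2, delimiter=';')
    next(reader)  # skip header
    for location, id_person, phone in reader:
        indexed_data[id_person] = (location, id_person, phone)

with open(filename1, newline='', errors='ignore') as f1, \
        open('result.csv', 'w', newline='') as out:
    reader = csv.reader(f1, delimiter='|')
    writer = csv.writer(out, delimiter=';')
    next(reader)  # skip header
    writer.writerow(['ID', 'NAME', 'OTHER INFO', 'LOCATION', 'ID_PERSON', 'PHONE'])
    for row in reader:
        # append the matching file-2 fields, or three empty columns
        writer.writerow(row + list(indexed_data.get(row[0], ('', '', ''))))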