So here's my problem:
I have two CSV files, each with around 500,000 lines.
File 1 looks like this:
ID|NAME|OTHER INFO
353253453|LAURENT|STUFF 1
563636345|MARK|OTHERS
786970908|GEORGES|THINGS
File 2 looks like this:
LOCATION;ID_PERSON;PHONE
CA;786970908;555555
NY;353253453;555666
So what I have to do is look for lines with the same ID, append the line from file 2 to the end of the corresponding line from file 1 in a new file, and if there's no corresponding ID, add empty columns, like this:
ID;NAME;OTHER INFO;LOCATION;ID_PERSON;PHONE
353253453;LAURENT;STUFF 1;NY;353253453;555666
563636345;MARK;OTHERS;;;
786970908;GEORGES;THINGS;CA;786970908;555555
File 1 is the primary one if that makes sense.
The thing is, I have found a solution, but it takes way too long, since for each line of file 1 I loop through all of file 2.
Here's my code:
input1 = open(filename1, 'r', errors='ignore')
input2 = open(filename2, 'r', errors='ignore')
output = open('result.csv', 'w', newline='')
file2 = input2.readlines()  # file 2 kept in memory as a list of lines
nbr_col_2 = 3  # number of columns in file 2
for line1 in input1:
    line_splitted = line1.rstrip('\n').split("|")
    id_1 = line_splitted[0]
    index = 0
    find = False
    for line2 in file2:
        file2_splitted = line2.split(";")
        if id_1 in file2_splitted[1]:
            output.write(";".join(line_splitted) + ";" + line2)
            find = True
            file2.remove(line2)
            break
        index += 1
    if index == len(file2) and not find:
        output.write(";".join(line_splitted))
        for j in range(nbr_col_2):
            output.write(";")
        output.write("\n")
So I was wondering if there is a faster way to do that, or if I just have to be patient, because right now, after 20 minutes, only 20,000 lines have been written...
CodePudding user response:
First read file2 line by line in order to build the lookup dict.
Then read file1 line by line, look the ID key up in the dict, build the output line, and write it to the output file.
This should be quite efficient: roughly O(n + m), one pass over each file with O(1) average-case dict lookups, instead of O(n * m) for the nested loops. Only the dict consumes a bit of memory; all the file processing is "stream-based".
with open("file2.txt") as f:
lookup = {}
f.readline() # skip header
while True:
line = f.readline().rstrip()
if not line:
break
fields = line.split(";")
lookup[fields[1]] = line
with open("file1.txt") as f, open("output.txt", "w") as out:
f.readline() # skip header
out.write("ID;NAME;OTHER INFO;LOCATION;ID_PERSON;PHONE\n")
while True:
line_in = f.readline().rstrip()
if not line_in:
break
fields = line_in.split("|")
line_out = ";".join(fields)
if found := lookup.get(fields[0]):
line_out = ";" found
else:
line_out = ";;;"
out.write(line_out "\n")
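One caveat with this sketch: rstrip() turns a line containing only a newline into an empty string, so the if not line: break test stops at the first blank line, even in the middle of the file. Iterating over the file object directly sidesteps that; a minimal variant of the second loop, reusing the same lookup dict (note that := needs Python 3.8+):

with open("file1.txt") as f, open("output.txt", "w") as out:
    next(f)  # skip the header line
    out.write("ID;NAME;OTHER INFO;LOCATION;ID_PERSON;PHONE\n")
    for raw in f:
        line_in = raw.rstrip("\n")
        if not line_in:
            continue  # skip stray blank lines instead of stopping
        fields = line_in.split("|")
        line_out = ";".join(fields)
        if found := lookup.get(fields[0]):
            line_out += ";" + found
        else:
            line_out += ";;;"
        out.write(line_out + "\n")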
CodePudding user response:
As Alex pointed out in his comment, you can merge both files using pandas.
import pandas as pd
# Load files
file_1 = pd.read_csv("file_1.csv", index_col=0, delimiter="|")
file_2 = pd.read_csv("file_2.csv", index_col=1, delimiter=";")
# Rename ID_PERSON as ID
file_2.index.name = "ID"
# Merge files
file_3 = file_1.merge(file_2, how="left", on="ID")
file_3.to_csv("file_3.csv")
Using your example files, file_3.csv looks like this:
ID,NAME,OTHER INFO,LOCATION,PHONE
353253453,LAURENT,STUFF 1,NY,555666.0
563636345,MARK,OTHERS,,
786970908,GEORGES,THINGS,CA,555555.0
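Note that to_csv writes commas by default, ID_PERSON disappears because it was absorbed into the index, and PHONE becomes a float because of the NaN in the unmatched row. If you want output closer to the semicolon-separated layout from the question, a sketch along the same lines (untested against the real files):

import pandas as pd

# Read both files as plain columns (no index) so ID_PERSON survives the merge
file_1 = pd.read_csv("file_1.csv", delimiter="|")
file_2 = pd.read_csv("file_2.csv", delimiter=";")

# Left join: keep every row of file_1, matching on the two ID columns
file_3 = file_1.merge(file_2, how="left", left_on="ID", right_on="ID_PERSON")

# Write with the semicolon delimiter used in the expected output
file_3.to_csv("file_3.csv", sep=";", index=False)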
Extra
By the way, if you are not familiar with pandas, this is a great introductory course: Learn Pandas Tutorials
CodePudding user response:
You can build an index to avoid iterating over file2 for every line of file1: create a dictionary from file2, keyed by ID, then retrieve each related item by looking up its key directly.
file1 = open(filename1, 'r', errors='ignore')
file2 = open(filename2, 'r', errors='ignore')
output = open('result.csv', 'w', newline='')

indexed_data = {}
for line2 in file2.readlines()[1:]:
    data2 = line2.rstrip('\n').split(";")
    indexed_data[data2[1]] = {
        'Location': data2[0],
        'Phone': data2[2],
    }

output.write('ID;NAME;OTHER INFO;LOCATION;ID_PERSON;PHONE\n')
for line1 in file1.readlines()[1:]:
    data1 = line1.rstrip('\n').split("|")
    if data1[0] in indexed_data:
        output.write(f'{data1[0]};{data1[1]};{data1[2]};{indexed_data[data1[0]]["Location"]};{data1[0]};{indexed_data[data1[0]]["Phone"]}\n')
    else:
        output.write(f'{data1[0]};{data1[1]};{data1[2]};;;\n')
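The same indexing idea can also be written with the standard csv module, which takes care of splitting, quoting, and (via with) closing the files. A rough sketch, assuming file 2 always has exactly the three columns shown in the question:

import csv

# Build the lookup from file 2: ID_PERSON -> (LOCATION, ID_PERSON, PHONE)
indexed_data = {}
with open(filename2, newline='', errors='ignore') as f2:
    reader = csv.reader(f2, delimiter=';')
    next(reader)  # skip header
    for location, id_person, phone in reader:
        indexed_data[id_person] = (location, id_person, phone)

with open(filename1, newline='', errors='ignore') as f1, \
        open('result.csv', 'w', newline='') as out:
    reader = csv.reader(f1, delimiter='|')
    writer = csv.writer(out, delimiter=';')
    next(reader)  # skip header
    writer.writerow(['ID', 'NAME', 'OTHER INFO', 'LOCATION', 'ID_PERSON', 'PHONE'])
    for row in reader:
        # append the matching file-2 fields, or three empty columns
        writer.writerow(row + list(indexed_data.get(row[0], ('', '', ''))))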