comparing large csv files


Which method is more efficient for comparing two large (8GB & 5GB) csv files? The output should contain every id that is not in file1.

The data is a single column with GUIDs.

Method 1:

import pandas as pd

df = pd.read_csv(file)      # the larger file
df1 = pd.read_csv(file1)    # the file whose ids should be excluded

# keep only the ids that appear in df but not in df1
df = df.merge(df1, on=['id'], how="outer", indicator=True).query('_merge == "left_only"')
df['id'].to_csv(output_path, index=False)

Method 2:

# load every line of file1 into a set for fast membership tests
with open(file1, 'r') as t1:
    ids_in_file1 = set(t1)

# stream the larger file line by line and keep the ids file1 doesn't have
with open(file, 'r') as t2, open(output_path, 'w') as outFile:
    for line in t2:
        if line not in ids_in_file1:
            outFile.write(line)

CodePudding user response:

What do you mean by efficiency? There are certainly two major differences:

  1. The first method, which uses pandas, needs to hold all the data in memory, so you will need enough available memory for both csv files at once (note: 5 + 8 GB of RAM may not be enough, since a DataFrame usually takes more space than the file on disk, but it will depend on the type of data in the csv files).

  2. The second method takes advantage of Python's lazy file iteration and reads the larger file line by line, loading one line into memory at a time (only the smaller file1 has to fit entirely in memory, as a set).

So if you have enough memory available, it will certainly be faster to load all the data and do the operations in memory.

If you don't have enough memory available, the second method works but is definitely slower. A good compromise might be to read the file in chunks, loading only as much data as your hardware can handle at a time, as in the sketch below.
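
For example, a rough sketch of that compromise could look like the following. It still keeps the smaller file (file1) in memory as a set, but reads the larger file in pandas chunks instead of all at once; the 'id' column name, the header handling and the chunk size are assumptions carried over from the question's first method.

import pandas as pd

# sketch only: assumes both files have an "id" header line and that
# file1 (the smaller one) still fits in memory as a set of ids
with open(file1, 'r') as t1:
    next(t1)                                    # skip the header line
    ids_in_file1 = {line.strip() for line in t1}

first_chunk = True
for chunk in pd.read_csv(file, chunksize=1_000_000):
    # keep only the ids that file1 does not contain
    missing = chunk[~chunk['id'].isin(ids_in_file1)]
    missing['id'].to_csv(output_path,
                         mode='w' if first_chunk else 'a',
                         header=first_chunk, index=False)
    first_chunk = False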

Extras

To estimate the memory space used by your DataFrame you can read this nice post:

Here you can find further details explaining how to read a file in chunks, with or without pandas.
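
On the first point, a quick sanity check you can also do directly in pandas (without reading the whole file) is to load a small sample and ask the DataFrame how much memory it actually uses; the sample size below is an arbitrary choice.

import pandas as pd

# read only a sample of the big file to estimate memory per row
sample = pd.read_csv(file, nrows=100_000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
print(f"roughly {bytes_per_row:.0f} bytes per row in memory")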

CodePudding user response:

If this is something you'll have to run multiple times, you can just wrap each method with start = time.time() at the beginning and execution_time = time.time() - start at the end to compare speed. To compare memory usage, you can check out the memory_profiler package.
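
For example, a minimal harness along those lines might look like this; run_comparison is just a placeholder for whichever of the two methods you are testing.

import time
from memory_profiler import profile   # pip install memory_profiler

@profile                               # prints line-by-line memory usage
def run_comparison():
    # ... method 1 or method 2 goes here ...
    pass

start = time.time()
run_comparison()
execution_time = time.time() - start
print(f"execution time: {execution_time:.1f} s")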
