How to compare two files with one billion records each


I am trying to read two files, each with one billion records, from a Unix server in order to compare them.

I first tried Python with the paramiko package, but connecting to the server and reading the Unix files that way was very slow, which is why I switched to Java.
In Java, however, reading the files gives me memory and performance problems.

My requirement: first read all records from file1 on the Unix server, then read the records of the second file from the Unix server, and finally compare the two files.

CodePudding user response:

It sounds like you want to process huge files. The rule of thumb is that they will exceed your RAM, so never expect to read them in whole at once.

Instead, read meaningful chunks, process them, and then forget them. Meaningful chunks could be characters, words, lines, expressions, or objects; a sketch of this approach in Java follows.
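A minimal sketch of that idea, assuming the records are newline-delimited and both files list them in the same order (the file names are placeholders):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class FileCompare {
        public static void main(String[] args) throws IOException {
            // BufferedReader keeps only one line per file in memory,
            // so memory use stays constant regardless of file size.
            try (BufferedReader a = Files.newBufferedReader(Paths.get("file1.txt"));
                 BufferedReader b = Files.newBufferedReader(Paths.get("file2.txt"))) {
                String lineA = a.readLine();
                String lineB = b.readLine();
                long lineNo = 1;
                while (lineA != null && lineB != null) {
                    if (!lineA.equals(lineB)) {
                        System.out.println("Difference at line " + lineNo);
                        System.out.println("  file1: " + lineA);
                        System.out.println("  file2: " + lineB);
                    }
                    lineA = a.readLine();
                    lineB = b.readLine();
                    lineNo++;
                }
                if (lineA != null || lineB != null) {
                    System.out.println("Files differ in length from line " + lineNo);
                }
            }
        }
    }

If the records are not already in the same order, sort both files first, otherwise a record-by-record comparison will report spurious differences.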

CodePudding user response:

As you are working on UNIX, I would advise you to sort the files and use the diff command-line tool: the UNIX command-line utilities are quite powerful. Please show us an excerpt of the files (you might need cut or awk scripts too); a sketch of driving those tools from Java is below.
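If the comparison still has to be driven from Java, one option is to let those UNIX tools do the heavy lifting through ProcessBuilder. This is only a sketch: it assumes the JVM runs on the same server as the files, and the file names are placeholders.

    import java.io.IOException;

    public class DiffRunner {
        public static void main(String[] args) throws IOException, InterruptedException {
            // GNU sort spills to temporary files on disk, so it copes with
            // inputs far larger than RAM; diff then compares the sorted output.
            run("sort", "-o", "file1.sorted", "file1.txt");
            run("sort", "-o", "file2.sorted", "file2.txt");
            run("diff", "file1.sorted", "file2.sorted");
        }

        private static void run(String... cmd) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            int exit = p.waitFor();
            // diff exits with 1 when the files differ; anything above 1 is an error.
            if (exit > 1) {
                throw new IOException(String.join(" ", cmd) + " exited with code " + exit);
            }
        }
    }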
