How to compare two large CSV files in Java


I need to compare two large CSV files and find the differences.

First CSV file will be like:

c71f55b6c18248b8915d8a26
64b7d2d4eab74d7999a967c0
ceb792ad21054fe0a27ec410
95319566f9424c57ba2145f9
682a4fe26c154050b8f5c6f1
88e0209e2af74049ad9bf2bd
5c462b42763d41d7bb67029f
0ee74c227fc84e39a9ecc1da
66f7ab6f56374ba08d2fb92d
3ed793e35f9441b58562c9ba
baad81ac8ba54188afe63fb8
...

Each row holds just one id, and the total row count is approximately 5 million. The second CSV file looks like the first one, with a total row count of 3 million.

I need to remove the ids of the second CSV from the first CSV and put the remaining ids into MongoDB. When I read all lines into memory and then compare both CSV files, I get an out-of-memory error. I have 512 MB of memory available and will get at least 30 requests a day. The row count of the CSVs varies between 1 million and 10 million, and I can receive two requests at the same time, which must be processed simultaneously.

Is there another way to do this?

Thanks.

CodePudding user response:

If you need to manage the data in Java, you can use a Set as the basic data structure to hold your data:

A collection that contains no duplicate elements

In particular, in your case the best choice would be a HashSet of strings, because:

This class offers constant time performance for the basic operations (add, remove, contains and size)

This means that adding and removing items from a HashSet does not depend on the number of items already present in it. Holding 10,000,000 strings of 24 characters can be done with about half a gigabyte of RAM (each such string costs on the order of 50 bytes once object overhead is included), so you can hold everything in memory; but consider that 10,000,000 is roughly your upper limit if you are restricted to half a gigabyte of RAM.

The code can be something like this (a fragment; the file paths are placeholders):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

Set<String> items = new HashSet<>();

// Add every id from the first file
// (the enclosing method must handle IOException)
try (BufferedReader first = Files.newBufferedReader(Paths.get("first.csv"))) {
    first.lines().forEach(items::add);
}

// Remove every id that also appears in the second file
try (BufferedReader second = Files.newBufferedReader(Paths.get("second.csv"))) {
    second.lines().forEach(items::remove);
}

// Here the set contains all items of the first CSV without the items
// that are also present in the second CSV

CodePudding user response:

For performance reasons, you should keep a representation of the second file in memory, so you can loop through the first file, check whether the entry is contained in the second one, and if not, insert the entry into MongoDB.

The representation of the second file should:

  • be compact, so as not to consume too much memory,
  • allow a fast "contains" check.

Your data entries all seem to consist of exactly 24 hex digits. If that's true, you can represent them as 96-bit numbers instead of Strings. The most straightforward approach is:

String entry = ...;  // one 24-hex-digit id read from the file
BigInteger value = new BigInteger(entry, 16);

Then, you use a Set<BigInteger> instead of a Set<String>, with considerably lower memory consumption. I'd try both HashSet and TreeSet, but I'm concerned that their memory overhead per entry might still be too much.
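
For illustration, here is a minimal sketch of that set-based variant, assuming every line of the second file is one 24-hex-digit id (the file name is a placeholder):

import java.io.BufferedReader;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Build the in-memory representation of the second file
// (the enclosing method must handle IOException)
Set<BigInteger> secondFileIds = new HashSet<>();
try (BufferedReader second = Files.newBufferedReader(Paths.get("second.csv"))) {
    second.lines()
          .map(line -> new BigInteger(line, 16))
          .forEach(secondFileIds::add);
}
// Then stream the first file and insert into MongoDB every id whose
// BigInteger value is not contained in secondFileIds.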

So, it might be necessary to create your own data structure, e.g. using the first (highest) 16 bits as an index into an array of size 65536, where each element is a List of the file-two BigIntegers starting with that 16-bit prefix. This should give low memory overhead and decent contains() performance, and should need at most 50 lines of code; a sketch follows below.
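
As a rough illustration of that idea, not a tuned implementation, the structure could look like the following (the class and method names are made up for the example, and it assumes every id is exactly 24 hex digits):

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

class HexIdSet {

    // One bucket per possible value of the highest 16 bits,
    // i.e. the first 4 hex digits of a 24-digit id
    @SuppressWarnings("unchecked")
    private final List<BigInteger>[] buckets = new List[1 << 16];

    private static int bucketIndex(String hexId) {
        return Integer.parseInt(hexId.substring(0, 4), 16);
    }

    void add(String hexId) {
        int i = bucketIndex(hexId);
        if (buckets[i] == null) {
            buckets[i] = new ArrayList<>();
        }
        buckets[i].add(new BigInteger(hexId, 16));
    }

    boolean contains(String hexId) {
        List<BigInteger> bucket = buckets[bucketIndex(hexId)];
        return bucket != null && bucket.contains(new BigInteger(hexId, 16));
    }
}

With about 3,000,000 entries spread over 65,536 buckets, each bucket holds roughly 46 values on average, so the linear scan inside contains() stays cheap.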
