How to count unique csv/parquet rows?

Is there a better algorithm for counting unique rows in a CSV/Parquet file than storing every row in a HashMap or, if the file is big, reading one row at a time and comparing it against the rest of the file?

CodePudding user response：

If you just want to count unique lines in a csv file, you could use Java NIO and streams:

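// imports: java.nio.file.Files, java.nio.file.Path, java.util.stream.Stream
long uniqueRows;
// Files.lines keeps the file open, so close the stream with try-with-resources
try (Stream<String> lines = Files.lines(Path.of("/path/to/file.csv"))) {
    uniqueRows = lines
            .skip(1) // skip the CSV header
            .distinct()
            .count();
}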

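This way the file is read lazily, line by line, so it is never loaded into memory as a whole. Note, however, that distinct() has to remember every unique line it has already seen, so memory use grows with the number of distinct rows rather than with the size of the file.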
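To make the memory behaviour of the stream version more concrete, here is a minimal sketch of roughly the bookkeeping that distinct() needs to do, written with an explicit HashSet. The class and method names (DistinctLineCounter, countDistinct, countDistinctRows) are made up for this illustration, and this is not how the JDK implements distinct() internally.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

public class DistinctLineCounter {

    // Counts distinct lines by remembering every line seen so far.
    // The set ends up holding one entry per unique line.
    public static long countDistinct(Stream<String> lines) {
        Set<String> seen = new HashSet<>();
        Iterator<String> it = lines.iterator(); // pulls one line at a time
        while (it.hasNext()) {
            seen.add(it.next());
        }
        return seen.size();
    }

    // Reads the file lazily, skips the header row and counts the rest.
    public static long countDistinctRows(Path csv) throws IOException {
        try (Stream<String> lines = Files.lines(csv)) {
            return countDistinct(lines.skip(1));
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data written to a temporary file
        Path csv = Files.createTempFile("rows", ".csv");
        Files.write(csv, List.of("id,name", "1,a", "2,b", "1,a"));
        System.out.println(countDistinctRows(csv)); // prints 2
        Files.delete(csv);
    }
}
```

Only one line is read from disk at a time, but the set keeps every unique row, so a file with mostly unique rows still needs memory close to its own size.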

CodePudding user response：

I think this version might be a bit better memory-wise than one that reads all the lines into memory, or even than one that keeps only the unique rows in memory, as long as we know that the rows are big.

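It temporarily stores the hashCode of each row instead of the row itself (which should be cheaper when the rows are big; please do correct me if I'm wrong here). One caveat: String.hashCode() is only 32 bits, so two different rows can end up with the same hash code and be counted as one, which means the result can undercount.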

// `file` is a java.io.File
val uniqueCount = file.bufferedReader()
    // `use` closes the bufferedReader once we are done with it (also in case of an exception)
    .use { reader ->
        reader
            // get a sequence of the lines (this is lazy, so it won't load all of them into memory)
            .lineSequence()
            // get the hashCode of each line; I'm assuming here that your rows are big, so it is cheaper to store the hashCodes
            // (map is lazy as well: on a sequence it only runs when a terminal operation such as toSet asks for elements)
            .map(String::hashCode)
            // collecting into a set removes duplicates
            .toSet()
            // number of unique hash codes
            .size
    }

Btw, there is also useLines(), an alternative to lineSequence() that has an embedded use {} in it and still gives you a sequence. But that one makes the code harder to read/understand IMO.
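As a variation on the hashing idea in this answer, one option is to store a longer fingerprint per row instead of the 32-bit hashCode, so that two different rows are far less likely to be merged. The sketch below is in Java and swaps in SHA-256 digests, which is my own substitution and not part of the answer above. The names FingerprintCounter and countDistinctByDigest are made up for the example. Each digest is 32 bytes plus object overhead, so this only pays off when the rows are clearly longer than that.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.stream.Stream;

public class FingerprintCounter {

    private static MessageDigest newSha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is expected to be available on standard JDKs
            throw new IllegalStateException(e);
        }
    }

    // Counts distinct lines by storing a SHA-256 digest per unique line
    // instead of the line itself.
    public static long countDistinctByDigest(Stream<String> lines) {
        MessageDigest sha256 = newSha256();
        // ByteBuffer compares by content, so equal digests collapse in the set
        Set<ByteBuffer> seen = new HashSet<>();
        Iterator<String> it = lines.iterator();
        while (it.hasNext()) {
            byte[] digest = sha256.digest(it.next().getBytes(StandardCharsets.UTF_8));
            seen.add(ByteBuffer.wrap(digest));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are different strings with the same String.hashCode()
        System.out.println("Aa".hashCode() == "BB".hashCode()); // prints true
        // The digest-based count still tells them apart
        System.out.println(countDistinctByDigest(Stream.of("Aa", "BB", "Aa"))); // prints 2
    }
}
```

To run it over a file, the stream can come from Files.lines(...) inside a try-with-resources block, as in the first answer.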