Filter out duplicate files (by content) in list of files in Java

Time:09-19

I want to filter out duplicate files (by their content) in a list of files in Java 8.

For example:

List<File> files = Arrays.asList(new File("a.csv"), new File("b.csv"));

Now let's say that a.csv and b.csv have the same contents. What would be the best way to keep only one file per distinct content in this list? (In this example, either a.csv or b.csv should remain in the list, but not both.)

We can use file checksums to filter the unique files out. Is there a better way?

CodePudding user response:

As you said, you can use a checksum, e.g. MD5.

File file = new File("foo.bar");
MessageDigest md = MessageDigest.getInstance("MD5");
// checksum(...) is not a JDK method; it stands for a helper that feeds the file's bytes to md
String checksum = checksum(md, file);

Comparing the contents of every file against every other file directly would also be possible, but it is expensive.
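The `checksum(md, file)` call in the snippet is not part of the JDK, so here is a minimal Java 8 sketch of how it could be written and used to filter the list. The class name `ChecksumDedup` and the method `uniqueByChecksum` are made up for illustration. Note that two files with the same MD5 are only very likely identical, not provably so; `"SHA-256"` can be passed to `MessageDigest.getInstance` instead if that matters.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ChecksumDedup {

    // Hypothetical helper: streams the file through the digest and
    // returns the result as a zero-padded hex string.
    static String checksum(MessageDigest md, File file) throws IOException {
        md.reset();
        try (InputStream in = Files.newInputStream(file.toPath())) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        byte[] digest = md.digest();
        // Pad to two hex characters per digest byte (32 for MD5).
        return String.format("%0" + (digest.length * 2) + "x", new BigInteger(1, digest));
    }

    // Keeps the first file seen for each distinct checksum, in list order.
    static List<File> uniqueByChecksum(List<File> files)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        Set<String> seen = new HashSet<>();
        List<File> unique = new ArrayList<>();
        for (File f : files) {
            if (seen.add(checksum(md, f))) {
                unique.add(f);
            }
        }
        return unique;
    }

    public static void main(String[] args) throws Exception {
        List<File> files = new ArrayList<>();
        for (String arg : args) {
            files.add(new File(arg));
        }
        uniqueByChecksum(files).forEach(System.out::println);
    }
}
```

With the list from the question, `ChecksumDedup.uniqueByChecksum(files)` would keep a.csv and drop b.csv, since the first file with a given checksum wins.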

CodePudding user response:

To avoid reading every file in full to compute checksums, you can first group the files by length. A file whose length is unique cannot be a duplicate, so only the groups that contain more than one file need to be compared, for example with Files.mismatch.

Files.mismatch was only added in Java 12, so as you require an older JDK, you can use a checksum or some other file comparison utility in its place to detect duplicate contents within each group.
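As a rough illustration of the grouping idea, here is a Java 8 sketch. The names `SizeGroupDedup`, `uniqueByContent` and `sameContent` are invented for this example; `sameContent` is a hand-written stand-in for the check `Files.mismatch(a, b) == -1`, which could replace it on Java 12 or later.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SizeGroupDedup {

    // Hypothetical Java 8 replacement for Files.mismatch(a, b) == -1:
    // compares two files byte by byte and stops at the first difference.
    static boolean sameContent(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;
        }
        try (InputStream x = new BufferedInputStream(Files.newInputStream(a.toPath()));
             InputStream y = new BufferedInputStream(Files.newInputStream(b.toPath()))) {
            int bx;
            while ((bx = x.read()) != -1) {
                if (bx != y.read()) {
                    return false;
                }
            }
            return y.read() == -1;
        }
    }

    static List<File> uniqueByContent(List<File> files) throws IOException {
        // Group by length; LinkedHashMap keeps the sizes in first-seen order.
        Map<Long, List<File>> bySize = new LinkedHashMap<>();
        for (File f : files) {
            bySize.computeIfAbsent(f.length(), k -> new ArrayList<>()).add(f);
        }
        List<File> unique = new ArrayList<>();
        for (List<File> group : bySize.values()) {
            // Within one size group, keep a file only if it differs
            // from every file already kept from that group.
            List<File> kept = new ArrayList<>();
            for (File candidate : group) {
                boolean duplicate = false;
                for (File k : kept) {
                    if (sameContent(k, candidate)) {
                        duplicate = true;
                        break;
                    }
                }
                if (!duplicate) {
                    kept.add(candidate);
                }
            }
            unique.addAll(kept);
        }
        return unique;
    }
}
```

A file that is alone in its size group is never opened at all. The result is ordered by size group rather than by the original list order, and a group with many same-sized but different files still needs pairwise comparisons, which is where a checksum per file may be the cheaper choice.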
