How to find duplicate files ( binary or text ) in directories recursively ( based on content and NOT

Time:05-11

I have a directory with sub-directories that contain text or binary files (such as pictures). I need to find duplicate files, which may be in different sub-directories and have different names. So I need an algorithm that looks inside the files and does NOT rely on the file name or the file length.

CodePudding user response:

I came up with a quick solution. This code could be written much better, but functionally it works. I even tested it on JPEG and GIF files.

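import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Files grouped by the MD5 hash of their content, and by their size in bytes.
public static Map<String, List<File>> mapFilesHash = new HashMap<String, List<File>>();
public static Map<Long, List<File>> mapFilesSize = new HashMap<Long, List<File>>();

public static MessageDigest md;
static {
    try {
        md = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException ex) {
        // MD5 is required on every Java platform, so this should never happen.
        throw new IllegalStateException(ex);
    }
}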

private static String checksum(File file) throws IOException {
    // try-with-resources closes the stream even if reading fails.
    try (FileInputStream fis = new FileInputStream(file)) {
        byte[] byteArray = new byte[1024];
        int bytesCount;
        while ((bytesCount = fis.read(byteArray)) != -1) {
            md.update(byteArray, 0, bytesCount);
        }
    }
    byte[] bytes = md.digest();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bytes.length; i++) {
        sb.append(Integer.toString((bytes[i] & 0xff) + 0x100, 16).substring(1));
    }
    return sb.toString();
}


public static void findDuplicateFiles(File rootDir) throws Exception {
    iterateOverDirectory(rootDir);
    System.out.println("based on hash " + mapFilesHash.size());
    for (List<File> files: mapFilesHash.values()) {
        if (files.size() > 1 ) {
            System.out.println(files);
        }
    }
    
}

private static void iterateOverDirectory (File rootDir) throws Exception {
    for (File file : rootDir.listFiles()) {
        if (file.isDirectory()) {
            iterateOverDirectory(file);
        } else {
            if (mapFilesSize.get(file.length()) == null) {
                mapFilesSize.put(file.length(), new ArrayList<>());
            }
            mapFilesSize.get(file.length()).add(file);

            String md5hash = checksum(file);
            if (mapFilesHash.get(md5hash) == null) {
                mapFilesHash.put(md5hash, new ArrayList<>());
            }
            mapFilesHash.get(md5hash).add(file);
        }
    }
}
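
The code above fills mapFilesSize but never reads it. One possible use, sketched below, is as a cheap pre-filter: a file whose size is unique cannot have a duplicate, so only files that share a size need to be hashed. The final decision is still based on content. This is only a sketch under my own assumptions, not the original method: the class and method names (SizeThenHashDuplicateFinder, sha256, findDuplicates) are made up for illustration, Files.walk replaces the hand-written recursion, SHA-256 is swapped in for MD5, and a fresh MessageDigest is created per file instead of sharing a static one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class; the name is not from the original answer.
class SizeThenHashDuplicateFinder {

    // Hex-encoded SHA-256 of a file's content. A new digest per call avoids shared state.
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Returns only the groups that contain two or more files with the same content hash.
    static Map<String, List<Path>> findDuplicates(Path root)
            throws IOException, NoSuchAlgorithmException {
        // Step 1: group every regular file under root by its size in bytes.
        Map<Long, List<Path>> bySize;
        try (Stream<Path> paths = Files.walk(root)) {
            bySize = paths.filter(Files::isRegularFile)
                    .collect(Collectors.groupingBy(p -> {
                        try {
                            return Files.size(p);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    }));
        }
        // Step 2: hash only the files that share their size with at least one other file.
        Map<String, List<Path>> byHash = new HashMap<>();
        for (List<Path> sameSize : bySize.values()) {
            if (sameSize.size() < 2) {
                continue;
            }
            for (Path p : sameSize) {
                byHash.computeIfAbsent(sha256(p), k -> new ArrayList<>()).add(p);
            }
        }
        // Step 3: drop hashes that belong to a single file.
        byHash.values().removeIf(group -> group.size() < 2);
        return byHash;
    }

    public static void main(String[] args) throws Exception {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        findDuplicates(root).values().forEach(System.out::println);
    }
}
```

Run it with the root directory as the only argument; each printed list is one group of files with identical content hashes. If hash collisions are a concern, the files in a group could additionally be compared byte by byte.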

CodePudding user response:

Without mapFilesSize, your iterateOverDirectory method can become:

private static void iterateOverDirectory(File rootDir) throws Exception {
    for (File file : rootDir.listFiles()) {
        if (file.isDirectory()) {
            iterateOverDirectory(file);
        }
        else {
            mapFilesHash.computeIfAbsent(checksum(file), k -> new ArrayList<>()).add(file);
        }
    }
}
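
To show why the computeIfAbsent one-liner is a drop-in replacement for the explicit null check, here is a small self-contained comparison. It is only an illustration: the class name ComputeIfAbsentDemo and both method names are made up, and it groups strings by length rather than files by hash so that it runs without touching the file system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; the names are not from the answers above.
class ComputeIfAbsentDemo {

    // Explicit null check followed by put, the pattern used in the first answer.
    static Map<Integer, List<String>> groupWithNullCheck(List<String> words) {
        Map<Integer, List<String>> groups = new HashMap<>();
        for (String w : words) {
            if (groups.get(w.length()) == null) {
                groups.put(w.length(), new ArrayList<>());
            }
            groups.get(w.length()).add(w);
        }
        return groups;
    }

    // computeIfAbsent creates the list on first use and returns it in the same call.
    static Map<Integer, List<String>> groupWithComputeIfAbsent(List<String> words) {
        Map<Integer, List<String>> groups = new HashMap<>();
        for (String w : words) {
            groups.computeIfAbsent(w.length(), k -> new ArrayList<>()).add(w);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "c", "dd", "eee");
        // Both approaches build the same map, so this prints true.
        System.out.println(groupWithNullCheck(words).equals(groupWithComputeIfAbsent(words)));
    }
}
```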