I have a directory whose sub-directories contain text or binary files (such as pictures). I need to find duplicate files, which may sit in different sub-directories and have different names. So I need an algorithm that looks inside the files and does NOT rely on the file name or the file length.
CodePudding user response:
I came up with a quick solution. I know this code could be written much better, but functionally it works. I even tested it on JPEG and GIF files.
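import java.io.*;
import java.security.*;
import java.util.*;
public static Map<String, List<File>> mapFilesHash = new HashMap<String, List<File>>();
// filled in iterateOverDirectory; the duplicate report itself only uses mapFilesHash
public static Map<Long, List<File>> mapFilesSize = new HashMap<Long, List<File>>();
public static MessageDigest md;
static {
    try {
        md = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException ex) {
        throw new IllegalStateException(ex); // do not swallow the failure silently
    }
}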
private static String checksum(File file) throws IOException {
    byte[] byteArray = new byte[1024];
    int bytesCount = 0;
    // try-with-resources closes the stream even if a read fails
    try (FileInputStream fis = new FileInputStream(file)) {
        while ((bytesCount = fis.read(byteArray)) != -1) {
            md.update(byteArray, 0, bytesCount);
        }
    }
    byte[] bytes = md.digest(); // digest() also resets md for the next file
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bytes.length; i++) {
        // two lowercase hex digits per byte
        sb.append(Integer.toString((bytes[i] & 0xff) + 0x100, 16).substring(1));
    }
    return sb.toString();
}
public static void findDuplicateFiles(File rootDir) throws Exception {
    iterateOverDirectory(rootDir);
    System.out.println("based on hash " + mapFilesHash.size());
    for (List<File> files : mapFilesHash.values()) {
        if (files.size() > 1) {
            System.out.println(files);
        }
    }
}
private static void iterateOverDirectory(File rootDir) throws Exception {
    File[] entries = rootDir.listFiles();
    if (entries == null) {
        return; // not a directory, or it could not be read
    }
    for (File file : entries) {
        if (file.isDirectory()) {
            iterateOverDirectory(file);
        } else {
            // group by length
            if (mapFilesSize.get(file.length()) == null) {
                mapFilesSize.put(file.length(), new ArrayList<>());
            }
            mapFilesSize.get(file.length()).add(file);
            // group by content hash; this is what identifies the duplicates
            String md5hash = checksum(file);
            if (mapFilesHash.get(md5hash) == null) {
                mapFilesHash.put(md5hash, new ArrayList<>());
            }
            mapFilesHash.get(md5hash).add(file);
        }
    }
}
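One caveat: two different files can in principle produce the same MD5 hash, so if you need certainty you may want to confirm each hash match with a byte-by-byte comparison. Below is a minimal sketch of such a check; the class name ContentCompare and the method sameContent are my own hypothetical names, not part of the code above.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: confirms that two files hold exactly the same bytes.
final class ContentCompare {

    static boolean sameContent(Path a, Path b) throws IOException {
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
             InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
            int byteA;
            do {
                byteA = inA.read();
                int byteB = inB.read();
                if (byteA != byteB) {
                    return false; // a differing byte, or one file ended first
                }
            } while (byteA != -1);
            return true; // both streams ended at the same position
        }
    }
}
```

You would only call this for files that already landed in the same hash group, so the extra reads stay limited to likely duplicates.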
CodePudding user response:
Without mapFilesSize, your iterateOverDirectory method can become:
private static void iterateOverDirectory(File rootDir) throws Exception {
    for (File file : rootDir.listFiles()) {
        if (file.isDirectory()) {
            iterateOverDirectory(file);
        } else {
            mapFilesHash.computeIfAbsent(checksum(file), k -> new ArrayList<>()).add(file);
        }
    }
}
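For anyone who wants something to paste and run, here is a rough self-contained sketch of the same computeIfAbsent grouping. It is a variation, not the code from the answers above: it assumes Java 8+, walks the tree with Files.walk instead of recursion, swaps MD5 for SHA-256, and creates a fresh MessageDigest per file instead of sharing a static one. The names DuplicateFinder, sha256 and findDuplicates are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class: groups files by a hash of their content only.
final class DuplicateFinder {

    // Hex-encoded SHA-256 of the file content; a new digest per call avoids shared state.
    static String sha256(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Returns every group of two or more regular files under root with identical hashes.
    static List<List<Path>> findDuplicates(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, List<Path>> byHash = new HashMap<>();
        for (Path file : files) {
            byHash.computeIfAbsent(sha256(file), k -> new ArrayList<>()).add(file);
        }
        return byHash.values().stream()
                .filter(group -> group.size() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the root directory to scan
        for (List<Path> group : findDuplicates(Paths.get(args[0]))) {
            System.out.println(group);
        }
    }
}
```

Run it with the root directory as the only argument; each printed line is one group of files with identical content, regardless of their names or locations.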