Find MD5 hash of files inside a tar.gz file in Java without extracting it

I have a huge tar.gz file with lots of images in it, and I need to find the MD5 hash of each image. My code cannot hash images inside the tar file, although the same code works for normal folders and images. Is there any way to compute the hashes without extracting the tar?

public static String digestAndBuildImageEntry(Path filePath) throws NoSuchAlgorithmException {
    try {
        byte[] data = Files.readAllBytes(filePath);
        byte[] hashByte = MessageDigest.getInstance("MD5").digest(data);

        // byte[].toString() only prints the array's identity, so hex-encode the digest instead
        StringBuilder hash = new StringBuilder();
        for (byte b : hashByte) {
            hash.append(String.format("%02x", b));
        }
        return hash.toString();
    } catch (Exception ex) {
        return null;
    }
}

I get the exception below when I run this code:

Caused by: java.nio.file.FileSystemException: /Users/myuser/old/file.tar.gz/1.jpg: Not a directory
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
    at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
    at java.nio.file.Files.readAttributes(Files.java:1737)
    at java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219)
    at java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276)
    at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322)
    at java.nio.file.FileTreeIterator.<init>(FileTreeIterator.java:72)
    at java.nio.file.Files.walk(Files.java:3574)
    at java.nio.file.Files.walk(Files.java:3625)
    at com.example.demo.ImageDeduplication.listFiles(ImageDeduplication.java:78)
    at com.example.demo.SparkSQL.lambda$1(SparkSQL.java:82)
    at org.apache.spark.sql.UDFRegistration.$anonfun$register$352(UDFRegistration.scala:775)
    ... 17 more

The following paths worked:

/Users/myuser/old/1.jpg - worked
/Users/myuser/old/ - able to iterate and get all files inside the folder
/Users/myuser/old/file.tar.gz - gives the hash of the entire tar file

This path did not work:

/Users/myuser/old/file.tar.gz/1.jpg - fails with "Not a directory"

CodePudding user response：

Apache Commons Compress has classes that can stream the tar.gz format. Based on its examples and docs, it would be something like this:

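try (InputStream fi = Files.newInputStream(Paths.get("my.tar.gz"));
     InputStream bi = new BufferedInputStream(fi);
     InputStream gzi = new GzipCompressorInputStream(bi);
     TarArchiveInputStream tarInput = new TarArchiveInputStream(gzi)) {
    TarArchiveEntry entry;
    // getNextTarEntry() returns null once the archive is exhausted
    while ((entry = tarInput.getNextTarEntry()) != null) {
        if (!entry.isFile()) {
            continue; // skip directories and other non-file entries
        }
        // here you can read this file's content and do the MD5 computation;
        // reads on tarInput stop at the end of the current entry
        byte[] content = new byte[(int) entry.getSize()]; // assumes the entry is smaller than 2 GB
        int offset = 0;
        int n;
        while (offset < content.length
                && (n = tarInput.read(content, offset, content.length - offset)) != -1) {
            offset += n;
        }
    }
}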
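
The MD5 step in this answer does not need the whole entry in a byte array. As a minimal sketch using only the JDK, the helper below feeds a stream to MessageDigest in chunks and hex-encodes the result. The names Md5Streams and md5Hex are made up for illustration and are not part of Commons Compress.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of any library mentioned in this thread.
final class Md5Streams {
    private Md5Streams() {}

    /** Reads the stream until end-of-stream and returns the MD5 digest as 32 lowercase hex characters. */
    static String md5Hex(InputStream in) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform implementation must provide
            throw new IllegalStateException("MD5 is not available", e);
        }
        byte[] buffer = new byte[8192];
        int n;
        // Feed the digest chunk by chunk so memory use stays constant regardless of file size
        while ((n = in.read(buffer)) != -1) {
            md.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder(32);
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream sample = new ByteArrayInputStream("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(md5Hex(sample)); // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```

The helper does not close the stream it is given. Passing tarInput to it once per entry, right after getNextTarEntry(), should therefore hash just that entry, because TarArchiveInputStream reports end-of-stream at the entry boundary.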

Another option for quick access to files inside a tar.gz is to mount the archive as a virtual file system with Apache Commons VFS (commons-vfs).

CodePudding user response：

Recent versions of Apache Commons Compress have a TarFile class that provides random access to the entries and an InputStream for each of them. You can get the TarArchiveEntry list from getEntries() and open the corresponding stream with getInputStream(). The code below worked for me. Note that it decompresses the whole archive into memory first.

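import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarFile;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;

public static Map<String, String> getMD5HashMap(String path) throws Exception {
    Map<String, String> map = new ConcurrentHashMap<>();
    byte[] bytes;
    // Decompress the whole archive into memory; TarFile needs random access to the tar data
    try (InputStream in = new FileInputStream(path);
         GzipCompressorInputStream gzIn = new GzipCompressorInputStream(in)) {
        bytes = IOUtils.toByteArray(gzIn);
    }

    try (TarFile tarFile = new TarFile(bytes)) {
        for (TarArchiveEntry tarArchiveEntry : tarFile.getEntries()) {
            // Only regular files have content to hash
            if (tarArchiveEntry.isFile()) {
                try (InputStream is = tarFile.getInputStream(tarArchiveEntry);
                     BufferedInputStream buffered = new BufferedInputStream(is)) {
                    String hash = DigestUtils.md5Hex(buffered);
                    map.put(tarArchiveEntry.getName(), hash);
                    System.out.println(hash);
                }
            }
        }
    }
    return map;
}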
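
The method above reads the whole decompressed archive into memory before opening it, which may not suit a huge archive. For comparison, here is a rough, dependency-free sketch that walks the tar stream once and keeps only one small buffer in memory. It is not a replacement for Commons Compress. It assumes plain tar headers with names of at most 100 bytes, does not verify header checksums, and ignores PAX/GNU extensions such as long names or sizes above 8 GB. The class and method names (StreamingTarGzMd5, md5OfEntries) are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch: a tiny sequential tar reader, for simple archives only.
final class StreamingTarGzMd5 {
    private static final int BLOCK = 512; // tar headers and data are padded to 512-byte blocks

    private StreamingTarGzMd5() {}

    /** Maps entry name to MD5 hex for every regular file, reading the archive once, front to back. */
    static Map<String, String> md5OfEntries(InputStream tarGz) throws IOException {
        Map<String, String> hashes = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(new GZIPInputStream(new BufferedInputStream(tarGz)));
        byte[] header = new byte[BLOCK];
        byte[] buffer = new byte[8192];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                break; // archive ended without the usual trailing zero blocks
            }
            if (header[0] == 0) {
                break; // an empty name means we reached the zero blocks that end the archive
            }
            String name = text(header, 0, 100);
            long size = Long.parseLong(text(header, 124, 12).trim(), 8); // size is stored as octal text
            byte type = header[156];
            MessageDigest md = newMd5();
            // Consume exactly `size` bytes of entry data, feeding them to the digest
            for (long left = size; left > 0; ) {
                int n = in.read(buffer, 0, (int) Math.min(buffer.length, left));
                if (n < 0) {
                    throw new EOFException("Truncated entry: " + name);
                }
                md.update(buffer, 0, n);
                left -= n;
            }
            // Skip the padding that rounds the entry data up to a full block
            in.readFully(buffer, 0, (int) ((BLOCK - size % BLOCK) % BLOCK));
            if (type == '0' || type == 0) { // type flag '0' or NUL means a regular file
                hashes.put(name, String.format("%032x", new BigInteger(1, md.digest())));
            }
        }
        return hashes;
    }

    // Decodes a NUL-terminated header field
    private static String text(byte[] block, int offset, int length) {
        int end = offset;
        while (end < offset + length && block[end] != 0) {
            end++;
        }
        return new String(block, offset, end - offset, StandardCharsets.UTF_8);
    }

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

A caller would pass something like Files.newInputStream(Paths.get(path)) and close that stream afterwards. For archives produced by arbitrary tools, a library reader such as TarArchiveInputStream is the safer choice, since it handles the extensions this sketch skips.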