python 3.9 - unable to get correct sha1 hash for multiple files in loop


Using the code from the solution in the link below, I am not getting the correct SHA1 hash for the second and subsequent files in the loop. Here is why I say the hashes are incorrect:

Using the code given below:

  • CORRECT -> when I generate the SHA1 hash for the same file individually (by executing the code twice, once per location), I get the same SHA1 hash for both copies, and

  • INCORRECT -> when I generate hashes for multiple files, including this file, in a single execution, I get a different hash for this file.

Please advise whether anything in this code needs to be modified, or whether I should take a different approach.

Code written by referring to the link given at the bottom:

import glob
import hashlib
import os

path = input("Please provide path to search for file pattern (search will include this path's sub-directories also): ")
filepattern = input("Please provide the file pattern to search for in the given path. Example: *.jar, *abc*.jar: ")
assert os.path.exists(path), "I did not find the path " + str(path)
path = path.rstrip("/")
tocheck = f'{path}/**/{filepattern}'
hash_obj = hashlib.sha1()

searched_file_list = glob.iglob(tocheck, recursive=True)
for file in searched_file_list:
    print(f'{file}')
    try:
        checksum = ""
        file_for_sha1 = ""
        file_for_sha1 = open(file, 'rb')
        hash_obj.update(file_for_sha1.read())
        checksum = hash_obj.hexdigest()
        print(f'sha1 for file ({file})= {checksum}')
    finally:
        file_for_sha1.close()

Example file -> abc.txt created at /home/test/git/reader/cabin/ with the text: Hi This is to test SHA1 code.

This file was then copied to one more location, i.e. /home/test/git/reader/check/cabin/

Linux console output showing the same SHA1 for both files:

:~/git/reader/check/cabin$ sha1sum abc.txt
fc4db67f46711b2c18bd133abd67965649edfffc  abc.txt
:~/git/reader/check/cabin$ cd ../..
:~/git/reader$ cd cabin/
:~/git/reader/cabin$ sha1sum abc.txt
fc4db67f46711b2c18bd133abd67965649edfffc  abc.txt

Code run in a loop in a single execution - generating two different SHA1 hashes for this abc.txt file from the two locations:

  • sha1 for file (/home/test/git/reader/cabin/abc.txt)= fc4db67f46711b2c18bd133abd67965649edfffc
  • sha1 for file (/home/test/git/reader/check/cabin/abc.txt)= a4691598ea25ea4c7404369a685725115c7f305b

Code executed twice for the same file, giving the respective location each time (i.e. one file at a time), generating the same, correct SHA1 hash:

  • sha1 for file (/home/test/git/reader/check/cabin/abc.txt)= fc4db67f46711b2c18bd133abd67965649edfffc

  • sha1 for file (/home/test/git/reader/cabin/abc.txt)= fc4db67f46711b2c18bd133abd67965649edfffc

Referred code link -> Generating one MD5/SHA1 checksum of multiple files in Python

CodePudding user response:

To quote the docs on the update method:

Repeated calls are equivalent to a single call with the concatenation of all the arguments: m.update(a); m.update(b) is equivalent to m.update(a+b).

So instead of computing the hash of each file separately, you are computing the hash of all the files read so far, concatenated. That is what the question you linked is doing - a single hash covering multiple files. You want a hash for each file, so instead of calling update multiple times on the same hash_obj instance, create a new instance for each file. So
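A minimal sketch of that equivalence, using two hypothetical byte strings to stand in for the contents of two files: after the second update call the digest matches the hash of the concatenation, not the hash of the second input alone.

```python
import hashlib

# Stand-ins for the contents of two files.
a = b"first file"
b = b"second file"

h = hashlib.sha1()
h.update(a)
first_digest = h.hexdigest()   # hash of a alone

h.update(b)
second_digest = h.hexdigest()  # hash of a + b, NOT of b alone

assert first_digest == hashlib.sha1(a).hexdigest()
assert second_digest == hashlib.sha1(a + b).hexdigest()
assert second_digest != hashlib.sha1(b).hexdigest()
```

This is exactly the situation in the loop above: the second file's reported "hash" is really the hash of both files' bytes combined.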

hash_obj = hashlib.sha1()
searched_file_list = glob.iglob(tocheck, recursive=True)
for file in searched_file_list:
    print(f'{file}')
    try:
        ...
        hash_obj.update(file_for_sha1.read())

will become

searched_file_list = glob.iglob(tocheck, recursive=True)
for file in searched_file_list:
    print(f'{file}')
    try:
        hash_obj = hashlib.sha1()
        ...
        hash_obj.update(file_for_sha1.read())
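Putting the fix together, one possible complete version is sketched below. It also reads each file in chunks rather than with a single read(), which avoids loading large files (e.g. big jars) into memory at once; the glob pattern here is a hypothetical example standing in for the `tocheck` value built from user input in the question.

```python
import glob
import hashlib

def sha1_of_file(filepath, chunk_size=65536):
    """Return the SHA1 hex digest of a single file, reading it in chunks."""
    hash_obj = hashlib.sha1()  # fresh instance per file, so digests don't accumulate
    with open(filepath, 'rb') as f:
        while chunk := f.read(chunk_size):
            hash_obj.update(chunk)
    return hash_obj.hexdigest()

# Hypothetical pattern; in the question this is built from the user's path and filepattern.
for file in glob.iglob('/home/test/git/reader/**/*.txt', recursive=True):
    print(f'sha1 for file ({file})= {sha1_of_file(file)}')
```

The `with open(...)` block also replaces the manual try/finally close from the original code.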