I wrote the code below by following the solution in the linked question (link at the bottom), but when it hashes several files in a loop, every file from the second one onwards gets the wrong SHA1 hash. I say "wrong" because:
CORRECT -> When I generate the SHA1 hash for each file individually (running the code once per file), I get the same hash that sha1sum reports.
INCORRECT -> When I generate hashes for multiple files in a single run that includes this file, I get a different hash for it.
Is there anything I should change in this code, or do I need a different approach?
Code, written with reference to the link at the bottom ->
import glob
import hashlib
import os

path = input("Please provide path to search for file pattern (search will be in this path's sub-directories also): ")
filepattern = input("Please provide the file pattern to search in given path. Example *.jar, *abc*.jar.: ")
assert os.path.exists(path), "I did not find the path " + str(path)
path = path.rstrip("/")
tocheck = f'{path}/**/{filepattern}'
hash_obj = hashlib.sha1()
searched_file_list = glob.iglob(tocheck, recursive=True)
for file in searched_file_list:
    print(f'{file}')
    try:
        checksum = ""
        file_for_sha1 = ""
        file_for_sha1 = open(file, 'rb')
        hash_obj.update(file_for_sha1.read())
        checksum = hash_obj.hexdigest()
        print(f'sha1 for file ({file})= {checksum}')
    finally:
        file_for_sha1.close()
Example file -> abc.txt, created at /home/test/git/reader/cabin/ with the text below:
Hi This is to test SHA1 code.
This file was then copied to a second location, /home/test/git/reader/check/cabin/
Linux console output showing the same SHA1 for both files:
:~/git/reader/check/cabin$ sha1sum abc.txt
fc4db67f46711b2c18bd133abd67965649edfffc abc.txt
:~/git/reader/check/cabin$ cd ../..
:~/git/reader$ cd cabin/
:~/git/reader/cabin$ sha1sum abc.txt
fc4db67f46711b2c18bd133abd67965649edfffc abc.txt
Running the code once over both locations (the loop in a single execution) produces two different SHA1 hashes for abc.txt:
- sha1 for file (/home/test/git/reader/cabin/abc.txt)= fc4db67f46711b2c18bd133abd67965649edfffc
- sha1 for file (/home/test/git/reader/check/cabin/abc.txt)= a4691598ea25ea4c7404369a685725115c7f305b
Running the code twice, once per location (one file at a time), produces the same, correct SHA1 hash for both:
sha1 for file (/home/test/git/reader/check/cabin/abc.txt)= fc4db67f46711b2c18bd133abd67965649edfffc
sha1 for file (/home/test/git/reader/cabin/abc.txt)= fc4db67f46711b2c18bd133abd67965649edfffc
Linked question -> Generating one MD5/SHA1 checksum of multiple files in Python
CodePudding user response:
To quote the docs on the update method:
Repeated calls are equivalent to a single call with the concatenation of all the arguments: m.update(a); m.update(b) is equivalent to m.update(a+b).
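A small sketch of what that equivalence means in practice. The two byte strings are made up purely for illustration and stand in for the contents of two files:

```python
import hashlib

# Hypothetical stand-ins for the contents of two files.
a = b"contents of the first file"
b = b"contents of the second file"

# One hash object that is fed both inputs, as in the question's loop.
running = hashlib.sha1()
running.update(a)
running.update(b)

# Digest of the concatenation, and digest of the second input on its own.
combined_digest = hashlib.sha1(a + b).hexdigest()
b_only_digest = hashlib.sha1(b).hexdigest()

same_as_concatenation = running.hexdigest() == combined_digest
same_as_b_alone = running.hexdigest() == b_only_digest

print(same_as_concatenation)  # True
print(same_as_b_alone)        # False
```

This matches the output in the question: the first file hashes correctly because nothing has been fed in before it, while the second digest also covers the first file's bytes.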
So instead of finding the hash of each file separately, you're finding the hash of the files concatenated: the second digest covers the first file's contents followed by the second's. That is what the question you've linked is doing, a single hash for multiple files. You want a hash for each file, so instead of calling the update method multiple times on the same hash_obj instance, create a new instance for each file, so
hash_obj = hashlib.sha1()
searched_file_list = glob.iglob(tocheck, recursive=True)
for file in searched_file_list:
    print(f'{file}')
    try:
        ...
        hash_obj.update(file_for_sha1.read())
will become
searched_file_list = glob.iglob(tocheck, recursive=True)
for file in searched_file_list:
    print(f'{file}')
    try:
        hash_obj = hashlib.sha1()
        ...
        hash_obj.update(file_for_sha1.read())
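Putting it together, here is one possible way to structure the whole thing. This is only a sketch: the helper names sha1_of_file and sha1_for_matches and the 64 KiB chunk size are my own choices, not something from your code, and reading in chunks is optional (it just avoids loading large files into memory at once).

```python
import glob
import hashlib
import os


def sha1_of_file(file_path, chunk_size=65536):
    """Return the SHA1 hex digest of a single file (hypothetical helper)."""
    # A new hash object per call, so no state carries over between files.
    hash_obj = hashlib.sha1()
    # The with block closes the file even if reading raises an error.
    with open(file_path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hash_obj.update(chunk)
    return hash_obj.hexdigest()


def sha1_for_matches(path, filepattern):
    """Map every regular file under path matching filepattern to its digest."""
    tocheck = os.path.join(path, "**", filepattern)
    return {
        file: sha1_of_file(file)
        for file in glob.iglob(tocheck, recursive=True)
        if os.path.isfile(file)  # skip directories that match the pattern
    }
```

Calling sha1_for_matches(path, filepattern) with the two values you read from input() gives you a dict of file path to checksum, which you can then print in the same format as before. Both copies of abc.txt should then report the same hash as sha1sum.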