I want to check if the content of a file changed. My plan is to add a hash in the last line of the file.
Later on, I can read the file, hash it (hash everything except the last line) and compare it to the last line of the file (initial hash).
I cannot use the last modified date/time. I need to use a hash or some other kind of code stored inside the file. I use C# to code the app. What is the most reasonable/easiest way of doing this? I don't know which of the following would be a good match for me: SHA-1/2/3, CRC-16/32/64, or MD5? I do not need the method to be quick or secure.
Thank you!
CodePudding user response:
It seems to me that you're going to have a chicken-and-egg problem if you store the hash inside the file. You won't know the hash until you hash the file. But when you hash the file and add that value to the end of the file, the hash will change. So clearly you need to hash the file without including the actual hash itself. You already said this, but I'm repeating it to clarify my next points.
The trick is that the usual one-shot hash/checksum calls give you the sum of the entire file (or byte stream, or whatever) in one go. They don't tend to hand you a "running total" along the way. That means you'll need to separate the hash from the rest of the content before testing whether it has changed, unless you write a custom hashing tool yourself.
This is of course possible with any hashing algorithm, but the fact that you are asking this question leads me to believe that you probably won't want the hassle of writing a custom (e.g.) SHA256 tool specifically designed to stop when it reaches the stored hash.
To my eye, you have three choices:
1. Store the hash separately from your file, or at a minimum write a temporary file which does not contain the hash, and hash that. This would allow you to use a hashing tool already built into C# without any modification or fancy trickery. I know this does not exactly match your requirements as listed, but it's an option that you might consider.
2. You don't mention the size of the file, but if it is sufficiently small, you could simply slurp it up into memory minus the bytes of the hash, hash your in-memory data using a built-in tool, and then compare. This would again allow you to use built-in tools.
3. Use a custom hashing tool that purposely stops when it reaches the end of the "interesting" data. If that's the case, I would unquestionably recommend a non-secure hashing method like CRC, simply because it will be so much easier to understand and modify the code yourself (it is much simpler code after all). You already mention that you don't need it to be secure, so this would meet your requirements.
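The second choice (read everything into memory, split off the hash line, hash the rest with a built-in class) could look roughly like the sketch below. It is only an illustration under a few assumptions: the file is text and small enough to hold in a string, SHA-256 stands in for whichever algorithm you pick, and the names `TrailingHash`, `Seal`, `Verify` and the `#sha256:` marker are made up for this example.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class TrailingHash
{
    // Hypothetical marker so that a missing hash line can be told apart from content.
    private const string Prefix = "#sha256:";

    // Hex-encoded SHA-256 of the content, taken over its UTF-8 bytes.
    private static string ComputeHash(string content)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(content));
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }

    // Returns the content with a hash line appended as the last line.
    public static string Seal(string content)
    {
        if (!content.EndsWith("\n", StringComparison.Ordinal)) content += "\n";
        return content + Prefix + ComputeHash(content);
    }

    // True only if the last line is a hash line that matches everything before it.
    public static bool Verify(string sealedText)
    {
        // Ignore newlines that an editor may have added after the hash line.
        sealedText = sealedText.TrimEnd('\r', '\n');
        int cut = sealedText.LastIndexOf('\n');
        if (cut < 0) return false; // a single line cannot hold content plus hash

        string content = sealedText.Substring(0, cut + 1);
        string lastLine = sealedText.Substring(cut + 1);
        if (!lastLine.StartsWith(Prefix, StringComparison.Ordinal))
            return false; // the hash line is missing or was overwritten

        string stored = lastLine.Substring(Prefix.Length);
        return string.Equals(stored, ComputeHash(content), StringComparison.OrdinalIgnoreCase);
    }
}
```

With files this would be used along the lines of `File.WriteAllText(path, TrailingHash.Seal(File.ReadAllText(path)))` when saving and `TrailingHash.Verify(File.ReadAllText(path))` when checking. Note that the hash here is taken over the decoded text rather than the raw bytes, so it will not notice a change of encoding alone.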
If you decide to go with option #3, then I would suggest schlepping over to Rosetta Code to search for a CRC algorithm in C#. From there you can read your file, leave out the bytes of the hash, and send the remainder through your hashing algorithm. The algorithm listed there processes all bytes at once, but it would be trivial to turn the accumulator into a parameter so that you could send data in chunks. This would allow you to work on an arbitrarily large file in situ.
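To give an idea of what "turning the accumulator into a parameter" means, here is a minimal sketch. It is not the Rosetta Code listing itself but a common table-driven CRC-32 (reflected polynomial 0xEDB88320, the variant used by zip and PNG). The names `Crc32`, `Update` and `OfPrefix` are invented for this example, and it assumes you already know how many bytes precede the stored hash.

```csharp
using System;
using System.IO;

public static class Crc32
{
    private static readonly uint[] Table = BuildTable();

    // Precompute the CRC of every possible byte value.
    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    // Folds 'count' bytes into a running CRC. Pass 0 for the first chunk
    // and the previous return value for every chunk after that.
    public static uint Update(uint crc, byte[] buffer, int count)
    {
        crc = ~crc;
        for (int i = 0; i < count; i++)
            crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
        return ~crc;
    }

    // CRC of the first 'length' bytes of the stream, read in chunks so the
    // file never has to fit in memory. Anything after 'length' is ignored.
    public static uint OfPrefix(Stream stream, long length)
    {
        var buffer = new byte[81920];
        uint crc = 0;
        while (length > 0)
        {
            int want = (int)Math.Min(buffer.Length, length);
            int read = stream.Read(buffer, 0, want);
            if (read == 0) break; // stream ended earlier than expected
            crc = Update(crc, buffer, read);
            length -= read;
        }
        return crc;
    }
}
```

If the stored hash line has a fixed size, `length` is simply the file size minus that size; otherwise you would have to locate the start of the last line first.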
[EDIT] FWIW, I have already gone down a similar path. In my case I wrote a custom tool that lets us incrementally copy extremely large files over the WAN, files so big that we had trouble getting them to copy safely. Proper use of the tool is to remote into the source server, pre-run a CRC32 check, and save the sums at arbitrary intervals. Then one copies the CRC32 sums to the client side and starts copying the file. Should the copy get stopped in the middle, or the target get corrupted somehow, one simply supplies the name of the local partial, the remote source, the file containing the CRC32 sums, and finally a target. The program starts copying from the local partial and only switches to copying from the remote once a partial CRC32 sum mismatch is found. Our problem was that simply resuming from the end of the bytes already copied did not always work, which was frustrating since the copy takes so long. My teammates and I joked several times that we might try USB drives and homing pigeons...
CodePudding user response:
What are you trying to protect yourself against?
Accidental change? Then your approach sounds fine. (Make sure to also handle the case where the last line with the hash was itself deleted by accident.)
Malicious change? Then you'd need to hash the file content together with a secret key, and use a secure hashing algorithm. MD5 is good for accidental changes because it is fast, but cryptographically it is considered broken.
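For the malicious case, the usual way to combine content and key is an HMAC rather than literally appending the key to the content, so that is what this sketch uses (`HMACSHA256` from `System.Security.Cryptography`). The class and method names are invented for illustration, and where the key comes from and how it is kept secret is left open here.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class KeyedHash
{
    // Hex-encoded HMAC-SHA256 of the content (as UTF-8) under the given key.
    public static string Compute(byte[] key, string content)
    {
        using (var hmac = new HMACSHA256(key))
        {
            byte[] tag = hmac.ComputeHash(Encoding.UTF8.GetBytes(content));
            return BitConverter.ToString(tag).Replace("-", "");
        }
    }

    // True if the stored hex tag matches the content under this key.
    // A plain string comparison is used for brevity; it is not constant-time.
    public static bool Matches(byte[] key, string content, string storedHex)
    {
        return string.Equals(Compute(key, content), storedHex, StringComparison.OrdinalIgnoreCase);
    }
}
```

Someone who edits the file but does not know the key cannot produce a matching last line, which a plain MD5 or CRC cannot guarantee.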