I need to track changes to files, but I cannot think of a way

Time:04-06

I have an idea that I am working on. I am trying to create a Windows mini-filter driver that virtualizes changes made to files by certain processes. I do this by capturing the writes and redirecting them to a file in a virtualized location. Here is the issue: when the process reads, it needs to get unaltered data for the parts of the file it has not written to, but altered data for the parts it has written to. How do I track the segments of the file that have been altered in an efficient way? I seem to remember a way to use a bitmask to map file segments, but I may be misremembering. Anyway, any help would be greatly appreciated.

CodePudding user response:

Two solutions:

  1. Simply copy the original file to virtualized storage, and use only this copy. For small files, it will probably be the best and fastest solution. For example, any file smaller than 65536 bytes could be copied in full (use a power of two for the limit in any case). If the file grows above the limit, see solution 2.

  2. For big files, keep only the overwritten segments in virtualized storage, and use them according to the current file position when needed. The easiest way is to split the file into 65536-byte chunks. You get the chunk number by shifting the file position 16 bits to the right, and the position within the chunk by masking off everything but the lower 16 bits.

Example:

file_position = 165 232 360
chunk_number = file_position >> 16 (== 2 521)
chunk_pos = file_position & 0xFFFF (== 16 104)
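
The same arithmetic as a small C sketch. The names CHUNK_SHIFT, chunk_number, chunk_pos and chunk_start are only illustrative, and this is plain user-mode C rather than driver code:

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SHIFT 16u                            /* 2^16 = 65536-byte chunks */
#define CHUNK_SIZE  ((uint64_t)1 << CHUNK_SHIFT)
#define CHUNK_MASK  (CHUNK_SIZE - 1)               /* 0xFFFF */

/* Index of the chunk that contains the given byte offset. */
static inline uint64_t chunk_number(uint64_t file_position)
{
    return file_position >> CHUNK_SHIFT;
}

/* Offset of that byte inside its chunk. */
static inline uint32_t chunk_pos(uint64_t file_position)
{
    return (uint32_t)(file_position & CHUNK_MASK);
}

/* File offset at which chunk n begins (inverse of chunk_number). */
static inline uint64_t chunk_start(uint64_t n)
{
    assert(n <= (UINT64_MAX >> CHUNK_SHIFT));      /* shifting must not overflow */
    return n << CHUNK_SHIFT;
}
```

With the position from the example, chunk_number(165232360) gives 2521 and chunk_pos(165232360) gives 16104.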

So, your virtualized storage becomes a directory that stores chunks with trivial names (chunk #2521 = 2521.chunk, for example).
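
A possible way to build such a chunk name, as a user-mode sketch. The helper chunk_path and the directory name "vstore" used below are made up for illustration; inside a mini-filter you would build a UNICODE_STRING with the kernel string routines instead of calling snprintf:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Writes "<dir>\<chunk_number>.chunk" into out.
 * Returns the length of the path, or -1 if it did not fit in out_size bytes. */
static int chunk_path(char *out, size_t out_size, const char *dir,
                      uint64_t chunk_number)
{
    assert(out != NULL && dir != NULL);
    int n = snprintf(out, out_size, "%s\\%" PRIu64 ".chunk", dir, chunk_number);
    return (n >= 0 && (size_t)n < out_size) ? n : -1;
}
```

For example, chunk_path(buf, sizeof buf, "vstore", 2521) fills buf with vstore\2521.chunk.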

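When a write occurs, first copy the original data of each affected chunk to a new chunk in virtualized storage (if that chunk does not exist there yet), then let the application write into the chunk. Reads follow the same rule: if a chunk exists in virtualized storage, read from it; otherwise read from the original file.

If the file grows, simply add chunks that exist only in virtualized storage.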
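
To make the copy-on-write step concrete, here is a minimal in-memory sketch. It is only a model under stated assumptions: the "original file" is a byte buffer, the chunk files are heap blocks, file growth is not handled, and all names (overlay_t, overlay_write, overlay_read, ...) are invented for the example. The dirty bitmap is the bitmask idea from the question: one bit per chunk, so a read can tell where to look without probing the virtualized directory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SHIFT 16u
#define CHUNK_SIZE  ((size_t)1 << CHUNK_SHIFT)    /* 65536 bytes */

typedef struct {
    const uint8_t *original;    /* stands in for the untouched real file */
    size_t         size;        /* file size; growth is not modelled here */
    size_t         chunk_count;
    uint8_t      **chunks;      /* stands in for the <n>.chunk files */
    uint8_t       *dirty;       /* bitmap: bit n set => chunk n was overwritten */
} overlay_t;

static int overlay_init(overlay_t *ov, const uint8_t *original, size_t size)
{
    assert(ov != NULL && original != NULL);
    ov->original = original;
    ov->size = size;
    ov->chunk_count = (size >> CHUNK_SHIFT) + ((size & (CHUNK_SIZE - 1)) != 0);
    ov->chunks = calloc(ov->chunk_count + 1, sizeof *ov->chunks);
    ov->dirty = calloc(ov->chunk_count / 8 + 1, 1);
    return (ov->chunks && ov->dirty) ? 0 : -1;
}

static void overlay_free(overlay_t *ov)
{
    for (size_t n = 0; ov->chunks && n < ov->chunk_count; n++)
        free(ov->chunks[n]);
    free(ov->chunks);
    free(ov->dirty);
}

static int is_dirty(const overlay_t *ov, size_t n)
{
    return (ov->dirty[n >> 3] >> (n & 7)) & 1;
}

/* Copy-on-write: the first write to a chunk seeds it with the original bytes. */
static uint8_t *chunk_for_write(overlay_t *ov, size_t n)
{
    assert(n < ov->chunk_count);
    if (!is_dirty(ov, n)) {
        size_t start = n << CHUNK_SHIFT;
        size_t left = ov->size - start;
        ov->chunks[n] = calloc(1, CHUNK_SIZE);
        if (!ov->chunks[n])
            return NULL;
        memcpy(ov->chunks[n], ov->original + start,
               left < CHUNK_SIZE ? left : CHUNK_SIZE);
        ov->dirty[n >> 3] |= (uint8_t)(1u << (n & 7));
    }
    return ov->chunks[n];
}

/* Redirected write: split the request at chunk boundaries. */
static int overlay_write(overlay_t *ov, size_t pos, const void *buf, size_t len)
{
    const uint8_t *src = buf;
    if (pos > ov->size || len > ov->size - pos)
        return -1;
    while (len > 0) {
        size_t off = pos & (CHUNK_SIZE - 1);
        size_t step = CHUNK_SIZE - off < len ? CHUNK_SIZE - off : len;
        uint8_t *chunk = chunk_for_write(ov, pos >> CHUNK_SHIFT);
        if (!chunk)
            return -1;
        memcpy(chunk + off, src, step);
        pos += step; src += step; len -= step;
    }
    return 0;
}

/* Read: dirty chunks come from the overlay, clean ones from the original. */
static int overlay_read(const overlay_t *ov, size_t pos, void *buf, size_t len)
{
    uint8_t *dst = buf;
    if (pos > ov->size || len > ov->size - pos)
        return -1;
    while (len > 0) {
        size_t n = pos >> CHUNK_SHIFT;
        size_t off = pos & (CHUNK_SIZE - 1);
        size_t step = CHUNK_SIZE - off < len ? CHUNK_SIZE - off : len;
        memcpy(dst, is_dirty(ov, n) ? ov->chunks[n] + off : ov->original + pos, step);
        pos += step; dst += step; len -= step;
    }
    return 0;
}
```

A write that straddles two chunks marks both of them dirty, and a later read over the same range is stitched together from the overlay and the original. The original buffer is never modified.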


It's not perfect (you could use delta chunks instead of full ones to save disk space), but it's a good start that can be optimized later. Also, it's quite easy to add versions, and to keep track of:

  • Various applications that use the file (keep multiple virtualized storages),
  • Successive launches (run #1 modifies start of file, run #2 modifies end of file, you keep both virtualizations and you can easily "revert" the last launch).