Why isn't it possible to remove characters from the middle of a file in place?

Time:07-05

Say I have a file with a few lines of text in it:

a
b
c
d
e

and I want to remove the b character, but without rewriting the whole file. As far as I can tell, this isn't possible. Why?


I'm trying to better understand how something like SQLite works where the entire database is contained in one "file", yet obviously not all operations are appending to the file. This is my current understanding of the limitations:

  • You can append data to a file without rewriting the whole file
  • You can overwrite data in the middle of a file as long as you're not changing the number of bytes in the file
  • You can't arbitrarily remove bytes from a file without rewriting the whole file

Why do these limitations exist? Is it the filesystem/OS? Are there other platforms where this is possible? If SQLite is a useful example, answers within the context of SQLite (and how it deals with these limitations) would be great!


Reading material that led me to this point:

https://www.sqlite.org/fileformat.html

Why does SQLite store hundreds of null bytes?

C# overwrite first few lines of a text file with constant time

Replace sequence of bytes in binary file

CodePudding user response:

At the operating system level, a disk (SSD or spinning rust) is divided into logical blocks. Windows and Linux both typically default to 4 KiB blocks. The file system does not allocate anything smaller than a block, so a 1-byte file occupies (at least) one block. Most operating systems let you choose the block size when formatting the disk or partition.
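
To see the block granularity described above, here is a minimal Python sketch, assuming a POSIX system (`os.statvfs` is not available on Windows). The helper name `allocation_info` is made up for illustration, and the 512-byte unit for `st_blocks` is the usual Linux convention rather than a guarantee.

```python
import os

def allocation_info(path):
    """Compare a file's logical size with what the file system reports.

    Assumes a POSIX system. st_blocks is counted in 512-byte units on
    Linux; other platforms may use a different unit.
    """
    st = os.stat(path)
    vfs = os.statvfs(path)
    return {
        "logical_size": st.st_size,           # bytes the file claims to hold
        "fs_block_size": vfs.f_frsize,        # fundamental file system block size
        "allocated_bytes": st.st_blocks * 512,  # space actually reserved on disk
    }
```

On many file systems a 1-byte file will report `allocated_bytes` equal to one full block, although file systems that store tiny files inline may report less.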

To your direct points:

  • Appending only rewrites the whole file if the file is smaller than a block, because that one block is the whole file. With a bigger file you are still likely to have to rewrite the last block when appending.
  • If your file is larger than a single block, you can rewrite data in the middle. If the new data is the same size as the old, that is likely a single block write. If the size changes, everything after that point shifts, so blocks have to be rearranged or rewritten.
  • Removing bytes shifts everything that follows them, so how much gets rewritten depends on where in the file you remove them. Remove a byte near the end and it's possible that only a single block needs to be rewritten. Remove a byte at the beginning and the whole file is likely to be rewritten.
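
The last point can be sketched in Python. This is only an illustration of the "shift the tail left, then truncate" approach that an application has to use when the file system offers no way to delete a byte range; the function name `remove_bytes` and its parameters are made up for this example.

```python
import os

def remove_bytes(path, offset, count, chunk_size=64 * 1024):
    """Delete `count` bytes starting at `offset` by shifting the tail left.

    Returns the number of bytes that had to be rewritten, which is
    everything between the end of the removed range and the end of the file.
    """
    size = os.path.getsize(path)
    if offset < 0 or count < 0 or offset + count > size:
        raise ValueError("range lies outside the file")

    moved = 0
    with open(path, "r+b") as f:
        read_pos = offset + count   # first byte that survives after the gap
        write_pos = offset          # where that byte has to end up
        while read_pos < size:
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
            moved += len(chunk)
        # Drop the now-duplicated bytes at the end of the file.
        f.truncate(size - count)
    return moved
```

With the five-line file from the question, `remove_bytes(path, 2, 2)` removes `b` and its newline and has to rewrite the six bytes that follow. The same call near the start of a multi-gigabyte file would rewrite nearly all of it, while a call near the end touches only the last few bytes.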

Some databases have historically handled this by using "raw" partitions. I've used raw partitions (unformatted partitions on the disk) for a Sybase database in the past. This means that the database gets to decide on the block size, to an extent: the underlying hardware may still restrict the size of the blocks, either up or down.

If you're not using raw partitions then you're dependent on the O/S to do the I/O for you. It should be fast enough that you don't care about the blocks. At the lowest level the DB has its own internal structure that it maintains for the data and indices. Each DB vendor does this their own way, and usually it can be tweaked for your use case. If you have many small rows then it might make sense to format your disk with smaller blocks so that fewer bytes are transferred with updates or inserts. Conversely, large amounts of data may benefit from 8k or 16k block sizes.
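
Since the question asks about SQLite specifically, here is a small sketch using Python's built-in `sqlite3` module. It shows one effect of that internal structure: deleting rows does not remove bytes from the middle of the file. SQLite marks the freed pages as reusable, and the file only shrinks when it is rebuilt with `VACUUM`. The table name, row sizes, and the helper names `page_stats` and `demo` are arbitrary choices for this example, and `auto_vacuum` is explicitly switched off so the behavior does not depend on how SQLite was built.

```python
import sqlite3

def page_stats(conn):
    """Return (total pages in the file, pages on the free list)."""
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return page_count, free_pages

def demo(db_path):
    conn = sqlite3.connect(db_path)
    # Must be set before the first table is created to take effect.
    conn.execute("PRAGMA auto_vacuum = NONE")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload BLOB)")
    conn.executemany(
        "INSERT INTO t (payload) VALUES (?)",
        ((b"x" * 1000,) for _ in range(1000)),
    )
    conn.commit()
    before = page_stats(conn)

    # Delete a contiguous run of rows so that whole pages become empty.
    conn.execute("DELETE FROM t WHERE id BETWEEN 101 AND 900")
    conn.commit()
    after_delete = page_stats(conn)

    # VACUUM rewrites the database into a fresh, compact file.
    conn.execute("VACUUM")
    after_vacuum = page_stats(conn)

    conn.close()
    return before, after_delete, after_vacuum
```

After the `DELETE`, the page count is unchanged and the free list is non-empty. Only the `VACUUM`, which rewrites the whole database, brings the page count down. In other words, SQLite works within the limitations listed in the question by overwriting fixed-size pages in place and reusing free ones, not by cutting bytes out of the file.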
