Why don't most files like JPEG or PDF use just ASCII characters for encoding?

Time:02-01

Whenever we open a JPEG or PDF file in a text editor, we see strange symbols that are not ASCII. Isn't ASCII the most efficient encoding, since its limited set of possible characters takes less space?

File opened in Terminal

I was working with a plocate database file on Linux and found something similar.

CodePudding user response:

Isn't ASCII the most efficient encoding, since its limited set of possible characters takes less space?

Not at all. Where did you get that idea from?

ASCII characters are 7 bits long, but hardware stores data in 8-bit bytes, so each ASCII character occupies 8 bits, with the top bit always 0. Furthermore, ASCII includes a number of control characters that can cause issues in some situations. Therefore, the most common way of representing binary data in ASCII (Base64) uses only 64 printable characters, which is 6 bits per character. This means that in order to encode 3 bytes (3 × 8 = 24 bits) of data you need 4 ASCII characters (4 × 6 = 24 bits). Those 4 ASCII characters are then stored using 4 bytes on disk. Hence, converting a file to ASCII this way increases disk usage by 33%.

You can test this with the base64 command:

base64 pic.jpg > b64_jpeg.txt
ls -lh pic.jpg b64_jpeg.txt
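The same 4-for-3 ratio can be checked without a picture at hand. This is a minimal Python sketch using the standard `base64` module; the 3000 random bytes are only a stand-in for binary file content, not a real JPEG.

```python
import base64
import os

# Hypothetical stand-in for binary file content: 3000 random bytes.
raw = os.urandom(3000)

# Base64 maps every 3 input bytes to 4 output characters.
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))             # 3000 4000
print(f"{len(encoded) / len(raw):.3f}")   # 1.333

# Every output byte is a plain ASCII code (below 128).
print(all(b < 128 for b in encoded))      # True
```

When the input length is not a multiple of 3, Base64 adds `=` padding, so the overhead is slightly above 33% for very small inputs.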

Of course, you could try an ASCII encoding other than standard Base64 that uses all 7 bits available in ASCII. You would still get only 7 bits of data per byte on disk, and thus about a 14% increase in disk usage for the same data (8/7 ≈ 1.14).
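To make the 8/7 figure concrete, here is a rough Python sketch of such a 7-bits-per-byte packing. The functions `pack7` and `unpack7` are hypothetical names made up for this illustration, not a standard encoding, and the output freely contains ASCII control characters, so it would not be safe to treat as text.

```python
import os


def pack7(data: bytes) -> bytes:
    """Re-encode data so every output byte carries 7 payload bits (top bit 0)."""
    bits = int.from_bytes(data, "big")
    nbits = len(data) * 8
    # Pad with zero bits on the right so the total is a multiple of 7.
    pad = (-nbits) % 7
    bits <<= pad
    nbits += pad
    out = bytearray()
    # Emit 7-bit groups from most significant to least significant.
    for shift in range(nbits - 7, -1, -7):
        out.append((bits >> shift) & 0x7F)
    return bytes(out)


def unpack7(packed: bytes, original_len: int) -> bytes:
    """Inverse of pack7; the original length is needed to drop the padding."""
    bits = 0
    for b in packed:
        bits = (bits << 7) | b
    pad = len(packed) * 7 - original_len * 8
    bits >>= pad
    return bits.to_bytes(original_len, "big")


# Hypothetical stand-in for binary file content: 7000 random bytes.
data = os.urandom(7000)
packed = pack7(data)

print(len(data), len(packed))            # 7000 8000
print(f"{len(packed) / len(data):.3f}")  # 1.143
print(unpack7(packed, len(data)) == data)  # True
```

Even in this best case, the ASCII-only file is larger than the raw binary one, which is the point both answers make.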

CodePudding user response:

All modern storage uses 8-bit bytes. ASCII is an obsolete 7-bit standard, so storing only ASCII would take 8/7 as much storage (about 14% more).
