I am working on a Huffman project to compress text files. I was able to generate the required codes. I read the whole file and stored the resulting codes, one '0' or '1' character per bit, in a vector<char> variable. I also padded the encoded vector:
vector<char> padding(vector<char> text)
{
    int num = text.size();
    unsigned int pad_value = 32-(num%32);
    for(int i=0;i<pad_value;i++){
        text.push_back('0');
    }
    string pad_info = bitset<32>(pad_value).to_string();
    for(int i=pad_info.length()-1;i>=0;i--){
        text.insert(text.begin(),pad_info[i]);
    }
    return text;
}
I padded to a multiple of 32 bits because I was thinking of using an array of unsigned int to store the values directly in a binary file, so that every 32 characters occupy 4 bytes. I used this function for that:
vector<unsigned int> build_byte_array(vector<char> padded_text)
{
    vector<unsigned int> byte_arr;
    for(int i=0;i<padded_text.size();i+=32)
    {
        string byte="";
        for(int j=i;j<i+32;j++){
            byte += padded_text[j];
        }
        unsigned int b = stoul(byte,nullptr,2);
        //cout<<b<<":"<<byte<<endl;
        byte_arr.push_back(b);
    }
    return byte_arr;
}
Now the problem is that when I write this byte array to a binary file using
ofstream output("compressed.bin",ios::binary);
for(int i=0;i<byte_array.size();i++){
    unsigned int a = byte_array[i];
    output.write((char*)(&a),sizeof(a));
}
I get a binary file that is bigger than the original text file. How do I solve this, or what error am I making?
Edit: I tried to compress a file of about 2,493 KB (for testing purposes) and it generated a compressed.bin file of 3,431 KB, so I don't think padding is the issue here. I also tried with a 15 KB file, but the size always increases after using this algorithm.
I tried using:
for(int i=0;i<byte_array.size();i++){
    unsigned int a = byte_array[i];
    char b = (char)a;
    output.write((char*)(&a),sizeof(b));
}
but after using this I am unable to recover the original byte array when decompressing the file.
CodePudding user response:
unsigned int a = byte_array[i];
output.write((char*)(&a),sizeof(a));
The size of the write is sizeof(a), which is usually 4 bytes.
An unsigned int is not a byte. A more suitable type for a byte would be std::byte, uint8_t, or unsigned char.
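To illustrate the point about byte-sized types, here is a minimal sketch of packing the '0'/'1' characters into unsigned char values and writing them out one byte each. The names pack_bits and write_bytes are hypothetical and not from the question, and the sketch assumes the bit vector has already been padded to a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: pack '0'/'1' characters into bytes, most significant
// bit first. Assumes bits.size() is already a multiple of 8.
std::vector<unsigned char> pack_bits(const std::vector<char>& bits)
{
    assert(bits.size() % 8 == 0);
    std::vector<unsigned char> bytes;
    bytes.reserve(bits.size() / 8);
    for (std::size_t i = 0; i + 8 <= bits.size(); i += 8) {
        unsigned char b = 0;
        for (std::size_t j = 0; j < 8; ++j) {
            // Shift the accumulated bits left and append the next one.
            b = static_cast<unsigned char>((b << 1) | (bits[i + j] == '1' ? 1 : 0));
        }
        bytes.push_back(b);
    }
    return bytes;
}

// Hypothetical helper: write the packed bytes in a single call.
// Each element occupies exactly one byte in the file.
bool write_bytes(const std::string& path, const std::vector<unsigned char>& bytes)
{
    std::ofstream output(path, std::ios::binary);
    output.write(reinterpret_cast<const char*>(bytes.data()),
                 static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(output);
}
```

Because every element is a single byte, the file size equals the number of bits divided by 8, and the result does not depend on the byte order of the machine.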
CodePudding user response:
You are expanding your data with padding, so if you're not getting much compression or there's not much data to begin with, the output could easily be larger.
You don't need to pad nearly as much as you do. First off, you are adding 32 bits when the data already ends on a word boundary (when num is a multiple of 32). Pad zero bits in that case. Second, you are inserting 32 bits at the start to record how many bits you padded, where five bits would suffice to encode 0..31. Third, you could write bytes instead of ints, so the padding on the end could be 0..7 bits, and you could prepend three bits instead of five. The padding overall could be reduced from your current 33..64 bits to 3..10 bits.
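The byte-oriented scheme described above could be sketched as follows. The function names pad_to_byte and unpad_from_byte are hypothetical, and the sketch assumes the 3-bit header is counted when aligning to a byte boundary, so the total length (header, data, padding) is a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: prepend a 3-bit pad count, then append 0..7 zero bits
// so that header + data + padding is a multiple of 8.
std::vector<char> pad_to_byte(const std::vector<char>& bits)
{
    const std::size_t header_bits = 3;
    // Number of trailing zero bits needed; 0 when already aligned.
    const std::size_t pad = (8 - (bits.size() + header_bits) % 8) % 8;

    std::vector<char> out;
    out.reserve(header_bits + bits.size() + pad);
    for (int shift = 2; shift >= 0; --shift) {
        out.push_back(((pad >> shift) & 1u) ? '1' : '0');
    }
    out.insert(out.end(), bits.begin(), bits.end());
    out.insert(out.end(), pad, '0');
    assert(out.size() % 8 == 0);
    return out;
}

// Hypothetical inverse: read the 3-bit pad count, then drop the header
// and that many trailing bits.
std::vector<char> unpad_from_byte(const std::vector<char>& padded)
{
    assert(padded.size() >= 3);
    std::size_t pad = 0;
    for (std::size_t i = 0; i < 3; ++i) {
        pad = (pad << 1) | (padded[i] == '1' ? 1u : 0u);
    }
    assert(padded.size() >= 3 + pad);
    return std::vector<char>(padded.begin() + 3,
                             padded.end() - static_cast<std::ptrdiff_t>(pad));
}
```

For example, a 3-bit input 101 becomes 01010100: the header 010 records two padding bits, followed by the data and two zeros. The overhead is then between 3 and 10 bits per file, regardless of its size.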