Home > Software engineering >  How to compress data
How to compress data

Time:02-20

I'm trying to compress data to improve the space complexity, but I'm not sure if I'm incorrectly compressing data or incorrectly measuring the size.

I tried the following in the Playground.

import Foundation
import Compression

// Example data
struct MyData: Encodable {
    let property = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."
}

// I tried using MemoryLayout to measure the size of the uncompressed data
let size = MemoryLayout<MyData>.size
print("myData type size", size) // 16

let myData = MyData()
let myDataSize = MemoryLayout.size(ofValue: myData)
print("myData instance size", myDataSize) // 16

func run() {
    // 1. This shows the size of the encoded data
    guard let encoded = try? JSONEncoder().encode(myData) else { return }
    print("myData encoded size", encoded) // 589 bytes

    /// 2. This shows the size after using a first compression method
    guard let compressed = try? (encoded as NSData).compressed(using: .lzfse) else { return }
    let firstCompression = Data(compressed)
    print("firstCompression", firstCompression) // 491 bytes

    /// 3. Second compression method (just wanted to try a different compression method)
    let secondCompression = compress(encoded)
    print("secondCompression", secondCompression) // 491 bytes
    
    /// 4. Wanted to compare the size difference between compressed and uncompressed for a bigger data so here is the array of uncompressed data.
    var myDataArray = [MyData]()
    for _ in 0 ... 100 {
        myDataArray.append(MyData())
    }
    guard let encodedArray = try? JSONEncoder().encode(myDataArray) else { return }
    print("myData encodedArray size", encodedArray) // 59591 bytes
    print("memory layout", MemoryLayout.size(ofValue: encodedArray)) // 16
    
    /// 5. Compressed array
    var compressedArray = [Data]()
    for _ in 0 ... 100 {
        guard let compressed = try? (encoded as NSData).compressed(using: .lzfse) else { return }
        let data = Data(compressed)
        compressedArray.append(data)
    }
    guard let encodedCompressedArray = try? JSONEncoder().encode(compressedArray) else { return }
    print("myData compressed array size", encodedCompressedArray) // 66661 bytes
    print("memory layout", MemoryLayout.size(ofValue: encodedCompressedArray)) // 16

    /// 6. Compression using lzma
    var differentCompressionArray = [Data]()
    for _ in 0 ... 100 {
        guard let compressed = try? (encoded as NSData).compressed(using: .lzma) else { return }
        let data = Data(compressed)
        differentCompressionArray.append(data)
    }
    guard let encodedCompressedArray2 = try? JSONEncoder().encode(differentCompressionArray) else { return }
    print("myData compressed array size", encodedCompressedArray2) // 60702 bytes
    print("memory layout", MemoryLayout.size(ofValue: encodedCompressedArray2)) // 16
}

run()

// The implementation for the second compression method
func compress(_ sourceData: Data) -> Data {
    let pageSize = 128
    var compressedData = Data()
    
    do {
        let outputFilter = try OutputFilter(.compress, using: .lzfse) { (data: Data?) -> Void in
            if let data = data {
                compressedData.append(data)
            }
        }
        
        var index = 0
        let bufferSize = sourceData.count
        
        while true {
            let rangeLength = min(pageSize, bufferSize - index)
            
            let subdata = sourceData.subdata(in: index ..< index   rangeLength)
            index  = rangeLength
            
            try outputFilter.write(subdata)
            
            if (rangeLength == 0) {
                break
            }
        }
    }catch {
        fatalError("Error occurred during encoding: \(error.localizedDescription).")
    }
    
    return compressedData
}

The MemoryLayout object doesn't seem to be helpful in measuring the size of encoded arrays whether or not they're compressed. I'm not sure how to measure a struct or an array of struts without encoding them with JSONEncoder which already compresses the data.

The before/after compression for the single instance of MyData (#1, #2, and #3) seems to show that the data is being properly compressed going from 589 bytes to 491 bytes. However, the comparison between an array of uncompressed data and an array of compressed data (#4, #5) seems to show that the size increased from 59591 to 66661 after the compression.

Finally, I tried using a different compression algorithm lzma (#6). It reduced the size to 60702 which is lower than the previous compression, but it still wasn't smaller than the uncompressed data.

CodePudding user response:

To get a bit of confusion out of the way first: MemoryLayout gives you information about the size and structure of the layout of a type at compile time, but can't be used to determine the amount of storage an Array value needs at runtime because the size of the Array structure itself does not depend on how much data it contains.

Highly simplified, the layout of an Array value looks like this:

┌─────────────────────┐                        
│        Array        │                        
├──────────┬──────────┤    ┌──────────────────┐
│  length  │  buffer ─┼───▶│     storage      │
└──────────┴──────────┘    └──────────────────┘
  1 word /   1 word /                          
  8 bytes    8 bytes                           
 └─────────┬─────────┘
           └─▶ MemoryLayout<Array<UInt8>>.size

An Array value stores its length, or count (mixed in with some flags, but we don't need to worry about that) and a pointer to the actual space where the items it contains are stored. Those items aren't stored as part of the Array value itself, but separately in allocated memory which the Array points to. Whether the Array "contains" 10 values or 100000 values, the size of the Array structure remains the same: 1 word (or 8 bytes on a 64-bit system) for the length, and 1 word for the pointer to the actual underlying storage. (The size of the storage buffer, however, is exactly determined by the number of elements it is able to contain, at runtime.)

In practice, Array is significantly more complicated than this for bridging and other reasons, but this is the basic gist; this is why you only ever see MemoryLayout.size(ofValue:) return the same number every time. [And incidentally, the size of String is the same as Array for similar reasons, which is why MemoryLayout<MyData>.size also reports 16.]

In order to know how many bytes an Array or a Data effectively take up, it's sufficient to ask them for their .count: Array<UInt8> and Data are both collections of UInt8 values (bytes), and their .count will reflect the amount of data effectively stored in their underlying storage.


As for the size increase between step (4) and (5), note that

  • Step 4 takes 100 copies of your MyData and joins them together before converting them to JSON, while
  • Step 5 takes 100 copies of individually compressed MyData instances, joins those together, and then re-coverts them to JSON

Step 5 has a few issues compared to step 4:

  1. Compression benefits heavily from repetition in data: a bit of data compressed and repeated 100 times won't be nearly as compact as a bit of data repeated 100 times, then compressed, because each round of compression can't benefit from knowing that there's another copy of the data that came before it. As a simple example:
    • Let's say we wanted to use a form of run-length encoding to compress the string Hello: there isn't a lot we can do, except maybe turn it into Hel{2}o (where {2} indicates a repetition of the last character 2 times)
    • If we compress Hello and join it 3 times, we get might get Hel{2}oHel{2}oHel{2}o,
    • But if we first joined Hello 3 times and then compressed, we could get {Hel{2}o}{3}, which is much more compact
  2. Compression also typically needs to insert some information about how the data was compressed in order to be able to recognize and decompress the data later. By compressing MyData 100 times and joining all of those instances, you're repeating that metadata 100 times
  3. Even after compressing your MyData instances, re-representing them as JSON decreases how compressed they are because it can't represent the binary data exactly. Instead, it has to convert each Data blob into a Base64-encoded string, which causes it to grow again

Between these issues, it's not terribly surprising that your data is growing. What you actually want is a modification to step 4, which is compressing the joined data:

guard let encodedArray = try? JSONEncoder().encode(myDataArray) else { fatalError() }
guard let compressedEncodedArray = try? (encodedArray as NSData).compressed(using: .lzma) else { fatalError() }
print(compressedEncodedArray.count) // => 520

This is significantly better than

guard let encodedCompressedArray = try? JSONEncoder().encode(compressedArray) else { fatalError() }
print(encodedCompressedArray.count) // => 66661

As an aside: it seems unlikely that you're actually using JSONEncoder in practice to join data in this way, and this was just for measurement here — but if you actually are, consider other mechanisms for doing this. Converting binary data to JSON in this way is very inefficient storage-wise, and with a bit more information about what you might actually need in practice, we might be able to recommend a more effective way to do this.

If what you're actually doing in practice is encoding an Encodable object tree and then compressing that the one time, that's totally fine.

  • Related