I'm trying to compress data to improve the space complexity, but I'm not sure if I'm incorrectly compressing data or incorrectly measuring the size.
I tried the following in the Playground.
import Foundation
import Compression
// Example data
struct MyData: Encodable {
let property = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."
}
// I tried using MemoryLayout to measure the size of the uncompressed data
let size = MemoryLayout<MyData>.size
print("myData type size", size) // 16
let myData = MyData()
let myDataSize = MemoryLayout.size(ofValue: myData)
print("myData instance size", myDataSize) // 16
func run() {
// 1. This shows the size of the encoded data
guard let encoded = try? JSONEncoder().encode(myData) else { return }
print("myData encoded size", encoded) // 589 bytes
/// 2. This shows the size after using a first compression method
guard let compressed = try? (encoded as NSData).compressed(using: .lzfse) else { return }
let firstCompression = Data(compressed)
print("firstCompression", firstCompression) // 491 bytes
/// 3. Second compression method (just wanted to try a different compression method)
let secondCompression = compress(encoded)
print("secondCompression", secondCompression) // 491 bytes
/// 4. Wanted to compare the size difference between compressed and uncompressed for a bigger data so here is the array of uncompressed data.
var myDataArray = [MyData]()
for _ in 0 ... 100 {
myDataArray.append(MyData())
}
guard let encodedArray = try? JSONEncoder().encode(myDataArray) else { return }
print("myData encodedArray size", encodedArray) // 59591 bytes
print("memory layout", MemoryLayout.size(ofValue: encodedArray)) // 16
/// 5. Compressed array
var compressedArray = [Data]()
for _ in 0 ... 100 {
guard let compressed = try? (encoded as NSData).compressed(using: .lzfse) else { return }
let data = Data(compressed)
compressedArray.append(data)
}
guard let encodedCompressedArray = try? JSONEncoder().encode(compressedArray) else { return }
print("myData compressed array size", encodedCompressedArray) // 66661 bytes
print("memory layout", MemoryLayout.size(ofValue: encodedCompressedArray)) // 16
/// 6. Compression using lzma
var differentCompressionArray = [Data]()
for _ in 0 ... 100 {
guard let compressed = try? (encoded as NSData).compressed(using: .lzma) else { return }
let data = Data(compressed)
differentCompressionArray.append(data)
}
guard let encodedCompressedArray2 = try? JSONEncoder().encode(differentCompressionArray) else { return }
print("myData compressed array size", encodedCompressedArray2) // 60702 bytes
print("memory layout", MemoryLayout.size(ofValue: encodedCompressedArray2)) // 16
}
run()
// The implementation for the second compression method
func compress(_ sourceData: Data) -> Data {
let pageSize = 128
var compressedData = Data()
do {
let outputFilter = try OutputFilter(.compress, using: .lzfse) { (data: Data?) -> Void in
if let data = data {
compressedData.append(data)
}
}
var index = 0
let bufferSize = sourceData.count
while true {
let rangeLength = min(pageSize, bufferSize - index)
let subdata = sourceData.subdata(in: index ..< index rangeLength)
index = rangeLength
try outputFilter.write(subdata)
if (rangeLength == 0) {
break
}
}
}catch {
fatalError("Error occurred during encoding: \(error.localizedDescription).")
}
return compressedData
}
The MemoryLayout
object doesn't seem to be helpful in measuring the size of encoded arrays whether or not they're compressed. I'm not sure how to measure a struct or an array of struts without encoding them with JSONEncoder
which already compresses the data.
The before/after compression for the single instance of MyData
(#1, #2, and #3) seems to show that the data is being properly compressed going from 589 bytes to 491 bytes. However, the comparison between an array of uncompressed data and an array of compressed data (#4, #5) seems to show that the size increased from 59591 to 66661 after the compression.
Finally, I tried using a different compression algorithm lzma
(#6). It reduced the size to 60702 which is lower than the previous compression, but it still wasn't smaller than the uncompressed data.
CodePudding user response:
To get a bit of confusion out of the way first: MemoryLayout
gives you information about the size and structure of the layout of a type at compile time, but can't be used to determine the amount of storage an Array
value needs at runtime because the size of the Array
structure itself does not depend on how much data it contains.
Highly simplified, the layout of an Array
value looks like this:
┌─────────────────────┐
│ Array │
├──────────┬──────────┤ ┌──────────────────┐
│ length │ buffer ─┼───▶│ storage │
└──────────┴──────────┘ └──────────────────┘
1 word / 1 word /
8 bytes 8 bytes
└─────────┬─────────┘
└─▶ MemoryLayout<Array<UInt8>>.size
An Array
value stores its length, or count
(mixed in with some flags, but we don't need to worry about that) and a pointer to the actual space where the items it contains are stored. Those items aren't stored as part of the Array
value itself, but separately in allocated memory which the Array
points to. Whether the Array
"contains" 10 values or 100000 values, the size of the Array
structure remains the same: 1 word (or 8 bytes on a 64-bit system) for the length, and 1 word for the pointer to the actual underlying storage. (The size of the storage buffer, however, is exactly determined by the number of elements it is able to contain, at runtime.)
In practice, Array
is significantly more complicated than this for bridging and other reasons, but this is the basic gist; this is why you only ever see MemoryLayout.size(ofValue:)
return the same number every time. [And incidentally, the size of String
is the same as Array
for similar reasons, which is why MemoryLayout<MyData>.size
also reports 16
.]
In order to know how many bytes an Array
or a Data
effectively take up, it's sufficient to ask them for their .count
: Array<UInt8>
and Data
are both collections of UInt8
values (bytes), and their .count
will reflect the amount of data effectively stored in their underlying storage.
As for the size increase between step (4) and (5), note that
- Step 4 takes 100 copies of your
MyData
and joins them together before converting them to JSON, while - Step 5 takes 100 copies of individually compressed
MyData
instances, joins those together, and then re-coverts them to JSON
Step 5 has a few issues compared to step 4:
- Compression benefits heavily from repetition in data: a bit of data compressed and repeated 100 times won't be nearly as compact as a bit of data repeated 100 times, then compressed, because each round of compression can't benefit from knowing that there's another copy of the data that came before it. As a simple example:
- Let's say we wanted to use a form of run-length encoding to compress the string
Hello
: there isn't a lot we can do, except maybe turn it intoHel{2}o
(where{2}
indicates a repetition of the last character2
times) - If we compress
Hello
and join it 3 times, we get might getHel{2}oHel{2}oHel{2}o
, - But if we first joined
Hello
3 times and then compressed, we could get{Hel{2}o}{3}
, which is much more compact
- Let's say we wanted to use a form of run-length encoding to compress the string
- Compression also typically needs to insert some information about how the data was compressed in order to be able to recognize and decompress the data later. By compressing
MyData
100 times and joining all of those instances, you're repeating that metadata 100 times - Even after compressing your
MyData
instances, re-representing them as JSON decreases how compressed they are because it can't represent the binary data exactly. Instead, it has to convert eachData
blob into a Base64-encoded string, which causes it to grow again
Between these issues, it's not terribly surprising that your data is growing. What you actually want is a modification to step 4, which is compressing the joined data:
guard let encodedArray = try? JSONEncoder().encode(myDataArray) else { fatalError() }
guard let compressedEncodedArray = try? (encodedArray as NSData).compressed(using: .lzma) else { fatalError() }
print(compressedEncodedArray.count) // => 520
This is significantly better than
guard let encodedCompressedArray = try? JSONEncoder().encode(compressedArray) else { fatalError() }
print(encodedCompressedArray.count) // => 66661
As an aside: it seems unlikely that you're actually using JSONEncoder
in practice to join data in this way, and this was just for measurement here — but if you actually are, consider other mechanisms for doing this. Converting binary data to JSON in this way is very inefficient storage-wise, and with a bit more information about what you might actually need in practice, we might be able to recommend a more effective way to do this.
If what you're actually doing in practice is encoding an Encodable
object tree and then compressing that the one time, that's totally fine.