Best way to write large text file incrementally in Swift


I'm writing a fairly large text file (it's actually more like ASCII-encoded data), and it's... very slow. And uses a lot of memory.

Here's a minimalistic version of the code I'm using to test how to write files more quickly. writeFileIncrementally writes one line at a time in a for loop, while writeFileFromBigData creates one large string and then dumps it to disk. I was fully expecting writeFileFromBigData to be faster, but not 20 times faster! That's a bit more than I expected. For size=10_000_000, it takes 20-25 seconds to write the file incrementally and 1-1.5 seconds to write it in one go. Plus, the incremental version actually allocates more and more memory as it goes; by the end, it's well into the GiB range. I don't understand what's going on here.

import Foundation

func writeFileIncrementally(toUrl url: URL, size: Int) {
    // ensure file exists and is empty
    try? "".write(to: url, atomically: true, encoding: .ascii)
    
    guard let handle = try? FileHandle(forWritingTo: url) else {return}
    
    defer {
        handle.closeFile()
    }
    
    for i in 0..<size {
        let s = "\(i)\n"
        handle.write(s.data(using: .ascii)!)
    }
}

func writeFileFromBigData(toUrl url: URL, size: Int) {
    let s = (0..<size).map{String($0)}.joined(separator: "\n")
    
    try? s.write(to: url, atomically: true, encoding: .ascii)
}
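
For reference, I'm timing both functions with a trivial harness along these lines (the path is just an example; size = 10_000_000 for the numbers above):

    // Same file as the functions above, so Foundation is already imported.
    let url = URL(fileURLWithPath: "/tmp/numbers.txt")
    let start = Date()
    writeFileIncrementally(toUrl: url, size: 10_000_000)
    print("took \(Date().timeIntervalSince(start))s")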

Compare that to the same thing in Python. Creating the string first and then writing it is faster in Python as well. That's reasonable, but the gap is much smaller: it takes about 2.7 seconds to write the file incrementally (about 98% user time) and about 1 second to write it in one go (including creating the string). Additionally, the incremental version has constant memory usage; it does not go up as the file is being written.

def writeFileIncrementally(path, size):
    with open(path, "w") as f:
        for i in range(size):
            f.write(f"{i}\n")

def writeFileFromBigData(path, size):
    with open(path, "w") as f:
        f.write("\n".join(str(i) for i in range(size)))

So my question is twofold:

  1. Why is my writeFileIncrementally function so slow and why does it use so much memory? I was hoping to be able to write incrementally to reduce memory usage.
  2. Is there some better approach for incrementally writing a large text file in Swift?

CodePudding user response:

For memory, see Duncan C's answer. You need an autoreleasepool. But for speed, you have a small problem and a large problem.

The small problem is this line:

    handle.write(s.data(using: .ascii)!)

Rewriting that will save about 40% of your time (from 27s to 17s in my tests):

    handle.write(Data(s.utf8)) 

Strings are generally stored internally as UTF-8. ASCII is a strict subset of UTF-8, but your code still has to scan the string to check for anything that isn't ASCII, while .utf8 can often just hand over the internal buffer directly. It also avoids creating and unwrapping an Optional.

But 17s is still a lot more than 1-2s. That's due to your big problem.

Every call to write has to get the data all the way to the OS's file buffers. Not all the way to the disk, but still, it's an expensive operation. Unless the data is precious, you generally want to chunk it into larger blocks (4k is very common). If you do this, the write time goes down to 1.5s:

let bufferSize = 4*1024
var buffer = Data(capacity: bufferSize)
for i in 0..<size {
    autoreleasepool {
        let s = "\(i)\n"
        buffer.append(contentsOf: s.utf8)
        if buffer.count >= bufferSize {
            handle.write(buffer)
            buffer.removeAll(keepingCapacity: true)
        }
    }
}
// Write the final buffer
handle.write(buffer)

This is "pretty close" to the the "big data" function's 1.1s on my system. There's still a lot of memory allocation and cleanup going on. And in my experience, at least recently, [UInt8] is much faster than Data. I'm not sure that was always true, but all my most recent tests on Mac go that way. So, writing with the newer write(contentsOf:) interface is:

let bufferSize = 4*1024
var buffer: [UInt8] = []
buffer.reserveCapacity(bufferSize)
for i in 0..<size {
    autoreleasepool {
        let s = "\(i)\n"
        buffer.append(contentsOf: s.utf8)
        if buffer.count >= bufferSize {
            try? handle.write(contentsOf: buffer)
            buffer.removeAll(keepingCapacity: true)
        }
    }
}
// Write the final buffer
try? handle.write(contentsOf: buffer)

And that's faster than the big data function, because it doesn't have to make a Data. (830ms on my machine)

But wait, it gets better. This code doesn't need the autorelease pool, and if you remove that, I can write this file in 730ms.

let bufferSize = 4*1024
var buffer: [UInt8] = []
buffer.reserveCapacity(bufferSize)
for i in 0..<size {
    let s = "\(i)\n"
    buffer.append(contentsOf: s.utf8)
    if buffer.count >= bufferSize {
        try? handle.write(contentsOf: buffer)
        buffer.removeAll(keepingCapacity: true)
    }
}
// Write the final buffer
try? handle.write(contentsOf: buffer)

But what about Python? Why doesn't it need buffers to be fast? Because it gives you buffers by default. Your open call returns a BufferedWriter with an 8k buffer that works more or less like the above code. You'd need to write in binary mode and also pass buffering=0 to turn it off. See the docs on open for the details.
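
If you want that same convenience in Swift, the buffering above is easy to pull out into a small wrapper type. This is only a sketch of the idea, not a battle-tested implementation (the BufferedFileWriter name and the 8k default are my own, chosen to mirror Python's):

    import Foundation

    /// Sketch of a buffered writer that flushes to a FileHandle in chunks,
    /// roughly what Python's BufferedWriter does for you by default.
    struct BufferedFileWriter {
        private let handle: FileHandle
        private var buffer: [UInt8] = []
        private let bufferSize: Int

        init(handle: FileHandle, bufferSize: Int = 8 * 1024) {
            self.handle = handle
            self.bufferSize = bufferSize
            buffer.reserveCapacity(bufferSize)
        }

        // Append the string's UTF-8 bytes; flush once the buffer fills up.
        mutating func write(_ string: String) throws {
            buffer.append(contentsOf: string.utf8)
            if buffer.count >= bufferSize {
                try flush()
            }
        }

        // Hand whatever is buffered to the OS. Call once more at the end.
        mutating func flush() throws {
            guard !buffer.isEmpty else { return }
            try handle.write(contentsOf: buffer)
            buffer.removeAll(keepingCapacity: true)
        }
    }

Used like the loop above, but with the bookkeeping hidden:

    var writer = BufferedFileWriter(handle: handle)
    for i in 0..<size {
        try? writer.write("\(i)\n")
    }
    try? writer.flush() // don't forget the tail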

CodePudding user response:

I'm not sure why the incremental writing version is so slow.

If you're worried about memory use, though, you could make your memory footprint much smaller by wrapping your inner loop with a call to autoreleasepool():

        for i in 0..<size {
            autoreleasepool {
                let s = "\(i)\n"
                handle.write(s.data(using: .ascii)!)
                if i.isMultiple(of: 100000) {
                    print(i)
                }
            }
        }

(Internally, Swift's ARC memory management sometimes allocates temporary storage on the heap as "autoreleased", which means it sticks around in memory until the current call chain returns and the app revisits the event loop. If you have a processing loop that allocates a whole bunch of local variables, they can accumulate on the heap until you finish and return. It's really only a problem if you push up against the memory limits of the device, however.)

Edit:

I think this might be a case of premature optimization, however. It looks to me like the max memory consumption for the "write it all at once" version with 10,000,000 items is about 150 MB, which is not a problem for a device that's able to run current iOS versions. Just use the "write it all at once" version and be done with it. If you need to write billions of lines at once, then write hybrid code that breaks the work into chunks of 10 million lines at a time and appends each chunk to the file (with the inner loop wrapped in a call to autoreleasepool(), as shown above).
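
In case it helps, here's a rough sketch of what that hybrid might look like (the function name and chunk size are purely illustrative):

    import Foundation

    func writeFileInChunks(toUrl url: URL, size: Int, chunkSize: Int = 10_000_000) {
        // Start from an empty file, then append one chunk at a time.
        try? "".write(to: url, atomically: true, encoding: .ascii)
        guard let handle = try? FileHandle(forWritingTo: url) else { return }
        defer { handle.closeFile() }

        var start = 0
        while start < size {
            let end = min(start + chunkSize, size)
            autoreleasepool {
                // One big string per chunk, written in a single call.
                let chunk = (start..<end).map { String($0) }.joined(separator: "\n") + "\n"
                handle.write(Data(chunk.utf8))
            }
            start = end
        }
    }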
