Summary
Benchmark times for Channels in Julia, using a ~5 GB TSV file:
- Baseline: Bash tools (cat, grep; written in C)
  - ~2 seconds
- Julia: simple loop with eachline
  - ~4-5 seconds (second run, so compilation is excluded)
- Julia: Channel implementation
  - ~11 seconds (second run, so compilation is excluded)
Also:
- Pure Python
  - ~4-5 seconds
Longer Explanation
I have been working towards the most performant, standard kind of multiprocessing design pattern: data is streamed from disk (or from a download), pieces are fed to all cores on the system, and the output is then serialized back to disk. This is obviously an important design to get right, since a huge number of programming tasks fall within this description.
Julia seems like a great choice for this, given its supposed ability to be performant.
To serialize the I/O to/from disk (or a download) and then send data to each processor, Channels seem to be Julia's suggested tool.
However, my tests so far indicate that this is extremely non-performant.
The simplest example shows just how exceedingly slow Channels (and Julia!) are at this. It has been very disappointing.
A simple example mimicking grep and cat (with the multiprocessing bits removed for clarity):
Julia code:
using CodecZlib: GzipDecompressorStream
using TranscodingStreams: NoopStream

"""
A simple function to "generate" (place into a Channel) lines from a file
- This mimics python-like behavior of 'yield'
"""
function cat_ch(fpath)
    Channel() do ch
        codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
        open(codec, fpath, "r") do stream
            for (i, l) in enumerate(eachline(stream))
                put!(ch, (i, l))
            end
        end
    end
end

function grep_ch(line_chnl, searchstr)
    Channel() do ch
        for (i, l) in line_chnl
            if occursin(searchstr, l)
                put!(ch, (i, l))
            end
        end
    end
end

function catgrep_ch(fpath, search)
    for (i, l) in grep_ch(cat_ch(fpath), search)
        println((i, l))
    end
end

function catgrep(fpath, search)
    codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
    open(codec, fpath, "r") do stream
        for (i, l) in enumerate(eachline(stream))
            if occursin(search, l)
                println((i, l))
            end
        end
    end
end

if abspath(PROGRAM_FILE) == @__FILE__
    fpath = ARGS[1]
    search = ARGS[2]
    catgrep_ch(fpath, search)
end
Performance Benchmarks
1) Baseline:
user@computer>> time (cat bigfile.tsv | grep seachterm)
real 0m1.952s
user 0m0.205s
sys 0m2.525s
2) Without Channels (simple loop) in Julia:
julia> include("test1.jl")
julia> @time catgrep("bigfile.tsv", "seachterm")
4.448542 seconds (20.30 M allocations: 10.940 GiB, 5.00% gc time)
julia> @time catgrep("bigfile.tsv", "seachterm")
4.512661 seconds (20.30 M allocations: 10.940 GiB, 4.87% gc time)
So the plain loop is already 2-3x worse than the baseline, in the simplest possible case. Nothing fancy is done here at all, and it isn't due to compilation overhead.
3) Channels in Julia:
julia> @time catgrep_ch("bigfile.tsv", "seachterm")
11.691557 seconds (65.45 M allocations: 12.140 GiB, 3.06% gc time, 0.80% compilation time)
julia> @time catgrep_ch("bigfile.tsv", "seachterm")
11.403931 seconds (65.30 M allocations: 12.132 GiB, 3.03% gc time)
This is really horrible, and I'm not sure why it becomes so sluggish.
Is the way in which Channels are used here wrong?
CodePudding user response:
Julia, grep, and Python use different algorithms when it comes to string searching. There are many such algorithms, and some are far better than others in specific cases.
grep is highly optimized to run quickly in many situations, including in your specific use case. Indeed, according to the GNU documentation, the Boyer-Moore fast string-searching algorithm is used to match a single fixed pattern, and the Aho-Corasick algorithm is used to match multiple fixed patterns. In your specific use case, Boyer-Moore is selected, and it is generally fast since it can skip parts of the input based on the searched string. Its best-case complexity is Ω(n/m) and its worst-case complexity is O(mn). It is extremely fast when the text rarely contains characters of the searched string. For example, searching for "seachterm" in "this is a test with a pretty long sentence" (repeated 58.5 million times) is 10 times faster than searching for "iss", even though neither is present in the target file. This is because Boyer-Moore looks for the last letter of the searched string (an "m") in the text and, never finding it, can skip ahead very quickly. There are other reasons why grep is so fast compared to most alternative methods. One of them is that grep does not create/allocate sub-strings for each line but works in one huge raw buffer instead. Note that cat bigfile.tsv | grep seachterm can be significantly slower than grep seachterm bigfile.tsv, since the pipe introduces significant overhead when the parsing itself is fast enough.
CPython uses a mix of different algorithms in order to be efficient in most cases. Based on the implementation, it uses a variant of Boyer-Moore "incorporating ideas of Horspool and Sunday". The developers claim the resulting algorithm is faster than others such as Knuth-Morris-Pratt. For long strings, it uses an even faster, very efficient algorithm: Crochemore and Perrin's Two-Way algorithm (a mix of Boyer-Moore and Knuth-Morris-Pratt), which runs in O(n + m) in the worst case, which is optimal. Note that while this implementation is great, splitting the file into lines and creating many string objects can significantly decrease performance. This is most likely why your Python implementation is not so fast compared to grep.
In the Julia code, the file is split into lines, which introduces significant overhead and puts pressure on the garbage collector. Furthermore, occursin does not seem particularly optimized. There is no comment in the code about which algorithm is used; that being said, it looks like a naive, generic brute-force algorithm running in O(mn) time. Such code cannot compete with optimized implementations of efficient algorithms like the ones in Python and grep.
Channels are a bit similar to coroutines and fibers (or any "light threads") combined with a FIFO queue to manage messages. Such a construct introduces significant overhead due to expensive software-defined context switches (i.e., yield, which mainly consists of saving/restoring some registers). The negative effect on performance can also be delayed: light-threading systems have their own stack and their own code context, so when the processor performs a light-thread context switch, this can cause data/code cache misses. For more information about how Channels work, you can read the documentation (which mentions an embedded task scheduler) or read the code directly.
In addition, channels create objects/messages that need to be managed by the garbage collector, putting even more pressure on it. In fact, the number of allocations is more than 3 times larger in the channel-based version. One can argue that the reported GC overhead is low, but such metrics often underestimate the overall cost, which includes allocations, memory diffusion/fragmentation, GC collections, cache effects, etc. (and, in this case, even effects on I/O overlapping).
I think the main problem with the channel-based implementation is that the channels in your code are unbuffered (see the documentation on this). Using wide buffers can significantly reduce the number of context switches and thus the overhead. This may increase latency, but there is often a trade-off between latency and throughput (especially in scheduling). Alternatively, note that there are packages that can be faster than the built-in channels.
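For instance, here is a rough sketch of a buffered, batched variant of cat_ch (the function name cat_batched and the batch/buffer sizes are illustrative guesses, not tuned or tested values, and only the plain-text case is handled):

# Hypothetical sketch: ship batches of lines through a buffered Channel, so the
# producer and consumer switch contexts once per batch instead of once per line.
function cat_batched(fpath; batchsize=10_000, bufsize=32)
    Channel{Vector{String}}(bufsize) do ch
        open(fpath, "r") do io
            batch = String[]
            for l in eachline(io)
                push!(batch, l)
                if length(batch) == batchsize
                    put!(ch, batch)
                    batch = String[]
                end
            end
            isempty(batch) || put!(ch, batch)   # flush the final partial batch
        end
    end
end

# The consumer stays a plain loop over batches:
# for batch in cat_batched("bigfile.tsv"), l in batch
#     occursin("seachterm", l) && println(l)
# end

The idea is simply that the producer and the consumer synchronize once per batch of lines rather than once per line, which amortizes the cost of each context switch.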
CodePudding user response:
Edit (with regard to new info from @chase)
@chase, as far as I understand, you are comparing the performance of yield in Python, which is a generator over non-materialized lists, against a Channel in Julia, which is a FIFO queue that supports multi-threaded insertion and polling of elements. In this case you are comparing two very different things (apples to oranges).
If your goal is to implement processing similar in spirit to grep, have a look at the performance tips below.
Performance tips
A Channel will add a big overhead, like any additional communication layer. If you need performance, you need to:
- Use either @distributed or Threads.@threads to create parallel workers
- Have each worker open the file for reading itself
- Use seek to assign each worker its own location (e.g., with a 1000-byte file and 2 workers, the first one starts at byte 0 and the second does seek(500))
- Remember to implement a mechanism that handles the case where a worker's chunk starts or ends in the middle of a line
- Operate directly on raw bytes rather than String (for performance); a rough sketch combining these tips follows below
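A rough sketch of those tips combined (the name parallel_count, the even byte-range split, and the per-line String reads are my own simplifications for illustration, not code from any package):

using Base.Threads

function parallel_count(fpath::AbstractString, pattern::AbstractString)
    total = filesize(fpath)
    n = nthreads()
    counts = zeros(Int, n)
    @threads for t in 1:n
        lo = div(total * (t - 1), n)   # this worker's byte range is [lo, hi)
        hi = div(total * t, n)
        open(fpath, "r") do io
            if lo > 0
                # A line straddling lo belongs to the previous worker, so skip
                # ahead to the first line that starts at or after lo.
                seek(io, lo - 1)
                readline(io)
            end
            # Process every line that starts inside [lo, hi), even if it ends past hi.
            while !eof(io) && position(io) < hi
                line = readline(io)
                occursin(pattern, line) && (counts[t] += 1)
            end
        end
    end
    return sum(counts)
end

This still allocates a String per line through readline; a raw-byte version would read blocks into a reusable buffer and scan for newlines itself, but the seek/realignment logic stays the same.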