Massive Performance Problem - Using Channels in Julia


Summary

Benchmarking times for Channels in Julia - using a ~5GB tsv file

  • Baseline: Bash tools (cat, grep - baseline written in C)
    • ~ 2 seconds
  • Julia: Simple loop with eachline
    • ~ 4-5 seconds (2nd run, so excluding compilation time)
  • Julia Channel implementation
    • ~ 11 seconds (2nd run, so excluding compilation time)

Also:

  • Pure Python
    • ~ 4-5 seconds

Longer Explanation

I have been working toward the most performant, standard kind of multiprocessing design pattern: data is streamed from disk or from a download, pieces are fed to all cores on the system, and the output is serialized back to disk. This is obviously an important design to get right, since so many programming tasks fit this description.

Julia seems like a great choice for this, due to its reputation for performance.

To serialize the I/O to/from disk (or a download stream) and then send data to each processor, Channels seem to be the approach Julia suggests.

However, my tests so far seem to indicate that this is extremely non-performant.

The simplest example shows just how exceedingly slow Channels (and Julia!) are at this. It's been very disappointing.

A simple example of grep and cat (removing multiprocessing bits for clarity):

Julia code:

using CodecZlib: GzipDecompressorStream
using TranscodingStreams: NoopStream

    
"""
  A simple function to "generate" (place into a Channel) lines from a file
  - This mimics python-like behavior of 'yield'
"""
function cat_ch(fpath)
  Channel() do ch
    codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
    open(codec, fpath, "r") do stream
      for (i, l) in enumerate(eachline(stream))
        put!(ch, (i, l))
      end
    end
  end
end


function grep_ch(line_chnl, searchstr)
  Channel() do ch
    for (i, l) in line_chnl
      if occursin(searchstr, l)
          put!(ch, (i, l))
      end
    end
  end
end

function catgrep_ch(fpath, search)
  for (i, l) in grep_ch(cat_ch(fpath), search)
      println((i, l))
  end
end

function catgrep(fpath, search)
  codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
  open(codec, fpath, "r") do stream
    for (i, l) in enumerate(eachline(stream))
      if occursin(search, l)
        println((i,l))
      end
    end
  end
end

if abspath(PROGRAM_FILE) == @__FILE__
  fpath = ARGS[1]
  search = ARGS[2]
  catgrep_ch(fpath, search)
end

Performance Benchmarks

1) Baseline:

user@computer>> time (cat bigfile.tsv | grep seachterm)

real    0m1.952s
user    0m0.205s
sys 0m2.525s

2) Without Channels (simple loop) in Julia:

julia> include("test1.jl")

julia> @time catgrep("bigfile.tsv", "seachterm")
  4.448542 seconds (20.30 M allocations: 10.940 GiB, 5.00% gc time)

julia> @time catgrep("bigfile.tsv", "seachterm")
  4.512661 seconds (20.30 M allocations: 10.940 GiB, 4.87% gc time)

So the plain loop is already about 2-3x slower than the baseline, in the simplest possible case. Nothing fancy is done here at all, and it isn't due to compilation.

3) Channels in Julia:

julia> @time catgrep_ch("bigfile.tsv", "seachterm")
 11.691557 seconds (65.45 M allocations: 12.140 GiB, 3.06% gc time, 0.80% compilation time)

julia> @time catgrep_ch("bigfile.tsv", "seachterm")
 11.403931 seconds (65.30 M allocations: 12.132 GiB, 3.03% gc time)

This is really horrible, and I'm not sure how it becomes so sluggish.

Is the way in which Channels are used here wrong?

CodePudding user response:

Julia, grep, and Python use different string-searching algorithms. There are many such algorithms, and some are far better than others in specific cases.

grep is highly optimized to run quickly in many situations, including your specific use-case. Indeed, according to the GNU documentation, the Boyer-Moore fast string searching algorithm is used to match a single fixed pattern, and the Aho-Corasick algorithm is used to match multiple fixed patterns. In your specific use-case, Boyer-Moore is selected, and it is generally fast since it can skip parts of the input based on the searched string. Its best-case complexity is Ω(n/m) and its worst-case complexity is O(mn). It is extremely fast if the text rarely contains characters of the searched string. For example, searching for seachterm in this is a test with a pretty long sentence (repeated 58.5 million times) is 10 times faster than searching for iss, even though neither is present in the target file. This is because Boyer-Moore searches for the last letter of the searched string (an m) in the text, and since it cannot find it, it can skip ahead very quickly. There are other reasons why grep is so fast compared to most alternative methods. One of them is that grep does not create/allocate sub-strings for each line and uses a huge raw buffer instead. Note that cat bigfile.tsv | grep seachterm can be significantly slower than grep seachterm bigfile.tsv, since the pipe introduces significant overhead when the parsing itself is fast enough.

CPython uses a mix of different algorithms so as to be efficient in most cases. Based on the implementation, it uses a variant of the Boyer-Moore algorithm "incorporating ideas of Horspool and Sunday". The developers claim the resulting algorithm is faster than others such as Knuth-Morris-Pratt. For long strings, it uses an even faster, very efficient algorithm: Crochemore and Perrin's Two-Way algorithm (a mix of BM and KMP). This one runs in O(n + m) in the worst case, which is optimal. Note that while this implementation is great, splitting the lines of a file and creating many string objects can significantly decrease performance. This is certainly why your Python implementation is not as fast as grep.

In the Julia code, the file is split into lines, which introduces significant overhead and puts pressure on the garbage collector. Furthermore, occursin does not seem particularly optimized. There is no comment in the code about which algorithm is used, but it looks like a naive, generic brute-force algorithm running in O(mn) time. Such code cannot compete with optimized implementations of efficient algorithms like the ones in Python and grep.

Channels are a bit similar to coroutines and fibers (or any "light threads") with a FIFO queue to manage messages. Such a construct introduces significant overhead due to relatively expensive software-defined context switches (i.e. yields, which mainly consist of saving/restoring some registers). The negative effect on performance can also be delayed: light threads have their own stack and their own code context, so when the processor performs a light-thread context switch, this can cause data/code cache misses. For more information about how Channels work, you can read the documentation (which mentions an embedded task scheduler) or read the code directly.
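As a minimal illustration (mine, not part of the original answer): with an unbuffered Channel, exactly like the Channel() calls in the question, every put! suspends the producer task until the consumer takes a value, so there is one task switch per element.

# Minimal sketch: Channel(0) is unbuffered, like Channel() in the question.
ch = Channel(0) do c
    for i in 1:3
        put!(c, i)   # suspends this producer task until the consumer takes the value
    end
end

for x in ch          # each iteration resumes the producer for exactly one element
    println(x)
end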

In addition, Channels create objects/messages that need to be managed by the garbage collector, putting even more pressure on it. In fact, the number of allocations is more than 3 times larger in the Channel-based version. One can argue that the reported GC overhead is low, but such metrics often underestimate the overall overhead, which includes allocations, memory diffusion/fragmentation, GC collections, cache effects, etc. (and, in this case, even I/O overlapping effects).

I think the main problem with the Channel-based implementation is that the Channels in your code are unbuffered (see the documentation on this). Using wide buffers can significantly reduce the number of context switches and thus the overhead. This may increase latency, but there is often a trade-off to make between latency and throughput (especially in scheduling). Alternatively, note that there are some packages that can be faster than the built-in Channels.
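As a sketch of what a buffered version of the question's cat_ch might look like (the element type annotation and the buffer size of 10_000 are my own, arbitrary choices):

# Sketch only: same logic as cat_ch above, but with a typed, buffered Channel so
# the producer can run well ahead of the consumer instead of yielding on every line.
function cat_ch_buffered(fpath)
  Channel{Tuple{Int,String}}(10_000) do ch
    codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
    open(codec, fpath, "r") do stream
      for (i, l) in enumerate(eachline(stream))
        put!(ch, (i, l))
      end
    end
  end
end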

CodePudding user response:

Edit (with regard to new info from @chase)

@chase, as far as I understand, you are comparing the performance of yield in Python, which is a generator for non-materialized lists, against a Channel in Julia, which is a FIFO queue that supports multi-threaded insertion and polling of elements. In this case you are comparing two very different things (apples to oranges).

If your goal is to implement processing similar in spirit to grep, have a look at the performance tips below.

Performance tips

A Channel will add a big overhead, like any additional communication layer. If you need performance you need to:

  1. Use either @distributed or Threads.@threads to create parallel workers
  2. Each worker opens the file for reading
  3. Use seek to position each worker (e.g. with a 1000-byte file and 2 workers, the first one starts at byte 0 and the second does seek(500))
  4. Remember to implement this in such a way that you handle the situation where a worker starts reading in the middle of a line
  5. Operate directly on raw bytes rather than String (for performance); a rough sketch combining these tips follows below
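
A rough sketch of how tips 1-5 could fit together (my own code with a hypothetical name, grep_parallel; it assumes a plain uncompressed file, since a gzip stream cannot be split by byte offset, and it still allocates a String per line rather than scanning raw bytes as tip 5 recommends):

# Rough sketch: each thread opens the file itself, seeks to its byte range, skips
# the partial line it lands on, and collects the byte offsets of matching lines.
using Base.Threads

function grep_parallel(fpath, search)
    fsize = filesize(fpath)
    n = nthreads()
    results = Vector{Vector{Int}}(undef, n)    # byte offsets of matches, per worker

    @threads for w in 1:n
        found = Int[]
        open(fpath, "r") do io
            lo = div(fsize * (w - 1), n)       # this worker's byte range [lo, hi]
            hi = div(fsize * w, n)
            seek(io, lo)
            # Skip the partial line at the start of the chunk; the previous
            # worker is responsible for it (tip 4).
            w > 1 && readline(io)
            while !eof(io) && position(io) <= hi
                pos = position(io)
                line = readline(io)            # a raw-byte scan would be faster (tip 5)
                occursin(search, line) && push!(found, pos)
            end
        end
        results[w] = found
    end
    return reduce(vcat, results)
end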