Why is string creation so slow in Julia?-CodePudding

I'm maintaining a Julia library that contains a function to insert a new line after every 80 characters in a long string.

This function becomes extremely slow (seconds or more) when the string becomes longer than 1 million characters. Time seems to increase more than linearly, maybe quadratic. I don't understand why. Can someone explain?

This is some reproducible code:

function chop(s; nc=80)
    nr   = ceil(Int64, length(s)/nc)
    l(i) = 1 (nc*(i-1)) 
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows,'\n')
end

s = "A"^500000

chop(s)

It seems that this row is where most of the time is spent: rows = [String(s[l(i):r(i)]) for i in 1:nr]

Does that mean it takes long to initialize a new String? That wouldn't really explain the super-linear run time.

I know the canonical fast way to build strings is to use IOBuffer or the higher-level StringBuilders package: https://github.com/davidanthoff/StringBuilders.jl

Can someone help me understand why this code above is so slow nonetheless?

Weirdly, the below is much faster, just by adding s = collect(s):

function chop(s; nc=80)
    s = collect(s) #this line is new
    nr   = ceil(Int64, length(s)/nc)
    l(i) = 1 (nc*(i-1)) 
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows,'\n')
end

Update: I profiled the code and what apparently ends up slowing it down to a lot is the call to length(s) in the function r(i). Precomputing this constant speeds up a lot. Strange. Could it be that the nested functions prevent the compiler from optimizing? After all, length(s) should by no means be a slow operation by itself.

Here's the faster code:

function columns(s; nc=80)
    L = length(s) #this line is new
    nr   = ceil(Int64, L/nc)
    l(i) = 1 (nc*(i-1)) 
    r(i) = min(nc*i, L)
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows,'\n')
end

CodePudding user response：

String is immutable in Julia, if you need to work with a string in this way, it's much better to make a Vector{Char} first, to avoid repeatedly allocating new, big strings.

CodePudding user response：

You could operate on bytes

function chop2(s; nc=80)
    b = transcode(UInt8, s)
    nr   = ceil(Int64, length(b)/nc)
    l(i) = 1 (nc*(i-1)) 
    r(i) = min(nc*i, length(b))
    dat = UInt8[]
    for i in 1:nr
        append!(dat, @view(b[l(i):r(i)]))
        i < nr && push!(dat, UInt8('\n'))
    end
    String(dat)
end

and the benchmarks (around 5000x faster):

 @btime chop($s);
  1.531 s (6267 allocations: 1.28 MiB)

julia> @btime chop2($s);
  334.100 μs (13 allocations: 1.57 MiB)

Notes:

this code could be still made slightly faster by pre-allocating dat but I tried to bi similar to the original.
when having unicode characters neither yours nor this approach will not work as you cannot cut a unicode character in the middle

CodePudding user response：

My preference would be to use a generic one-liner solution, even if it is a bit slower than what Przemysław proposes (I have optimized it for simplicity not speed):

chop_and_join(s::Union{String,SubString{String}}; nc::Integer=80) =
    join((SubString(s, r) for r in findall(Regex(".{1,$nc}"), s)), '\n')

The benefit is that it correctly handles all Unicode characters and will also work with SubString{String}.