I'm maintaining a Julia library that contains a function to insert a new line after every 80 characters in a long string.
This function becomes extremely slow (seconds or more) when the string becomes longer than 1 million characters. Time seems to increase more than linearly, maybe quadratic. I don't understand why. Can someone explain?
This is some reproducible code:
function chop(s; nc=80)
nr = ceil(Int64, length(s)/nc)
l(i) = 1 + nc*(i-1)
r(i) = min(nc*i, length(s))
rows = [String(s[l(i):r(i)]) for i in 1:nr]
return join(rows,'\n')
end
s = "A"^500000
chop(s)
It seems that most of the time is spent in this line: rows = [String(s[l(i):r(i)]) for i in 1:nr]
Does that mean it takes long to initialize a new String? That wouldn't really explain the super-linear run time.
I know the canonical fast way to build strings is to use an IOBuffer, or the higher-level StringBuilders package: https://github.com/davidanthoff/StringBuilders.jl
Can someone help me understand why the code above is so slow nonetheless?
Weirdly, the code below is much faster, just by adding s = collect(s):
function chop(s; nc=80)
s = collect(s) #this line is new
nr = ceil(Int64, length(s)/nc)
l(i) = 1 + nc*(i-1)
r(i) = min(nc*i, length(s))
rows = [String(s[l(i):r(i)]) for i in 1:nr]
return join(rows,'\n')
end
Update: I profiled the code, and what apparently slows it down a lot is the call to length(s) in the function r(i). Precomputing this value speeds things up a lot. Strange. Could it be that the nested closures prevent the compiler from optimizing? After all, length(s) should by no means be a slow operation by itself.
Here's the faster code:
function columns(s; nc=80)
L = length(s) #this line is new
nr = ceil(Int64, L/nc)
l(i) = 1 + nc*(i-1)
r(i) = min(nc*i, L)
rows = [String(s[l(i):r(i)]) for i in 1:nr]
return join(rows,'\n')
end
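For reference, here is a minimal sketch of the IOBuffer approach mentioned above (my own code, not from the post; it slices by byte index, so it assumes ASCII input like the "A"^500000 test string):

```julia
# Hypothetical IOBuffer-based variant: write rows into one growing buffer
# instead of allocating a separate String per row.
function chop_io(s::AbstractString; nc::Integer=80)
    io = IOBuffer()
    n = ncodeunits(s)        # byte length; safe for ASCII-only strings
    i = 1
    while i <= n
        j = min(i + nc - 1, n)
        write(io, SubString(s, i, j))
        j < n && write(io, '\n')
        i = j + 1
    end
    return String(take!(io))
end
```

Because ncodeunits is O(1) and the buffer grows amortized, this avoids both the repeated length(s) scans and the per-row String allocations.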
CodePudding user response:
String is immutable in Julia, and it is indexed by bytes, not characters, so length(s) has to scan the whole string. If you need to work with a string in this way, it's much better to make a Vector{Char} first, which gives O(1) length and indexing and avoids repeatedly allocating new, big strings.
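A minimal sketch of that suggestion (the function name is mine):

```julia
# Convert to Vector{Char} once, then slice by character position.
function chop_chars(s::AbstractString; nc::Integer=80)
    cs = collect(s)                  # Vector{Char}: O(1) length and indexing
    L = length(cs)
    nr = ceil(Int, L / nc)
    rows = [String(cs[1+nc*(i-1):min(nc*i, L)]) for i in 1:nr]
    return join(rows, '\n')
end
```

Since the slicing happens on whole Chars, this variant also stays correct for multi-byte Unicode input.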
CodePudding user response:
You could operate on the bytes instead:
function chop2(s; nc=80)
b = transcode(UInt8, s)
nr = ceil(Int64, length(b)/nc)
l(i) = 1 + nc*(i-1)
r(i) = min(nc*i, length(b))
dat = UInt8[]
for i in 1:nr
append!(dat, @view(b[l(i):r(i)]))
i < nr && push!(dat, UInt8('\n'))
end
String(dat)
end
and the benchmarks (around 5000x faster):
julia> @btime chop($s);
1.531 s (6267 allocations: 1.28 MiB)
julia> @btime chop2($s);
334.100 μs (13 allocations: 1.57 MiB)
Notes:
- this code could still be made slightly faster by pre-allocating dat, but I tried to stay similar to the original.
- with Unicode characters, neither your approach nor this one will work, as you cannot cut a multi-byte character in the middle.
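The pre-allocation note above can be sketched like this (my variant, with the same ASCII-only caveat as chop2): reserve the final size up front with sizehint! so append! never has to regrow the vector.

```julia
# chop2 with the output buffer pre-sized: payload bytes plus one '\n' per
# row boundary.
function chop2_prealloc(s; nc=80)
    b = transcode(UInt8, s)
    n = length(b)
    nr = ceil(Int, n / nc)
    dat = UInt8[]
    sizehint!(dat, n + max(nr - 1, 0))   # avoid incremental reallocation
    for i in 1:nr
        append!(dat, @view(b[1+nc*(i-1):min(nc*i, n)]))
        i < nr && push!(dat, UInt8('\n'))
    end
    return String(dat)
end
```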
CodePudding user response:
My preference would be to use a generic one-liner solution, even if it is a bit slower than what Przemysław proposes (I have optimized it for simplicity not speed):
chop_and_join(s::Union{String,SubString{String}}; nc::Integer=80) =
join((SubString(s, r) for r in findall(Regex(".{1,$nc}"), s)), '\n')
The benefit is that it correctly handles all Unicode characters and will also work with SubString{String}.
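A quick check of that one-liner on a multi-byte string (the definition is copied from above; the test string is my own):

```julia
chop_and_join(s::Union{String,SubString{String}}; nc::Integer=80) =
    join((SubString(s, r) for r in findall(Regex(".{1,$nc}"), s)), '\n')

# 20 Unicode characters, wrapped into rows of at most 3 characters;
# no multi-byte character is split, since the regex matches whole chars.
out = chop_and_join("αβγδ"^5; nc=3)
```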