Home > Software engineering >  Julia pairwise broadcast
Julia pairwise broadcast

Time:09-03

I would like to compare every pair of strings in a list of strings in Julia. One way to do it is

equal_strs = [(x == y) for x in str_list, y in str_list]

However, if I use broadcast as follows:

equal_strs = broadcast(==, str_list, str_list)

it returns a vector instead of a 2D array. Is there a way to output a 2D array using broadcast?

CodePudding user response:

Broadcasting works by expanding ("broadcasting") dimensions that do not have the same length, in a way such that an array with (for example) size Nx1xM broadcasted with a NxKx1 gives an NxKxM array.

This means that if you broadcast an operation with to length N vectors, you will get a length N vector.

So you need one string array to be a length N vector, and the other an 1xM matrix:

julia> using Random

julia> str1 = [randstring('A':'C', 3) for _ in 1:5]
5-element Vector{String}:
 "ACC"
 "CBC"
 "AAC"
 "CAB"
 "BAB"

1.8.0> str2 = [randstring('A':'C', 3) for _ in 1:4]
4-element Vector{String}:
 "ABB"
 "BAB"
 "CAA"
 "BBC"

1.8.0> str1 .== permutedims(str2)
5×4 BitMatrix:
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  1  0  0

permutedims will change a length N vector into a 1xN matrix.

BTW, you would very rarely use broadcast in your code (broadcast(==, a, b)), instead, use the dot syntax, a .== b, which is more idiomatic.

CodePudding user response:

You should have one vector transposed for the broadcasting machinery to build a matrix by expanding dimensions of the inputs to agree.

julia> str_list = ["hello", "car", "me", "good", "people", "good"];

julia> equal_strs = broadcast(==, str_list, permutedims(str_list))
6×6 BitMatrix:
 1  0  0  0  0  0
 0  1  0  0  0  0
 0  0  1  0  0  0
 0  0  0  1  0  1
 0  0  0  0  1  0
 0  0  0  1  0  1

Also, the following are similar.

equal_strs = str_list .== permutedims(str_list)
equal_strs = isequal.(str_list, permutedims(str_list))

CodePudding user response:

Will assume that by "list" you mean a Vector as there are no python-like lists in Julia. If you meant a tuple I would suggest converting it into a Vector anyway because broadcasting is best used with Arrays (which Vector is a subtype of).

str_list = ["one", "two", "three", "one", "two"]

Now you simply do

broadcast(==, str_list, permutedims(str_list))

or more concise with the dot operator

str_list .== permutedims(str_list)

What happens under the hood:

broadcasting in Julia works element-wise, so if you have 2 Vectors it will not do anything as the dimensions match.

But if you have a Vector and a Matrix (Vector is a 1D Array, and Matrix is a 2D array) with the shapes of (N,1) and (1,N) Julia will broadcast the 1 dimension giving you a Matrix of shape (N,N) which is what you want.

Now usually with numbers you would do ' instead of permutedims

num_list .== num_list'

as to why it doesn't work with strings see this answer.

CodePudding user response:

lst .== permutedims(lst) is a perfectly good method to find the result, as suggested by other answers. But it takes O(n^2) comparisons, and if the list is long, it might be better to use an O(n*log(n)) comparisons algorithm. Following is an implementation of such as algorithm with a little benchmark:

function equal_str(lst)
    sp = sortperm(lst)
    isp = invperm(sp)
    same = [i==1 ? false : lst[sp[i]]==lst[sp[i-1]] for i=1:length(lst)]
    ac = accumulate((k,v)-> ifelse(v==false, k 1, k), same; init=0)
    return [ ac[isp[i]]==ac[isp[j]] for i=1:length(lst),j=1:length(lst) ]
end

and the benchmark gives:

julia> using Random

julia> using BenchmarkTools

julia> lst = [randstring('A':'C',3) for i=1:40];

julia> show(lst)
["CBA", "CAB", "BCA", "AAC", "AAA", "ABC", "BBA", "CAB", "CBC", "CCA",
 "BCC", "BCB", "CAB", "BCB", "ACC", "CBC", "CCC", "CCB", "BCB", "BCB", 
 "ABA", "AAC", "CCC", "ABC", "BAC", "CAB", "BAB", "BCB", "CCA", "CAC", 
 "AAA", "BBC", "ABC", "BCB", "CBA", "CAA", "CAB", "CAC", "CBC", "CBC"]

julia> @btime $lst .== permutedims($lst) ;
  9.025 μs (5 allocations: 4.58 KiB)

julia> @btime equal_str($lst) ;
  6.112 μs (8 allocations: 3.08 KiB)

The larger the lst the bigger the difference would be. This applies only to comparing a list with itself, as the OP suggests. To compare two lists, a different algorithm should be employed for O(n*log(n)) time.

Finally, even this algorithm works a little too hard by sorting, but an O(n^2) time/space complexity is inherent in producing the result.

UPDATE: A more linear O(n) time calculation (still O(n^2) to make matrix):

function equal_str_2(lst)
    d = Dict{String,Int}()
    d2 = Dict{Int, Vector{Int}}()
    for p in pairs(lst)
        if haskey(d,p[2])
            push!(d2[d[p[2]]],p[1])
        else
            d[p[2]] = p[1]
            d2[p[1]] = [p[1]]
        end
    end
    res = zeros(Bool, (length(lst), length(lst)))
    for p in values(d2)
        for q in Iterators.product(p,p)
            res[q[1],q[2]] = true
            res[q[2], q[1]] = true
        end
    end
    return res
end

and benchmark with larger lst:

julia> lst = [randstring('A':'C',3) for i=1:140];

julia> @btime $lst .== permutedims($lst) ;
  99.094 μs (5 allocations: 6.89 KiB)

julia> @btime equal_str($lst) ;
  51.981 μs (9 allocations: 23.12 KiB)

julia> @btime equal_str_2($lst) ;
  21.539 μs (72 allocations: 27.47 KiB)
  • Related