I have a sparse symmetric matrix which represents authors of some book. Elements Ai,j and Aj,i are both equal to one if the people associated with indices i and j are coauthors and equal to zero otherwise. I'm trying to find a way in matrix representation such that given two columns (authors), I find their common co-authors. Preferably in Matlab or Julia code representation.
CodePudding user response:
The binary &
between the columns, applied elementwise, will return a vector with 1
s only where both columns had 1
s. You can do a findall
over that to then return the indices where the result is 1
, which indicates the common co-authors.
julia> A
5×5 SparseMatrixCSC{Bool, Int64} with 12 stored entries:
⋅ 1 ⋅ 1 1
1 ⋅ ⋅ ⋅ 1
⋅ ⋅ ⋅ ⋅ 1
1 ⋅ ⋅ ⋅ 1
1 1 1 1 ⋅
julia> common = A[:, 1] .& A[:, 5]
5-element SparseVector{Bool, Int64} with 2 stored entries:
[2] = 1
[4] = 1
julia> findall(common)
2-element Vector{Int64}:
2
4
This finds the common co-authors between authors 1 and 5, in Julia. The .
before the &
indicates that the operator should be applied elementwise. To generalize this, you can write it as a function like:
julia> function findcommoncoauths(adjmat, author1, author2)
@views findall(adjmat[:, author1] .& adjmat[:, author2])
end
(The @views
is to avoid unnecessarily allocating new memory for the columns, which is a good practice for performant code.)
CodePudding user response:
In case the author collaboration matrix is big, it is nice to take advantage of sparseness. SparseArrays
offers many optimized functions (some exported and some not). For example dot
(defined in LinearAlgebra) on SparseVectors which calculates dot product. Since the sparse matrices are stored in columns (CSC format), getting columns as SparseVectors is fast. So to count the number of common co-authors between authors i
and j
:
@views dot(adjmat[:,i], adjmat[:,j])
works fast. Here is an example:
julia> using SparseArrays, LinearAlgebra
julia> M = sprand(Bool,100,100,0.1); # random boolean matrix
julia> M = M .| M'; # symetric coauthorship
julia> M[diagind(M)] .= false; # no self-coauthorship
julia> M
100×100 SparseMatrixCSC{Bool, Int64} with 1841 stored entries:
⠊⡠⠗⠀⠕⡛⢲⣢⠼⠥⠀⢀⡂⠲⠦⠋⡐⠰⠆⠲⠊⠖⡷⠿⠅⠘⡫⠱⠳⠔
⠙⠁⣏⠙⡣⢎⠓⢮⢢⠣⡴⢀⡂⣊⢙⣈⣯⢮⡁⣂⣺⡏⡰⢁⠯⢜⠐⣈⠼⠜
⣵⠡⡩⢎⡁⡨⣦⣥⠙⡌⠯⡙⣏⡬⢛⠀⢡⢀⢔⢔⢲⡍⠚⣁⣉⢥⡁⣆⣤⠇
⠸⣲⡹⣄⠌⣿⢴⠓⣝⠒⣂⢕⠏⡱⡒⣠⠶⡰⠆⣈⠑⡖⣇⡐⠺⠠⠄⡖⠹⡦
⠖⡇⠬⡒⡓⠤⢳⠙⠋⠄⠜⢰⡖⠨⡀⡄⣒⡨⠒⡘⠤⠄⠆⠆⢮⢜⠥⠴⠔⠄
⠀⢀⠐⢋⣏⠣⢌⢜⢒⣁⣋⠘⡎⢍⠸⠢⢻⣸⡀⢋⡈⣂⣏⠋⢸⢰⣖⢰⡡⡆
⢨⡈⡨⢨⡋⡽⢏⡡⡘⡉⡎⢍⣊⠘⢯⣍⡸⡪⢅⣫⣾⡉⠶⠀⠿⠈⢥⠽⢅⡍
⡬⠃⡓⢰⠛⠐⠘⣨⠀⠬⠲⡂⡏⢷⣀⠘⠊⢲⠃⡰⠠⡆⢂⠔⣕⢂⡄⣺⠖⠮
⢐⡈⡫⣟⠁⢒⢘⡣⡘⡸⣛⣲⡲⡪⢪⣀⠎⠅⣒⣙⣙⡇⠳⢊⠕⢸⣢⣒⡫⣂
⢨⡁⠡⢨⢐⢕⡈⢡⣘⠠⡤⢈⡥⣱⢉⡠⣜⢸⢤⡷⠨⠍⣹⠀⠀⢰⢺⣫⠬⡥
⢪⠄⡾⠾⡜⠶⢱⠤⠀⠇⠢⢨⡞⠻⠠⠦⠷⠼⡆⠆⠴⠃⢾⠔⠖⠘⢇⠞⠲⡶
⣽⡏⠔⢊⠞⢠⢉⠹⠨⠅⡯⠙⠘⠃⢈⠔⡹⢂⠓⠚⢚⠗⡎⠉⠟⢊⠂⡑⣡⠁
⣁⠁⣋⢇⠇⣜⠚⡂⣊⢗⢒⣒⡛⠃⠱⢙⣑⣁⢀⣀⣘⠁⡻⢁⠺⠂⡿⢚⠟⡐
⢏⡊⡐⢠⠡⢬⢠⠥⢁⡇⢘⣙⣅⡗⣠⣩⢨⢺⡾⣲⣩⠕⢌⠠⣻⢋⠀⠀⢩⣣
⢙⠆⣒⠇⠤⠟⠳⡦⠐⠅⠡⠮⡅⠵⡸⡅⠫⢪⠆⡧⢸⡦⠅⠚⢛⠡⠧⣲⠀⠀
julia> @views dot(M[:,1], M[:,2])
5
The last line shows authors #1 and author #2 have 5 common coauthors.
The question wants a list of common coauthors, and unfortunately dot
loses this information by aggregation.
Another answer to this post (by Sundar) provides a solution:
function findcommoncoauths(adjmat, author1, author2)
@views findall(adjmat[:, author1] .& adjmat[:, author2])
end
giving the result for the above matrix:
julia> findcommoncoauths(M, 1, 2)
5-element Vector{Int64}:
17
74
78
80
88
This method employs the .&
operator which uses broadcasting and so doesn't take advantage of matrix sparsness. Additionally, although the question does not specify, the common co-authors might be needed just for processing and not necessarily stored, so it would be nice to iterate over them.
The following is an iterator to do so:
struct SparseColumnCommon{T,I}
iistart::I
jjstart::I
iimax::I
jjmax::I
parent::SparseMatrixCSC{T,I}
end
function sparsecolumncommon(M::SparseMatrixCSC{T,I}, i::I, j::I) where {T,I}
ii = first(nzrange(M,i))
iimax = last(nzrange(M,i))
jj = first(nzrange(M,j))
jjmax = last(nzrange(M,j))
SparseColumnCommon{T,I}(ii, jj, iimax, jjmax, M)
end
Base.eltype(::Type{SparseColumnCommon{T,I}}) where {T,I} = Tuple{I, T, T}
Base.IteratorSize(::Type{SparseColumnCommon{T,I}}) where {T,I} = Iterators.Base.SizeUnknown()
function Base.iterate(it::SparseColumnCommon, state=(it.iistart, it.jjstart))
ii, jj = state
while ii <= it.iimax && jj <= it.jjmax
ri, rj = (it.parent.rowval)[ii], (it.parent.rowval)[jj]
if ri == rj
return ((ri, (it.parent.nzval)[ii], (it.parent.nzval)[jj]), (ii 1, jj 1))
elseif ri < rj
ii = 1
else rj < ri
jj = 1
end
end
return nothing
end
With this defined (perhaps in a separate included file), we can use it to define:
findcommoncoauths2(adjmat, author1, author2) =
map(first, sparsecolumncommon(adjmat, author1, author2))
And like before:
julia> findcommoncoauths2(M,1,2)
5-element Vector{Int64}:
17
74
78
80
88
The iterator can also be used to process common co-authors one by one without allocating an output vector which is perhaps useful.
Finally, perhaps this iterator can also be used for other sparse matrix column tasks. It returns on each iteration a triplet: (row, column1val, column2val)
.
This is already a long answer, but benchmarks show that using this iterator is faster for sparse and big enough matrices (enough is not too large).