I have a Spark Dataset containing a single column of ArrayType, which denotes the path from one user to another through their mutual friends:
path |
---|
["Amy","John","Wally"] |
["Beth","Sally","Tim","Jacob"] |
What I would like to end up with is a table that explicitly lists the edges in the paths (i.e. an edge list):
src | dest |
---|---|
"Amy" | "John" |
"John" | "Amy" |
"John" | "Wally" |
"Beth" | "Sally" |
"Sally" | "Tim" |
"Tim" | "Sally" |
"Tim" | "Jacob" |
"Jacob" | "Tim" |
How should I go about trying to transform the former table into the latter one?
CodePudding user response:
You can turn each list into a list of edges (pairs) by applying `arrays_zip` to two `slice`s of it: one without the last element and one without the first element. This creates an array of structs. Then `explode` the resulting array so that each struct ends up in a separate row, and split the struct column into two separate columns (`withColumn`).
Finally, add the reversed edges and remove duplicates with `distinct`.
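To see what the zip of the two slices produces before involving Spark, here is a minimal plain-Python sketch of the same idea on the sample paths. The names `path_to_edges`, `forward` and `edge_set` are made up for illustration.

```python
paths = [["Amy", "John", "Wally"], ["Beth", "Sally", "Tim", "Jacob"]]


def path_to_edges(path):
    # path[:-1] drops the last node, path[1:] drops the first one;
    # zipping them pairs every node with its successor.
    return list(zip(path[:-1], path[1:]))


# Edges in the direction in which the paths are written.
forward = [edge for path in paths for edge in path_to_edges(path)]

# Add the reversed copy of every edge; the set removes duplicates.
edge_set = set(forward) | {(dest, src) for src, dest in forward}
```

The Spark version follows the same steps, with `slice`, `arrays_zip`, `explode` and `distinct` taking the place of the list slicing, `zip`, the flattening comprehension and the set.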
I assume that you are working with a DataFrame and using the Spark SQL functions.