Home > Mobile >  Group by two columns where values may be switched
Group by two columns where values may be switched

Time:11-04

I have a dataframe that looks like this:

Start End Total
X Y 50
X Z 100
Y Z 50
Y X 100

I am trying to group by pairs "Start" --> "End", where values may be switched. I'm calculating the sum of "Total". The result would look like this:

Start End Total
X Y 150
X Z 100
Y Z 50

X --> Y and Y --> X are equivalent and I want to sum the both under X --> Y.

I am quite stumped with this problem, and I am not even sure how to proceed conceptually with this. My current code gives me one direction, but how do I combine both directions?

df2 = df.groupBy("Start", "End").sum()

CodePudding user response:

Merge 'Start' and 'End' columns into array and then aggregate.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('X', 'Y', 50),
     ('X', 'Z', 100),
     ('Y', 'Z', 50),
     ('Y', 'X', 100)],
    ['Start', 'End', 'Total'])

Script:

arr = F.array_sort(F.array('Start', 'End'))
df = df.groupBy(arr[0].alias('Start'), arr[1].alias('End')).agg(
    F.sum('Total').alias('Total')
)
df.show()
#  ----- --- ----- 
# |Start|End|Total|
#  ----- --- ----- 
# |    X|  Y|  150|
# |    X|  Z|  100|
# |    Y|  Z|   50|
#  ----- --- ----- 
  • Related