I have a dataframe that looks like this:
| Start | End | Total |
|---|---|---|
| X | Y | 50 |
| X | Z | 100 |
| Y | Z | 50 |
| Y | X | 100 |
I am trying to group by the pair "Start" --> "End", treating the two directions of a pair as equivalent, and calculate the sum of "Total". The result would look like this:
| Start | End | Total |
|---|---|---|
| X | Y | 150 |
| X | Z | 100 |
| Y | Z | 50 |
X --> Y and Y --> X are equivalent, and I want to sum both under X --> Y.
I am quite stumped by this problem and am not even sure how to approach it conceptually. My current code keeps each direction separate, but how do I combine both directions?
```python
df2 = df.groupBy("Start", "End").sum()
```
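For reference, on the data above this leaves each direction as its own row (row order may vary):

```python
df2.show()
# +-----+---+----------+
# |Start|End|sum(Total)|
# +-----+---+----------+
# |    X|  Y|        50|
# |    X|  Z|       100|
# |    Y|  Z|        50|
# |    Y|  X|       100|
# +-----+---+----------+
```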
CodePudding user response:
Merge the 'Start' and 'End' columns into a sorted array, so both directions of a pair produce the same key, and then aggregate.
Input:
```python
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('X', 'Y', 50),
     ('X', 'Z', 100),
     ('Y', 'Z', 50),
     ('Y', 'X', 100)],
    ['Start', 'End', 'Total'])
```
Script:
```python
# Sort each (Start, End) pair so that X --> Y and Y --> X map to the same key
arr = F.array_sort(F.array('Start', 'End'))

df = df.groupBy(arr[0].alias('Start'), arr[1].alias('End')).agg(
    F.sum('Total').alias('Total')
)
df.show()
# +-----+---+-----+
# |Start|End|Total|
# +-----+---+-----+
# |    X|  Y|  150|
# |    X|  Z|  100|
# |    Y|  Z|   50|
# +-----+---+-----+
```
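As a side note, for exactly two columns you could get the same result without building an array, using F.least and F.greatest to canonicalize each pair (a sketch, assuming the same df as above):

```python
from pyspark.sql import functions as F

# least()/greatest() pick the smaller/larger of the two values,
# so ('Y', 'X') and ('X', 'Y') both collapse onto ('X', 'Y')
df2 = df.groupBy(
    F.least('Start', 'End').alias('Start'),
    F.greatest('Start', 'End').alias('End')
).agg(F.sum('Total').alias('Total'))
df2.show()
```

The array_sort version has the advantage that it generalizes directly if you ever need to canonicalize more than two columns.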