I'm very new to pyspark/pandas and need help.
I have a table Table 1 that looks something like this
-----------------
Id ID2 qty
-----------------
A1 S1 5
A2 S1 5
A3 S2 10
----------------
I have a table Table 2 that looks something like this
-----------------
ID2 totalQty
-----------------
S1 15
S2 10
----------------
My result should be
-----------------
ID2 qty
-----------------
S1 5
S2 0
----------------
I tried joining both tables. I can find the row that nets to zero, but I don't know how to find the rows with excess quantity.
result_df = table1.join(table2, "ID2", "left").withColumn('remainingQty', F.col('totalQty') - F.col('qty'))
The result_df I get is:
----------------------------------------
Id ID2 qty totalQty remainingQty
----------------------------------------
A1 S1 5 15 10
A2 S1 5 15 10
A3 S2 10 10 0
-----------------------------------------
I know the first two rows in the table above are wrong. Need help with this.
CodePudding user response:
You first need to sum qty by ID2 in table1 before joining it with table2, as follows:
from pyspark.sql import functions as F
result_df = table1.groupBy('ID2').agg(F.sum('qty').alias('qty')) \
.join(table2, 'ID2') \
.select(F.col('ID2'), (F.col('totalQty') - F.col('qty')).alias('qty'))
And with your input table1 and table2, you will get the following result_df:
+---+---+
|ID2|qty|
+---+---+
|S2 |0  |
|S1 |5  |
+---+---+
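Since you mentioned pandas as well, here is the same aggregate-then-join approach sketched in plain pandas. The DataFrame contents below are assumed from the tables in your question:

```python
import pandas as pd

# Assumed data matching the question's table1 and table2
table1 = pd.DataFrame({"Id": ["A1", "A2", "A3"],
                       "ID2": ["S1", "S1", "S2"],
                       "qty": [5, 5, 10]})
table2 = pd.DataFrame({"ID2": ["S1", "S2"],
                       "totalQty": [15, 10]})

# Sum qty per ID2 first so the join does not duplicate rows,
# then subtract the summed qty from totalQty
summed = table1.groupby("ID2", as_index=False)["qty"].sum()
result = summed.merge(table2, on="ID2")
result["qty"] = result["totalQty"] - result["qty"]
result = result[["ID2", "qty"]]
print(result)
```

This mirrors the PySpark answer: the `groupby` collapses the two S1 rows into one before the merge, which is exactly what was missing from the original left join.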