I'm very new to pyspark/pandas and need help.
I have a table Table 1 that looks something like this
-----------------
Id ID2 qty
-----------------
A1 S1 5
A2 S1 5
A3 S2 10
----------------
I have a table Table 2 that looks something like this
-----------------
ID2 totalQty
-----------------
S1 15
S2 10
----------------
My result should be
-----------------
ID2 qty
-----------------
S1 5
S2 0
----------------
I tried joining both tables. I can find the row that nets to zero, but I don't know how to find the rows with excess quantity.
result_df = table1.join(table2, "ID2", "left").withColumn('remainingQty', F.col('totalQty') - F.col('qty'))
The result_df I get is:
----------------------------------------
Id ID2 qty totalQty remainingQty
----------------------------------------
A1 S1 5 15 10
A2 S1 5 15 10
A3 S2 10 10 0
-----------------------------------------
I know the first two rows in the table above are wrong. Need help with this.
CodePudding user response:
You first need to sum qty by ID2 in table1 before joining it with table2, as follows:
from pyspark.sql import functions as F
result_df = table1.groupBy('ID2').agg(F.sum('qty').alias('qty')) \
.join(table2, 'ID2') \
.select(F.col('ID2'), (F.col('totalQty') - F.col('qty')).alias('qty'))
And with your input table1 and table2, you will get the following result_df:
+---+---+
|ID2|qty|
+---+---+
|S2 |0  |
|S1 |5  |
+---+---+
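Since you mentioned pandas as well, here is the same aggregate-then-join approach sketched in plain pandas. The DataFrame contents below are assumed from the tables in your question:

```python
import pandas as pd

# Assumed data matching the question's table1 and table2
table1 = pd.DataFrame({"Id": ["A1", "A2", "A3"],
                       "ID2": ["S1", "S1", "S2"],
                       "qty": [5, 5, 10]})
table2 = pd.DataFrame({"ID2": ["S1", "S2"],
                       "totalQty": [15, 10]})

# Sum qty per ID2 first so the join does not duplicate rows,
# then subtract the summed qty from totalQty
summed = table1.groupby("ID2", as_index=False)["qty"].sum()
result = summed.merge(table2, on="ID2")
result["qty"] = result["totalQty"] - result["qty"]
result = result[["ID2", "qty"]]
print(result)
```

This mirrors the PySpark answer: the `groupby` collapses the two S1 rows into one before the merge, which is exactly what was missing from the original left join.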