PySpark : Find Excess quantity in table


I'm very new to PySpark/pandas and need some help.

I have a table, Table 1, that looks something like this:

-----------------
Id    ID2   qty 
-----------------
A1    S1     5
A2    S1     5
A3    S2     10
----------------

I have another table, Table 2, that looks something like this:

-----------------
ID2   totalQty 
-----------------
S1     15
S2     10
----------------

My expected result is:

-----------------
ID2   qty 
-----------------
S1     5
S2     0
----------------

I tried joining both tables. I can find the rows where the remaining quantity comes to zero, but I don't know how to find the rows with excess quantity:

result_df = table1.join(table2, "ID2", "left").withColumn('remainingQty', F.col('totalQty') - F.col('qty'))

the result_df I get is

----------------------------------------
Id    ID2   qty  totalQty  remainingQty
----------------------------------------
A1    S1     5     15       10
A2    S1     5     15       10
A3    S2     10    10       0
-----------------------------------------

I know the first two rows in the table above are wrong. I need help with this.

CodePudding user response:

You first need to sum qty by ID2 in table1 before joining with table2, as follows:

from pyspark.sql import functions as F

result_df = table1.groupBy('ID2').agg(F.sum('qty').alias('qty')) \
    .join(table2, 'ID2') \
    .select(F.col('ID2'), (F.col('totalQty') - F.col('qty')).alias('qty'))

With your input table1 and table2, you get the following result_df:

 --- --- 
|ID2|qty|
 --- --- 
|S2 |0  |
|S1 |5  |
 --- --- 
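To see why the groupBy step matters, here is a plain-Python sketch of the same logic (sample data mirrors the question's tables; no Spark required). Joining without first aggregating pairs the S1 total of 15 with each individual row, which is why the original attempt showed a remainder of 10 twice. Summing per ID2 first gives one row per key, so the subtraction is done exactly once:

from collections import defaultdict

# Sample data mirroring the question's tables
table1 = [("A1", "S1", 5), ("A2", "S1", 5), ("A3", "S2", 10)]
table2 = {"S1": 15, "S2": 10}

# Step 1: sum qty per ID2 (the groupBy/agg step)
summed = defaultdict(int)
for _id, id2, qty in table1:
    summed[id2] += qty

# Step 2: subtract the summed qty from totalQty (the join + select step)
result = {id2: table2[id2] - q for id2, q in summed.items()}
print(result)  # {'S1': 5, 'S2': 0}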