Merge a column into a PySpark dataframe without impacting the rows


I have two dataframes, df1 and df2, as shown below:

df1:

+----------+------+
|CustomerNo|Amount|
+----------+------+
|   10024.0| 194.0|
|   10024.0| 194.0|
|  100342.0| 66.15|
|   10058.0| 199.0|
|   10080.0| 206.6|
|   10080.0| 206.6|
|   10080.0| 206.6|
|   10080.0| 206.6|
|   10080.0|   0.0|
|   10080.0|156.75|
+----------+------+

df2:

+----------+---+
|CustomerNo|Qty|
+----------+---+
|   10024.0|  1|
|   10024.0|  1|
|  100342.0|  1|
|   10058.0|  1|
|   10080.0|  2|
|   10080.0|  3|
|   10080.0|  4|
|   10080.0|  5|
|   10080.0|  6|
|   10080.0|  7|
+----------+---+

I want to join these two dataframes so that the new dataframe has only CustomerNo, Amount, and Qty, keeping the same rows:

df = df1.join(df2, df1.CustomerNo==df2.CustomerNo,'inner')

When I use the above code, the number of rows increases. I would like the final output to be in this format, with just the Qty column appended (a sketch to reproduce the sample data follows the expected output):

+----------+------+---+
|CustomerNo|Amount|Qty|
+----------+------+---+
|   10024.0| 194.0|  1|
|   10024.0| 194.0|  1|
|  100342.0| 66.15|  1|
|   10058.0| 199.0|  1|
|   10080.0| 206.6|  2|
|   10080.0| 206.6|  3|
|   10080.0| 206.6|  4|
|   10080.0| 206.6|  5|
|   10080.0|   0.0|  6|
|   10080.0|156.75|  7|
+----------+------+---+
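
For reference, here is a minimal sketch that rebuilds the sample dataframes above so the row explosion is easy to reproduce. The column types (doubles for CustomerNo and Amount, an integer for Qty) are assumptions, not from the original post.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the tables above; types are assumed.
df1 = spark.createDataFrame(
    [(10024.0, 194.0), (10024.0, 194.0), (100342.0, 66.15), (10058.0, 199.0),
     (10080.0, 206.6), (10080.0, 206.6), (10080.0, 206.6), (10080.0, 206.6),
     (10080.0, 0.0), (10080.0, 156.75)],
    ["CustomerNo", "Amount"])

df2 = spark.createDataFrame(
    [(10024.0, 1), (10024.0, 1), (100342.0, 1), (10058.0, 1),
     (10080.0, 2), (10080.0, 3), (10080.0, 4), (10080.0, 5),
     (10080.0, 6), (10080.0, 7)],
    ["CustomerNo", "Qty"])

# The plain join matches every row pair per customer: 10080 has 6 rows
# on each side, so it alone yields 6 * 6 = 36 rows (42 in total instead of 10).
df1.join(df2, "CustomerNo", "inner").count()  # 42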

CodePudding user response:

You can do this, but you should be careful. I'm going to show you how to use a window, but windows are bad for performance, so only use them when you have to, not as a first choice. This should work, but I'm guessing Qty and Amount are correlated.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Rank rows within each customer so both sides get a positional key.
windowSpecQty = Window.partitionBy("CustomerNo").orderBy("Qty")
windowSpecAmount = Window.partitionBy("CustomerNo").orderBy("Amount")

# Each df1 row gets a per-customer row number, ordered by Amount.
rdf1 = df1.withColumn("row_number", row_number().over(windowSpecAmount))

# Each df2 row gets a per-customer row number, ordered by Qty.
rdf2 = df2.withColumn("row_number", row_number().over(windowSpecQty))

# Join on CustomerNo plus the row number so the match is one-to-one.
cond = [rdf1.CustomerNo == rdf2.CustomerNo, rdf1.row_number == rdf2.row_number]
df = rdf1.join(rdf2, cond, 'inner').select(rdf1['CustomerNo'], rdf1['Amount'], rdf2['Qty'])
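
As a quick check, a usage sketch continuing from the hypothetical df1/df2 built in the reproduction snippet above:

# (CustomerNo, row_number) is unique on each side, so the join is one-to-one
# and the result keeps the original 10 rows.
df.count()  # 10
df.orderBy("CustomerNo", "Qty").show()

Because df1 is ranked by Amount and df2 by Qty, the smallest Amount within a customer pairs with the smallest Qty; that ordering assumption is exactly the "Qty and Amount are correlated" caveat above.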