I have two dataframes, df1 and df2, as shown below.
df1:
+----------+------+
|CustomerNo|Amount|
+----------+------+
|   10024.0| 194.0|
|   10024.0| 194.0|
|  100342.0| 66.15|
|   10058.0| 199.0|
|   10080.0| 206.6|
|   10080.0| 206.6|
|   10080.0| 206.6|
|   10080.0| 206.6|
|   10080.0|   0.0|
|   10080.0|156.75|
+----------+------+
df2:
+----------+---+
|CustomerNo|Qty|
+----------+---+
|   10024.0|  1|
|   10024.0|  1|
|  100342.0|  1|
|   10058.0|  1|
|   10080.0|  2|
|   10080.0|  3|
|   10080.0|  4|
|   10080.0|  5|
|   10080.0|  6|
|   10080.0|  7|
+----------+---+
I want to join these two dataframes so that the new dataframe has only CustomerNo, Amount, and Qty, with the same number of rows.
df = df1.join(df2, df1.CustomerNo==df2.CustomerNo,'inner')
When I use the above code, the number of rows increases. I would like the final output to be in the format below, with just the Qty column appended.
+----------+------+---+
|CustomerNo|Amount|Qty|
+----------+------+---+
|   10024.0| 194.0|  1|
|   10024.0| 194.0|  1|
|  100342.0| 66.15|  1|
|   10058.0| 199.0|  1|
|   10080.0| 206.6|  2|
|   10080.0| 206.6|  3|
|   10080.0| 206.6|  4|
|   10080.0| 206.6|  5|
|   10080.0|   0.0|  6|
|   10080.0|156.75|  7|
+----------+------+---+
CodePudding user response:
You can do this, but be careful. I'm going to show you how to use a window, but windows are bad for performance, so only use them when you have to, not as a first choice. Your plain join multiplies rows because CustomerNo is not unique in either dataframe, so every matching row on the left is paired with every matching row on the right (customer 10080 alone produces 6 x 6 = 36 rows). The trick below gives each row a per-customer row number and joins on that as well, so each row pairs with exactly one row. This should work, but note that it lines up Amount with Qty by sort order within each customer, so it only reproduces your desired output exactly if the two columns are correlated.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Number the rows within each CustomerNo, ordering df1 by Amount and df2 by Qty
windowSpecQty = Window.partitionBy("CustomerNo").orderBy("Qty")
windowSpecAmount = Window.partitionBy("CustomerNo").orderBy("Amount")

rdf1 = df1.withColumn("row_number", row_number().over(windowSpecAmount))  # per-customer row number
rdf2 = df2.withColumn("row_number", row_number().over(windowSpecQty))     # per-customer row number

# Join on both CustomerNo and the row number so each row matches exactly one row
cond = [rdf1.CustomerNo == rdf2.CustomerNo, rdf1.row_number == rdf2.row_number]
df = rdf1.join(rdf2, cond, 'inner').select(rdf1['CustomerNo'], rdf1['Amount'], rdf2['Qty'])
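If you want to try this end to end, here is a minimal, self-contained sketch that rebuilds the two sample dataframes from the question and applies the same window join. The SparkSession setup and the createDataFrame calls are scaffolding I've added for the example, not part of the original code.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.appName("row-number-join").getOrCreate()  # assumed setup for the example

# Sample data copied from the question
df1 = spark.createDataFrame(
    [(10024.0, 194.0), (10024.0, 194.0), (100342.0, 66.15), (10058.0, 199.0),
     (10080.0, 206.6), (10080.0, 206.6), (10080.0, 206.6), (10080.0, 206.6),
     (10080.0, 0.0), (10080.0, 156.75)],
    ["CustomerNo", "Amount"])
df2 = spark.createDataFrame(
    [(10024.0, 1), (10024.0, 1), (100342.0, 1), (10058.0, 1),
     (10080.0, 2), (10080.0, 3), (10080.0, 4), (10080.0, 5),
     (10080.0, 6), (10080.0, 7)],
    ["CustomerNo", "Qty"])

# Same window join as above
w_amount = Window.partitionBy("CustomerNo").orderBy("Amount")
w_qty = Window.partitionBy("CustomerNo").orderBy("Qty")
rdf1 = df1.withColumn("row_number", row_number().over(w_amount))
rdf2 = df2.withColumn("row_number", row_number().over(w_qty))
cond = [rdf1.CustomerNo == rdf2.CustomerNo, rdf1.row_number == rdf2.row_number]
result = rdf1.join(rdf2, cond, "inner").select(rdf1["CustomerNo"], rdf1["Amount"], rdf2["Qty"])

# The result keeps 10 rows (one per input row); the plain inner join on CustomerNo
# would return 42 rows, because customer 10080 alone matches 6 x 6 = 36 times.
# Note: with this sample data the ascending-Amount / ascending-Qty pairing for 10080
# differs from the table in the question, since Amount and Qty aren't perfectly correlated.
result.orderBy("CustomerNo", "Qty").show()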