So, I have the following two dataframes.
DF1 looks like this:
+------+-------+
| Name | Money |
+------+-------+
| A    |  50.3 |
| B    |  26.9 |
| C    |  11.4 |
| A    |  35.8 |
| B    |  59.2 |
| A    |  90.8 |
| C    |  23.5 |
| D    |  23.5 |
| D    |  54.6 |
| E    |  78.0 |
| A    |  12.3 |
| F    |  20.3 |
| A    |  57.1 |
+------+-------+
DF2 looks like this (a list of unique names):
+------+
| Name |
+------+
| A    |
| C    |
| D    |
+------+
What kind of join will give me the following (keeping only A, C and D, the names that appear in DF2)?
+------+-------+
| Name | Money |
+------+-------+
| A    |  50.3 |
| C    |  11.4 |
| A    |  35.8 |
| A    |  90.8 |
| C    |  23.5 |
| D    |  23.5 |
| D    |  54.6 |
| A    |  12.3 |
| A    |  57.1 |
+------+-------+
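For anyone who wants to reproduce this, a minimal PySpark sketch, assuming an existing SparkSession named spark (values copied from the tables above):

# Rebuild DF1 and DF2 from the literal values shown in the question.
df1 = spark.createDataFrame(
    [("A", 50.3), ("B", 26.9), ("C", 11.4), ("A", 35.8), ("B", 59.2),
     ("A", 90.8), ("C", 23.5), ("D", 23.5), ("D", 54.6), ("E", 78.0),
     ("A", 12.3), ("F", 20.3), ("A", 57.1)],
    ["Name", "Money"],
)
df2 = spark.createDataFrame([("A",), ("C",), ("D",)], ["Name"])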
CodePudding user response:
SELECT *
FROM df1
WHERE df1.Name IN (SELECT Name FROM df2)

Sorry if it's dirty, written on my phone.
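If you want to run that SQL from PySpark, a short sketch (assuming the two DataFrames from the question; the view names are just placeholders matching the query):

# Register both frames as temp views so the SQL above can run as-is.
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
result = spark.sql("SELECT * FROM df1 WHERE df1.Name IN (SELECT Name FROM df2)")
result.show()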
CodePudding user response:
df = df1.alias("t0").join(
df2.alias("t1"),
on=f.col("t0.Name") == f.col("t1.Name"),
how='left'
).select(
f.col("t0.Name"),
f.col("t0.Money")
)
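A left semi join is arguably an even closer fit here: it keeps only the df1 rows that have a match in df2 and never brings df2's columns into the result, so no aliasing or select is needed. A minimal sketch using the question's two frames:

# Semi join: filter df1 by existence of a matching Name in df2;
# the output schema is exactly df1's schema (Name, Money).
df = df1.join(df2, on="Name", how="leftsemi")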
CodePudding user response:
IMO a simple inner join will be fine here: it filters out the records from df1 whose names do not exist in df2, which is what you expect.
df1.join(df2, df1("Name") === df2("Name"), "inner")
In the title you asked about duplicates: duplicated records are going to stay there after an inner join; if you want to remove them, you can use distinct.
You can find the sample inner join above (inner is the default join type in Spark, so the third parameter could actually have been skipped).
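For completeness, a PySpark sketch of the same idea, relying on inner being the default join type and applying distinct afterwards as mentioned above:

# Join on the shared Name column; with a string `on`, the result keeps
# a single Name column. `how` is omitted because inner is the default.
# distinct() then drops fully duplicated rows, if that is what you want.
result = df1.join(df2, on="Name").distinct()
result.show()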