If I have two dataframes:
df_1 =
my_id  col_1  col_2  col_3
10001      1      2      3
10002      6      1      3
10003      7      2      7
10004      9      3      4
10005      1      2      8
10006      1      9      5
df_2 =
my_new_id  col_4  col_5  col_final
    10005      1      2         10
    10006      6      1         10
And I want:
df_output =
my_id  col_1  col_2  col_3  col_final
10005      1      2      8         10
10006      1      9      5         10
I would do a left join with the smaller df_2 on the left. But do I have to manually list every column that I want in the output (in reality df_1 has dozens of columns)?
df_output = df_2.join(df_1.select("col_1", "col_2", "col_3"), df_2["my_new_id"] == df_1["my_id"], how="left").drop('my_new_id')
Also, my code above does not include col_final. How could I solve this correctly?
CodePudding user response:
You can use the asterisk with df_1:
df_output = (df_2
.join(df_1, df_2.my_new_id == df_1.my_id, 'left')
.select(df_1['*'], 'col_final')
)
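If you want to sanity-check the join logic locally without a Spark session, the same pattern (keep all of df_1's columns plus col_final) can be sketched with pandas. This is an analog of the Spark join above, not PySpark itself, and assumes pandas is installed:

```python
import pandas as pd

# The question's data, reproduced for a quick local check.
df_1 = pd.DataFrame({
    "my_id": [10001, 10002, 10003, 10004, 10005, 10006],
    "col_1": [1, 6, 7, 9, 1, 1],
    "col_2": [2, 1, 2, 3, 2, 9],
    "col_3": [3, 3, 7, 4, 8, 5],
})
df_2 = pd.DataFrame({
    "my_new_id": [10005, 10006],
    "col_4": [1, 6],
    "col_5": [2, 1],
    "col_final": [10, 10],
})

# Left-join with the smaller df_2 on the left, keeping every df_1 column
# plus col_final, without listing df_1's columns by hand.
df_output = (
    df_2[["my_new_id", "col_final"]]
    .merge(df_1, left_on="my_new_id", right_on="my_id", how="left")
    .drop(columns="my_new_id")
)
print(df_output)
```

Selecting only `["my_new_id", "col_final"]` from df_2 before the merge plays the same role as `select(df_1['*'], 'col_final')` in Spark: it keeps col_4 and col_5 out of the result.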
CodePudding user response:
You can try it like this:
from pyspark.sql.functions import col

df1_columns = ["my_id", "col_1", "col_2", "col_3"]
df2_columns = ["my_new_id", "col_4", "col_5", "col_final"]
df1_values = [
    (10001, 1, 2, 3),
    (10002, 6, 1, 3),
    (10003, 7, 2, 7),
    (10004, 9, 3, 4),
    (10005, 1, 2, 8),
    (10006, 1, 9, 5)
]
df2_values = [
    (10005, 1, 2, 10),
    (10006, 6, 1, 10)
]
df1 = spark.createDataFrame(data=df1_values, schema=df1_columns).alias("a")
df2 = spark.createDataFrame(data=df2_values, schema=df2_columns).alias("b")
df1.join(df2, col('a.my_id') == col('b.my_new_id'), "inner").select(col('a.*'), col('b.col_final')).show()