If I have two dataframes:
df_1 =
my_id  col_1  col_2  col_3
10001      1      2      3
10002      6      1      3
10003      7      2      7
10004      9      3      4
10005      1      2      8
10006      1      9      5
df_2 =
my_new_id  col_4  col_5  col_final
    10005      1      2         10
    10006      6      1         10
And I want:
df_output =
my_id  col_1  col_2  col_3  col_final
10005      1      2      8         10
10006      1      9      5         10
I would do a left join with the smaller df_2 on the left. But do I have to manually list every column that I want in the output (in reality df_1 has dozens of columns)?
df_output = df_2.join(df_1.select("col_1", "col_2", "col_3"), df_2["my_new_id"] == df_1["my_id"], how="left").drop('my_new_id')
Also, my code above does not include col_final. How could I solve this correctly?
CodePudding user response:
You can use the asterisk with df_1:
df_output = (df_2
.join(df_1, df_2.my_new_id == df_1.my_id, 'left')
.select(df_1['*'], 'col_final')
)
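If you want to sanity-check the join logic locally without a Spark session, the same pattern (keep all of df_1's columns plus col_final) can be sketched with pandas. This is an analog of the Spark join above, not PySpark itself, and assumes pandas is installed:

```python
import pandas as pd

# The question's data, reproduced for a quick local check.
df_1 = pd.DataFrame({
    "my_id": [10001, 10002, 10003, 10004, 10005, 10006],
    "col_1": [1, 6, 7, 9, 1, 1],
    "col_2": [2, 1, 2, 3, 2, 9],
    "col_3": [3, 3, 7, 4, 8, 5],
})
df_2 = pd.DataFrame({
    "my_new_id": [10005, 10006],
    "col_4": [1, 6],
    "col_5": [2, 1],
    "col_final": [10, 10],
})

# Left-join with the smaller df_2 on the left, keeping every df_1 column
# plus col_final, without listing df_1's columns by hand.
df_output = (
    df_2[["my_new_id", "col_final"]]
    .merge(df_1, left_on="my_new_id", right_on="my_id", how="left")
    .drop(columns="my_new_id")
)
print(df_output)
```

Selecting only `["my_new_id", "col_final"]` from df_2 before the merge plays the same role as `select(df_1['*'], 'col_final')` in Spark: it keeps col_4 and col_5 out of the result.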
CodePudding user response:
You can try it like this:
from pyspark.sql.functions import col

df1_columns = ["my_id", "col_1", "col_2", "col_3"]
df2_columns = ["my_new_id", "col_4", "col_5", "col_final"]
df1_values = [
    (10001, 1, 2, 3),
    (10002, 6, 1, 3),
    (10003, 7, 2, 7),
    (10004, 9, 3, 4),
    (10005, 1, 2, 8),
    (10006, 1, 9, 5)
]
df2_values = [
    (10005, 1, 2, 10),
    (10006, 6, 1, 10)
]
df1 = spark.createDataFrame(data=df1_values, schema=df1_columns).alias("a")
df2 = spark.createDataFrame(data=df2_values, schema=df2_columns).alias("b")
df1.join(df2, col('a.my_id') == col('b.my_new_id'), "inner").select(col('a.*'), col('b.col_final')).show()