I have two huge dataframes. Some of their columns share a name but are otherwise unrelated. I have two join keys, though, and I want to add just one column from data_right to data_left. I tried:
output_df = data_left.join(data_right, on=["join_key_1", "join_key_2"], how="left").select("data_left.*", "data_right.extraColumn")
But it does not recognize the *, even after importing it.
Sample:
data_left =
col_1 col_2 join_key_1 join_key_2
12 a a_b 1
14 c r_t 2
12 d v_b 1
24 r a_s 2
data_right =
col_3 col_4 join_key_1 join_key_2 extraColumn
12 a a_b 1 456
14 g r_t 2 654
15 e v_c 5 464
24 r a_s 2 546
12 d v_b 1 549
output_df =
col_1 col_2 join_key_1 join_key_2 extraColumn
12 a a_b 1 456
14 c r_t 2 654
12 d v_b 1 549
24 r a_s 2 546
If data_right has no matching pair of join keys, extraColumn should stay empty.
CodePudding user response:
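Would this work for your use case? Selecting only the join keys and extraColumn from data_right before the join means none of its other columns, including the ones whose names clash with data_left, reach the result: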
output_df = data_left.join(data_right.select("join_key_1", "join_key_2", "extraColumn"), on=["join_key_1", "join_key_2"], how="left")