How do I JOIN two datasets in Palantir Foundry within a code workbook?

Time:10-26

Hi, I know this is a basic question, but I'm new to Foundry and PySpark, so please help! I need to join two datasets in a Palantir Foundry Code Workbook on three columns. Two of the columns have the same name in both datasets, but the third is named differently in each. I'm not sure how to do this. Thank you for your help!

CodePudding user response:

According to the PySpark documentation, you can pass a list of column names as the "on" argument (the join keys), as long as every name exists in both DataFrames. If I were joining two datasets (df1 and df2), where df1 had join keys ["a", "b", "c"] and df2 had join keys ["a", "b", "c2"], I would rename the mismatched key first and then join, something like this:

df1.join(df2.withColumnRenamed("c2", "c"), on=["a", "b", "c"], how="left")
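For anyone who wants to try this end to end, here is a minimal sketch of the same idea wrapped in a function. The function names, the sample rows, and the extra columns left_val and right_val are made up for illustration. The local SparkSession in demo() is only there so the sketch can run outside Foundry; in a Code Workbook the two inputs should already arrive as DataFrames, so only the helper would be needed.

```python
def join_on_three_keys(df1, df2, how="left"):
    """Join df1 and df2 on a, b and c, where df2 names the third key c2."""
    # Rename df2's key so all three join keys have the same name on both sides.
    df2_renamed = df2.withColumnRenamed("c2", "c")
    # With a list of names as `on`, each key column appears once in the result.
    return df1.join(df2_renamed, on=["a", "b", "c"], how=how)


def demo():
    # Imported here so the helper above does not need a Spark runtime to load.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    # Hypothetical sample data: only the first df1 row has a match in df2.
    df1 = spark.createDataFrame(
        [(1, "x", 10, "left-1"), (2, "y", 20, "left-2")],
        ["a", "b", "c", "left_val"],
    )
    df2 = spark.createDataFrame(
        [(1, "x", 10, "right-1")],
        ["a", "b", "c2", "right_val"],
    )
    # Left join: the unmatched df1 row is kept with a null right_val.
    join_on_three_keys(df1, df2).show()
```

Renaming before the join means the result has a single c column rather than both c and c2, which is usually what you want when the two columns hold the same key.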

CodePudding user response:

As per the PySpark documentation that @kate referred to, the "on" argument accepts any one of the following:

  1. a string representing a column, which must exist on both tables
  2. a list of strings representing multiple columns, which again must exist on both tables
  3. a Column expression, which lets you put more complex logic in the join condition. For example, you may want to join two tables on the condition that the date column in table A falls between date_before and date_after in table B. This would look something like df_a.join(df_b, on=((df_a.date < df_b.date_after) & (df_a.date > df_b.date_before))), so you have a lot of flexibility in how you join datasets.
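The third option can be sketched as follows. The first function applies a Column expression to the original question (keys a, b, and c on one side, and a, b, and c2 on the other), so no rename is needed. The second is the date-range example from the list. Both function names are made up for illustration, and the column names are the ones assumed earlier in this thread.

```python
def join_on_differently_named_key(df1, df2, how="left"):
    """Join on a, b and c/c2 by comparing the key columns pairwise."""
    condition = (
        (df1["a"] == df2["a"])
        & (df1["b"] == df2["b"])
        & (df1["c"] == df2["c2"])  # the key that is named differently
    )
    return df1.join(df2, on=condition, how=how)


def join_within_date_range(df_a, df_b, how="inner"):
    """Keep pairs where df_a.date lies strictly between date_before and date_after."""
    condition = (df_a["date"] > df_b["date_before"]) & (df_a["date"] < df_b["date_after"])
    return df_a.join(df_b, on=condition, how=how)
```

One thing to keep in mind: when "on" is a Column expression, the result keeps the columns from both sides, so a and b appear twice in the first example. If that is a problem, either drop the duplicates afterwards or rename c2 and join on a list of names instead.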