How would you create a function that checks if one row in a PySpark column(s) of a dataframe matches another row in the same column(s) of another Pysark dataframe? I want to make a new column that shows validation if that value in the record exists in the other dataframe. The dataframes don't have identical columns, except for the column(s) to join on. I'm new to PySpark. I know you have to use a withColum to create a new column that validates 'True' or 'False' for the differences. The problem is, that I'm using the subtract function to get differences of values on dataframes that don't have identical columns. So far, I have this:
def DoesValueMatch(df1, df2, column_name):
name=column_name
df1=df1.select(name)
df2=df2.select(name)
df3=df1.subtract(df1)
return df3.show()
Thank you in advance! Much help is appreciated.
CodePudding user response:
Let's say you have the following dataframes.
df1.show()
----- ------
| name|gender|
----- ------
| john| M|
| amy| F|
|sally| F|
| sam| M|
----- ------
df2.show()
----- ---
| name|age|
----- ---
| john| 12|
|tommy| 3|
|sally| 8|
| mary| 4|
----- ---
You can do the following to create a column indicating if the "name" column in df1 exists in df2.
from pyspark.sql import functions
df3 = (df1.join(df2, how="left", on="name")
.withColumn("name_in_df2",
functions.when(df2["age"].isNull(), False)
.otherwise(True)))
df3.show()
----- ------ ---- -----------
| name|gender| age|name_in_df2|
----- ------ ---- -----------
| john| M| 12| true|
| amy| F|null| false|
|sally| F| 8| true|
| sam| M|null| false|
----- ------ ---- -----------