I have dataframe1, which contains contracts, and dataframe2, which contains workers. I want to filter dataframe1 using a column from dataframe2. Filtering dataframe1 with a single string works; this is the code:
contract_con = dataframe1.filter(dataframe1.name_of_column.contains('Entretien des espaces naturels'))
And this is the code I tried in order to filter the same dataframe1 with a column of another dataframe2 that contains 10 rows:
contract_con=dataframe1.filter(dataframe1.name_of_column.contains(dataframe2.name_of_column))
contract_con.show()
Any help please?
CodePudding user response:
The solution is to build a list from dataframe1 and use it to filter dataframe2. This is the code to make the list:
job_list = dataframe1.select("name_of_column").rdd.flatMap(lambda x: x).collect()
print(job_list)
And this is the code to filter with it:
from pyspark.sql.functions import col
contract_workers = dataframe2.filter(col("name_of_column_to_filter").isin(job_list))
contract_workers.show()
CodePudding user response:
Since it is a different dataframe, you cannot pass the column directly. You could use isin() after collecting dataframe2.name_of_column into a list, but the easiest way is just to do a join like this:
contract_con = dataframe1.join(dataframe2, "name_of_column", "inner")
contract_con.show()