filter dataframe 1 column with other dataframe column pyspark

Time:08-20

I have dataframe1, which contains contracts, and dataframe2, which contains workers. I want to filter dataframe1 using a column from dataframe2. At first I tried filtering dataframe1 with a single string, and that works. This is the code:

contract_con = dataframe1.filter(dataframe1.name_of_column.contains('Entretien des espaces naturels'))

And this is what I tried in order to filter the same dataframe1 with a column of another dataframe, dataframe2, which contains 10 rows:

contract_con=dataframe1.filter(dataframe1.name_of_column.contains(dataframe2.name_of_column))
contract_con.show()

Any help, please?

CodePudding user response:

The solution is to make a list from dataframe1's column and use it to filter dataframe2. This is the code to build the list:

job_list=dataframe1.select("name_of_column").rdd.flatMap(lambda x: x).collect()
print(job_list)

And this is the code to filter with it:

from pyspark.sql.functions import col
contract_workers=dataframe2.filter(col("name_of_column_to_filter").isin(job_list))
contract_workers.show()

CodePudding user response:

Since it is a different dataframe, you cannot pass the column directly. You could use isin() after collecting dataframe2.name_of_column into a list, but the easiest way is just to do a join like this:

contract_con = dataframe1.join(dataframe2, "name_of_column", "inner")
contract_con.show()