Check column of df1 exists in col of df2


I have a df1 with the column values below:

names
AB
DC
DE
FG
GG
TR

Another df2 as:

date                names
2022-11-01           AB
2022-11-01           DE
2011-11-01           FG
2022-11-02           DC
2022-11-02           GG
2022-11-02           TR

I want to check whether each value of df1's column exists in df2's names column; if yes, set true, else false, in a new column.

I am able to do it for a given single date using dataframes with a flag column, using when.otherwise to check the flag value. I am not able to run this across many days.

CodePudding user response:

This should do the trick:

 df1["exists"] = df1["names"].isin(df2["names"].unique())
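For reference, a minimal runnable sketch of the isin approach, using pandas frames built from the question's sample data:

```python
import pandas as pd

# Sample frames mirroring the question's data
df1 = pd.DataFrame({"names": ["AB", "DC", "DE", "FG", "GG", "TR"]})
df2 = pd.DataFrame({
    "date": ["2022-11-01", "2022-11-01", "2011-11-01",
             "2022-11-02", "2022-11-02", "2022-11-02"],
    "names": ["AB", "DE", "FG", "DC", "GG", "TR"],
})

# Flag each name in df1 by membership in df2's names column
df1["exists"] = df1["names"].isin(df2["names"].unique())

print(df1)
print(df1["exists"].all())  # True: every df1 name appears somewhere in df2
```

Note that this flags each df1 row individually; `df1["exists"].all()` then answers the "do all values exist" question in one boolean.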

CodePudding user response:

import pyspark.sql.functions as F

df1 = spark.createDataFrame(
    [
    ('AB',),('DC',),('DE',),('FG',),('GG',),('TR',)
    ]
).toDF("names")

df2 = spark.createDataFrame(
    [
    ('2022-11-01','AB'),
    ('2022-11-01','DE'),
    ('2011-11-01','FG'),
    ('2022-11-02','DC'),
    ('2022-11-02','GG'),
    ('2022-11-02','TR'),
    ('2022-11-02','ZZ'),
    ],
    ["date", "names"]
)\
    .withColumn('date', F.col('date').cast('date'))

# mark every df1 row with a literal true; rows of df2 with no match will get null after the join
df1 = df1.withColumn('bool', F.lit(True))

df2\
    .join(df1, on='names', how='left')\
    .fillna(False)\
    .select('date', 'names', 'bool')\
    .show()

# +----------+-----+-----+
# |      date|names| bool|
# +----------+-----+-----+
# |2022-11-01|   AB| true|
# |2022-11-02|   DC| true|
# |2022-11-01|   DE| true|
# |2011-11-01|   FG| true|
# |2022-11-02|   GG| true|
# |2022-11-02|   TR| true|
# |2022-11-02|   ZZ|false|
# +----------+-----+-----+
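Since the question mentions running the check across many days, here is a hedged pandas sketch of a per-date variant: for each date, report whether every name from df1 appears on that date. The `required` and `per_date` names are my own, not from the thread:

```python
import pandas as pd

df1 = pd.DataFrame({"names": ["AB", "DC", "DE", "FG", "GG", "TR"]})
df2 = pd.DataFrame({
    "date": ["2022-11-01", "2022-11-01", "2011-11-01",
             "2022-11-02", "2022-11-02", "2022-11-02"],
    "names": ["AB", "DE", "FG", "DC", "GG", "TR"],
})

required = set(df1["names"])

# For each date, True only if every name from df1 appears on that date
per_date = df2.groupby("date")["names"].apply(
    lambda s: required.issubset(set(s))
)
print(per_date)  # all False here: no single date contains all six names
```

With the sample data no single date carries all six names, so every date flags False; across all dates combined the set is complete, which is what the join-based answers check.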