I have a df1 with column values as below
names
AB
DC
DE
FG
GG
TR
Another df2 as
date names
2022-11-01 AB
2022-11-01 DE
2011-11-01 FG
2022-11-02 DC
2022-11-02 GG
2022-11-02 TR
I want to check whether all values of df1's names column exist in df2's names column for each date, and record the result in a new column: true if they do, false otherwise.

I am able to do it for a single date by joining the dataframes with a flag column and checking the flag with when/otherwise, but I am not able to run this across many days.
CodePudding user response:
This should do the trick (note this is pandas, and it flags each row of df1 rather than each date):

df1["exists"] = df1["names"].isin(df2["names"].unique())
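If what you want is instead one true/false per date ("did every name in df1 appear on that date?"), here is a pandas sketch using groupby, assuming df1 and df2 hold the sample data from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"names": ["AB", "DC", "DE", "FG", "GG", "TR"]})
df2 = pd.DataFrame({
    "date":  ["2022-11-01", "2022-11-01", "2011-11-01",
              "2022-11-02", "2022-11-02", "2022-11-02"],
    "names": ["AB", "DE", "FG", "DC", "GG", "TR"],
})

# The full set of names that must all be present on a date.
required = set(df1["names"])

# For each date, check whether every required name appears on that date.
per_date = (
    df2.groupby("date")["names"]
       .agg(lambda s: required.issubset(set(s)))
       .rename("all_present")
       .reset_index()
)
print(per_date)
```

With the sample data no single date contains all six names, so every date comes out False; on real data a complete date would come out True.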
CodePudding user response:
import pyspark.sql.functions as F
df1 = spark.createDataFrame(
    [('AB',), ('DC',), ('DE',), ('FG',), ('GG',), ('TR',)]
).toDF("names")

df2 = spark.createDataFrame(
    [
        ('2022-11-01', 'AB'),
        ('2022-11-01', 'DE'),
        ('2011-11-01', 'FG'),
        ('2022-11-02', 'DC'),
        ('2022-11-02', 'GG'),
        ('2022-11-02', 'TR'),
        ('2022-11-02', 'ZZ'),
    ],
    ["date", "names"],
).withColumn('date', F.col('date').cast('date'))
# Marker column: every name present in df1 gets true.
df1 = df1.withColumn('bool', F.lit(True))
df2\
.join(df1, on='names', how='left')\
.fillna(False)\
.select('date', 'names', 'bool')\
.show()
# +----------+-----+-----+
# |      date|names| bool|
# +----------+-----+-----+
# |2022-11-01|   AB| true|
# |2022-11-02|   DC| true|
# |2022-11-01|   DE| true|
# |2011-11-01|   FG| true|
# |2022-11-02|   GG| true|
# |2022-11-02|   TR| true|
# |2022-11-02|   ZZ|false|
# +----------+-----+-----+