I have following pyspark dataframe
----------- ------------------
|search_term|keyword_match_type|
----------- ------------------
| st1| EXACT|
| st1| DSA|
| st2| DSA|
| st2| DSA|
| st2| DSA|
----------- ------------------
I want to make a new column in this dataframe which will look like this
----------- ------------------ -----------------
|search_term|keyword_match_type|exact_match_exist|
----------- ------------------ -----------------
| st1| EXACT| True|
| st1| DSA| True|
| st2| DSA| False|
| st2| DSA| False|
| st2| DSA| False|
----------- ------------------ -----------------
So if at least one EXACT
exist in group st1
in col keyword_match_type
, whole group st1
in a new column should be filled with true
.
the below code works fine but it does not do it over groups
df.withColumn("exact_match_exist", when(col('keyword_match_type').isin("EXACT"), lit('True')).otherwise(lit('False'))).show()
But when I try
w = Window().partitionBy("search_term")
df.withColumn("exact_match_exist", when(col('keyword_match_type').over(w).isin("EXACT"), lit('True')).otherwise(lit('False'))).show()
I'm getting
AnalysisException: Expression 'keyword_match_type#25116' not supported within a window function.;
Any idea how to overcome this ??
CodePudding user response:
In Spark 3, you can use any
function :
from pyspark.sql import Window, functions as F
w = Window.partitionBy("search_term")
df1 = df.withColumn(
"exact_match_exist",
F.expr("any(keyword_match_type = 'EXACT')").over(w)
)
df1.show()
# ----------- ------------------ -----------------
# |search_term|keyword_match_type|exact_match_exist|
# ----------- ------------------ -----------------
# | st1| EXACT| true|
# | st1| DSA| true|
# | st2| DSA| false|
# | st2| DSA| false|
# | st2| DSA| false|
# ----------- ------------------ -----------------
Or using conditional count like this:
df1 = df.withColumn(
"exact_match_exist",
F.count(F.when(F.col("keyword_match_type") == "EXACT", 1)).over(w) > 0
)
CodePudding user response:
Another way, array and
win=Window.partitionBy('search_term')
df.withColumn('exact_match_exist',array_contains(collect_list('keyword_match_type').over(win),'EXACT')).show()
----------- ------------------ -----------------
|search_term|keyword_match_type|exact_match_exist|
----------- ------------------ -----------------
| st1| EXACT| true|
| st1| DSA| true|
| st2| DSA| false|
| st2| DSA| false|
| st2| DSA| false|
----------- ------------------ -----------------