Home > Blockchain >  pyspark dataframe filling group with exact value
pyspark dataframe filling group with exact value

Time:04-07

I have following pyspark dataframe

 ----------- ------------------ 
|search_term|keyword_match_type|
 ----------- ------------------ 
|        st1|             EXACT|
|        st1|               DSA|
|        st2|               DSA|
|        st2|               DSA|
|        st2|               DSA|
 ----------- ------------------ 

I want to make a new column in this dataframe which will look like this

 ----------- ------------------ ----------------- 
|search_term|keyword_match_type|exact_match_exist|
 ----------- ------------------ ----------------- 
|        st1|             EXACT|             True|
|        st1|               DSA|             True|
|        st2|               DSA|            False|
|        st2|               DSA|            False|
|        st2|               DSA|            False|
 ----------- ------------------ ----------------- 

So if at least one EXACT exist in group st1 in col keyword_match_type, whole group st1 in a new column should be filled with true.

the below code works fine but it does not do it over groups

df.withColumn("exact_match_exist", when(col('keyword_match_type').isin("EXACT"), lit('True')).otherwise(lit('False'))).show()

But when I try

w = Window().partitionBy("search_term")
df.withColumn("exact_match_exist", when(col('keyword_match_type').over(w).isin("EXACT"), lit('True')).otherwise(lit('False'))).show()

I'm getting

AnalysisException: Expression 'keyword_match_type#25116' not supported within a window function.;

Any idea how to overcome this ??

CodePudding user response:

In Spark 3, you can use any function :

from pyspark.sql import Window, functions as F

w = Window.partitionBy("search_term")

df1 = df.withColumn(
    "exact_match_exist",
    F.expr("any(keyword_match_type = 'EXACT')").over(w)
)

df1.show()
#  ----------- ------------------ ----------------- 
# |search_term|keyword_match_type|exact_match_exist|
#  ----------- ------------------ ----------------- 
# |        st1|             EXACT|             true|
# |        st1|               DSA|             true|
# |        st2|               DSA|            false|
# |        st2|               DSA|            false|
# |        st2|               DSA|            false|
#  ----------- ------------------ ----------------- 

Or using conditional count like this:

df1 = df.withColumn(
    "exact_match_exist",
    F.count(F.when(F.col("keyword_match_type") == "EXACT", 1)).over(w) > 0
)

CodePudding user response:

Another way, array and

win=Window.partitionBy('search_term')
df.withColumn('exact_match_exist',array_contains(collect_list('keyword_match_type').over(win),'EXACT')).show()

 ----------- ------------------ ----------------- 
|search_term|keyword_match_type|exact_match_exist|
 ----------- ------------------ ----------------- 
|        st1|             EXACT|             true|
|        st1|               DSA|             true|
|        st2|               DSA|            false|
|        st2|               DSA|            false|
|        st2|               DSA|            false|
 ----------- ------------------ ----------------- 
  • Related