pyspark - How to define MapType for when/otherwise


I have a pyspark DataFrame with a MapType column that either contains map<string, int> values or is None.

I need to perform some calculations using collect_list. But collect_list excludes None values, so I am trying to find a workaround by transforming None to a placeholder value, similar to Include null values in collect_list in pyspark.
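
To illustrate the problem, here is a minimal sketch with a toy DataFrame (assuming an active spark session; the data is made up): collect_list silently drops the rows where the map is null.

from pyspark.sql import functions as F

demo = spark.createDataFrame([({"a": 1},), (None,)], ("map_values",))
# collect_list, like other aggregate functions, ignores nulls,
# so only the non-null map survives in the result: roughly [{a -> 1}]
demo.select(F.collect_list("map_values")).show(truncate=False)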

However, in my case I can't use df.na.fill with subset, since fillna only supports int, float, string, and bool datatypes; columns with other datatypes are ignored. The solution proposed in pyspark fillna is not working on column of ArrayType also doesn't work for my case, since I get a type mismatch error.
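
A minimal sketch of that limitation, reusing the toy demo DataFrame from the sketch above: the fill value is a string and the column is a map, so fillna skips the column and the null remains.

# na.fill only applies to columns whose type matches the fill value,
# so the MapType column is silently ignored and the null row is unchanged.
demo.na.fill("none", subset=["map_values"]).show()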

import pyspark.sql.functions as F
default_value = F.lit('none')
fill_rule = F.when(F.col('map_values').isNull(), default_value).otherwise(F.col('map_values'))
df.withColumn('map_new', fill_rule)

I understand I should change the default_value format, e.g. with .cast('map<string,int>'), but I don't know how to define it. I would like to end up with something like {'none': 1}.

I am new to pyspark, so maybe there is a more efficient way to achieve what I need.

CodePudding user response:

Spark does not provide an easy way to construct an empty map literal. However, from_json can be used to mimic the behavior.

from pyspark.sql import functions as F

df = spark.createDataFrame([({"key": 10}, ), (None, ), ], ("map_values", ))

default_value = F.from_json(F.lit("{}"), "map<string,int>")
fill_rule = F.when(F.col("map_values").isNull(), default_value).otherwise(F.col("map_values"))

df.withColumn("map_new", fill_rule).show()

Output

+-----------+-----------+
| map_values|    map_new|
+-----------+-----------+
|{key -> 10}|{key -> 10}|
|       null|         {}|
+-----------+-----------+
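
As a short follow-up sketch (reusing df and fill_rule from above): once the nulls are replaced with an empty map, collect_list keeps every row.

result = df.withColumn("map_new", fill_rule)
# Both rows are collected now, since map_new is never null: roughly [{key -> 10}, {}]
result.select(F.collect_list("map_new")).show(truncate=False)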

CodePudding user response:

We can create a MapType value using create_map with the required literals.

df = spark.createDataFrame([(1,2,{'abc':12}),(3,4,None)],['c1','c2','c3'])
df.show()
+---+---+-----------+
| c1| c2|         c3|
+---+---+-----------+
|  1|  2|[abc -> 12]|
|  3|  4|       null|
+---+---+-----------+

from pyspark.sql import functions as f

defaultval = f.create_map(f.lit('None'),f.lit(1))
df = df.withColumn('c3',f.when(f.col('c3').isNull(),defaultval).otherwise(f.col('c3')))
df.select(f.collect_list('c3')).show(truncate=False)
+--------------------------+
|collect_list(c3)          |
+--------------------------+
|[[abc -> 12], [None -> 1]]|
+--------------------------+