I have a pyspark DataFrame with a MapType column that either contains the map<string, int> format or is None.
I need to perform some calculations using collect_list
. But collect_list
excludes None
values and I am trying to find a workaround, by transforming None to string similar to Include null values in collect_list in pyspark
However, on my case I can't use df.na.fill
with subset, since fillna
only supports int, float, string, bool datatypes, columns with other datatypes are ignored. The solution proposed in pyspark fillna is not working on column of ArrayType somehow doesn't work for my case, since I get a type mismatch error.
import pyspark.sql.functions as F
default_value = F.lit('none')
fill_rule = F.when(F.col('map_values').isNull(), default_value).otherwise(F.col('map_values'))
df.withColumn('map_new', fill_rule)
I understand I should change the default_value
format, with .cast('map<string,int>')
, but I don't know how to define it.
I would like to get with something like {'none':1}.
I am new to pyspark, so maybe there is a more efficient way to achieve what I need.
CodePudding user response:
Spark does not provide for a easy way to construct an empty map. However, from_json
can be used to mimic the behavior.
from pyspark.sql import functions as F
df = spark.createDataFrame([({"key": 10}, ), (None, ), ], ("map_values", ))
default_value = F.from_json(F.lit("{}"), "map<string,int>")
fill_rule = F.when(F.col("map_values").isNull(), default_value).otherwise(F.col("map_values"))
df.withColumn("map_new", fill_rule).show()
Output
----------- -----------
| map_values| map_new|
----------- -----------
|{key -> 10}|{key -> 10}|
| null| {}|
----------- -----------
CodePudding user response:
we can create MapType
using create_map
of required literals.
df = spark.createDataFrame([(1,2,{'abc':12}),(3,4,None)],['c1','c2','c3'])
df.show()
--- --- -----------
| c1| c2| c3|
--- --- -----------
| 1| 2|[abc -> 12]|
| 3| 4| null|
--- --- -----------
from pyspark.sql import functions as f
defaultval = f.create_map(f.lit('None'),f.lit(1))
df = df.withColumn('c3',f.when(f.col('c3').isNull(),defaultval).otherwise(f.col('c3'))
df.select(f.collect_list('c3')).show(truncate=False)
--------------------------
|collect_list(c3) |
--------------------------
|[[abc -> 12], [None -> 1]]|
--------------------------