I am trying to extract a value from a MapType column of a PySpark DataFrame inside a UDF.
Below is the PySpark DataFrame:
+-----------+------------+-------------+
|CUSTOMER_ID|       col_a|        col_b|
+-----------+------------+-------------+
|        100|{0.0 -> 1.0}| {0.2 -> 1.0}|
|        101|{0.0 -> 1.0}| {0.2 -> 1.0}|
|        102|{0.0 -> 1.0}| {0.2 -> 1.0}|
|        103|{0.0 -> 1.0}| {0.2 -> 1.0}|
|        104|{0.0 -> 1.0}| {0.2 -> 1.0}|
|        105|{0.0 -> 1.0}| {0.2 -> 1.0}|
+-----------+------------+-------------+
df.printSchema()
# root
#  |-- CUSTOMER_ID: integer (nullable = true)
#  |-- col_a: map (nullable = true)
#  |    |-- key: float
#  |    |-- value: float (valueContainsNull = true)
#  |-- col_b: map (nullable = true)
#  |    |-- key: float
#  |    |-- value: float (valueContainsNull = true)
Below is the UDF:
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.FloatType())
def test(col):
    return col[1]
Below is the code that calls it:
df_temp = df_temp.withColumn('test', test(F.col('col_a')))
The UDF returns null instead of the value from the col_a column. Can anyone explain why?
CodePudding user response:
It's because your map does not have anything at key 1. Inside the UDF, the map arrives as a Python dict, and `col[1]` looks up the key `1`; since the only key in your maps is `0.0`, the lookup misses and the UDF returns null. The same thing happens with a plain column lookup:
df_temp = spark.createDataFrame([(100,), (101,), (102,)], ['CUSTOMER_ID']) \
    .withColumn('col_a', F.create_map(F.lit(0.0), F.lit(1.0)))
df_temp.show()
# +-----------+------------+
# |CUSTOMER_ID|       col_a|
# +-----------+------------+
# |        100|{0.0 -> 1.0}|
# |        101|{0.0 -> 1.0}|
# |        102|{0.0 -> 1.0}|
# +-----------+------------+
df_temp = df_temp.withColumn('col_a_0', F.col('col_a')[0])
df_temp = df_temp.withColumn('col_a_1', F.col('col_a')[1])
df_temp.show()
# +-----------+------------+-------+-------+
# |CUSTOMER_ID|       col_a|col_a_0|col_a_1|
# +-----------+------------+-------+-------+
# |        100|{0.0 -> 1.0}|    1.0|   null|
# |        101|{0.0 -> 1.0}|    1.0|   null|
# |        102|{0.0 -> 1.0}|    1.0|   null|
# +-----------+------------+-------+-------+