Pyspark - replace values in column with dictionary


I want to avoid repeating the .when function 12 times, so I thought of using a dictionary. I don't know whether the problem is a limitation of the Spark function or a logic error on my part. Does the function allow this kind of chaining?

from pyspark.sql.functions import col, when

months = {'1': 'Jan', '2': 'Feb', '3': 'Mar', '4': 'Apr', '5': 'May', '6': 'Jun',
          '7': 'Jul', '8': 'Aug', '9': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}

for num, month in months.items():
    custoDF1 = custoDF.withColumn(
        "Month",
        when(col("Nummes") == num, month)
        .otherwise(month))

custoDF1.select(col('Nummes').alias('NumMonth'), 'month').distinct().orderBy("NumMonth").show(200)


CodePudding user response:

You can use the replace method of the DataFrame class:

import pyspark.sql.functions as F

months = {'1': 'Jan', '2': 'Feb', '3': 'Mar', '4': 'Apr', '5': 'May', '6': 'Jun',
          '7': 'Jul', '8': 'Aug', '9': 'Sep', '10':'Oct', '11': 'Nov', '12':'Dec'}

df = (df.withColumn('month', F.col('NumMonth').cast('string'))
        .replace(months, subset=['month']))

df.show()
+--------+-----+
|NumMonth|month|
+--------+-----+
|       1|  Jan|
|       2|  Feb|
|       3|  Mar|
|       4|  Apr|
|       5|  May|
|       6|  Jun|
|       7|  Jul|
|       8|  Aug|
|       9|  Sep|
|      10|  Oct|
|      11|  Nov|
|      12|  Dec|
+--------+-----+

Here I had to cast NumMonth to string because the keys in the months dictionary are strings. Alternatively, you can change the keys to integers and skip the cast.
