Column starting with a number in PySpark not providing correct output


I have a DataFrame in which some column names start with a number. When I create a new column holding a map of all column names and row values using the code below, I am not getting the correct output.

Name  Date        1Y      2Y
XYZ   2022-05-22  0.0898  0.9090

After running this code

from pyspark.sql import functions as F
df = df.withColumn("map", F.expr("map(" + ", ".join([f"'{c}', {c}" for c in df.columns]) + ")"))

I am getting

Name  Date        1Y      2Y      map
XYZ   2022-05-22  0.0898  0.9090  {"Name":"XYZ","Date":"2022-05-22","1Y":"1","2Y":"2"}

but when I rename the columns that start with a number, e.g. 1Y to Y1 and 2Y to Y2, I get the correct output, like below:

Name  Date        Y1      Y2      map
XYZ   2022-05-22  0.0898  0.9090  {"Name":"XYZ","Date":"2022-05-22","Y1":"0.0898","Y2":"0.9090"}

Is there any way to do this without renaming the columns? I have many columns starting with numbers.

CodePudding user response:

Use backticks in SQL when referencing column names that start with a number or contain special characters.

from pyspark.sql import functions as func

data_sdf. \
    withColumn("map", func.expr("map(" + ", ".join([f"'{c}', `{c}`" for c in data_sdf.columns]) + ")")). \
    show(truncate=False)

# +----+----------+------+-----+------------------------------------------------------------+
# |name|date      |1Y    |2Y   |map                                                         |
# +----+----------+------+-----+------------------------------------------------------------+
# |XYZ |2022-05-22|0.0898|0.909|{name -> XYZ, date -> 2022-05-22, 1Y -> 0.0898, 2Y -> 0.909}|
# +----+----------+------+-----+------------------------------------------------------------+

Without the backticks, the map expression would look like the following:

map('name', name, 'date', date, '1Y', 1Y, '2Y', 2Y)

Within expr() this is parsed as Spark SQL, so the bare 1Y is read as the numeric literal 1 (a trailing Y marks a TINYINT literal in Spark SQL) rather than as a column reference.
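You can verify this in isolation (a minimal check, assuming an active SparkSession named spark):

spark.sql("SELECT 1Y AS v").printSchema()
# root
#  |-- v: byte (nullable = false)

This is why the question's map ends up with "1Y": "1" and "2Y": "2".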

With backticks, the map expression looks like the following:

map('name', `name`, 'date', `date`, '1Y', `1Y`, '2Y', `2Y`)

Alternatively, you could use the create_map function from the DataFrame API, which avoids SQL parsing altogether.

from functools import reduce
from pyspark.sql import functions as func

data_sdf. \
    withColumn('map',
               func.create_map(*reduce(lambda x, y: x + y,
                                       [[func.lit(c), func.col(c)] for c in data_sdf.columns]
                                       )
                               )
               ). \
    show(truncate=False)

# +----+----------+------+-----+------------------------------------------------------------+
# |name|date      |1Y    |2Y   |map                                                         |
# +----+----------+------+-----+------------------------------------------------------------+
# |XYZ |2022-05-22|0.0898|0.909|{name -> XYZ, date -> 2022-05-22, 1Y -> 0.0898, 2Y -> 0.909}|
# +----+----------+------+-----+------------------------------------------------------------+
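The reduce call only flattens the list of [lit(name), column] pairs into one argument list; if you prefer, the same flattening can be done with itertools.chain (a sketch under the same assumptions as above):

from itertools import chain
from pyspark.sql import functions as func

# flatten (lit(name), column) pairs into create_map's *cols argument list
data_sdf. \
    withColumn('map',
               func.create_map(*chain.from_iterable(
                   (func.lit(c), func.col(c)) for c in data_sdf.columns))
               ). \
    show(truncate=False)

Both variants resolve columns by name through the DataFrame API, so no backticks are needed.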