Column start with number in Pyspark not providing Correct output-CodePudding

I have dataframe in which some column start with number and when I create new column with dict of all column and row value using below code. I am not geeting correct output.

Name	Date	1Y	2Y
XYZ	2022-05-22	0.0898	0.9090

After running this code

df = df.withColumn("map", F.expr("map("   ",".join([f"'{c}', {c}" for c in df.columns])   ")"))

I am getting

Name	Date	1Y	2Y	map
XYZ	2022-05-22	0.0898	0.9090	{"Name":"XYZ","Date":2022-05-22","1Y":"1","2Y":"2"}

but when I rename the columns which start with number like 1Y to Y1 and 2Y to Y2 I am geeting the correct output like below

Name	Date	Y1	Y2	map
XYZ	2022-05-22	0.0898	0.9090	{"Name":"XYZ","Date":2022-05-22","Y1":"0.0898","Y2":"0.9090"}

is there any way to do this without renaming the column? I have many columns starting with numbers.

CodePudding user response：

Use backticks, in SQL, while providing columns starting with numbers or special characters.

data_sdf. \
    withColumn("map", func.expr("map("   ", ".join([f"'{c}', `{c}`" for c in data_sdf.columns])   ")")). \
    show(truncate=False)

#  ---- ---------- ------ ----- ------------------------------------------------------------ 
# |name|date      |1Y    |2Y   |map                                                         |
#  ---- ---------- ------ ----- ------------------------------------------------------------ 
# |XYZ |2022-05-22|0.0898|0.909|{name -> XYZ, date -> 2022-05-22, 1Y -> 0.0898, 2Y -> 0.909}|
#  ---- ---------- ------ ----- ------------------------------------------------------------

Without the backticks, the map expression would look like following

map('name', name, 'date', date, '1Y', 1Y, '2Y', 2Y)

Within an expr() this is executed as SQL, and thus the 1Y is considered as the number 1.

Using backticks, the map expression looks like following

map('name', `name`, 'date', `date`, '1Y', `1Y`, '2Y', `2Y`)

Alternatively, you could also use the create_map function.

data_sdf. \
    withColumn('map', 
               func.create_map(*reduce(lambda x, y: x   y, 
                                       [[func.lit(c), func.col(c)] for c in data_sdf.columns]
                                       )
                               )
               ). \
    show(truncate=False)

#  ---- ---------- ------ ----- ------------------------------------------------------------ 
# |name|date      |1Y    |2Y   |map                                                         |
#  ---- ---------- ------ ----- ------------------------------------------------------------ 
# |XYZ |2022-05-22|0.0898|0.909|{name -> XYZ, date -> 2022-05-22, 1Y -> 0.0898, 2Y -> 0.909}|
#  ---- ---------- ------ ----- ------------------------------------------------------------