I have dataframe in which some column start with number and when I create new column with dict of all column and row value using below code. I am not geeting correct output.
Name | Date | 1Y | 2Y |
---|---|---|---|
XYZ | 2022-05-22 | 0.0898 | 0.9090 |
After running this code
df = df.withColumn("map", F.expr("map(" ",".join([f"'{c}', {c}" for c in df.columns]) ")"))
I am getting
Name | Date | 1Y | 2Y | map |
---|---|---|---|---|
XYZ | 2022-05-22 | 0.0898 | 0.9090 | {"Name":"XYZ","Date":2022-05-22","1Y":"1","2Y":"2"} |
but when I rename the columns which start with number like 1Y to Y1 and 2Y to Y2 I am geeting the correct output like below
Name | Date | Y1 | Y2 | map |
---|---|---|---|---|
XYZ | 2022-05-22 | 0.0898 | 0.9090 | {"Name":"XYZ","Date":2022-05-22","Y1":"0.0898","Y2":"0.9090"} |
is there any way to do this without renaming the column? I have many columns starting with numbers.
CodePudding user response:
Use backticks, in SQL, while providing columns starting with numbers or special characters.
data_sdf. \
withColumn("map", func.expr("map(" ", ".join([f"'{c}', `{c}`" for c in data_sdf.columns]) ")")). \
show(truncate=False)
# ---- ---------- ------ ----- ------------------------------------------------------------
# |name|date |1Y |2Y |map |
# ---- ---------- ------ ----- ------------------------------------------------------------
# |XYZ |2022-05-22|0.0898|0.909|{name -> XYZ, date -> 2022-05-22, 1Y -> 0.0898, 2Y -> 0.909}|
# ---- ---------- ------ ----- ------------------------------------------------------------
Without the backticks, the map expression would look like following
map('name', name, 'date', date, '1Y', 1Y, '2Y', 2Y)
Within an expr()
this is executed as SQL, and thus the 1Y
is considered as the number 1
.
Using backticks, the map expression looks like following
map('name', `name`, 'date', `date`, '1Y', `1Y`, '2Y', `2Y`)
Alternatively, you could also use the create_map
function.
data_sdf. \
withColumn('map',
func.create_map(*reduce(lambda x, y: x y,
[[func.lit(c), func.col(c)] for c in data_sdf.columns]
)
)
). \
show(truncate=False)
# ---- ---------- ------ ----- ------------------------------------------------------------
# |name|date |1Y |2Y |map |
# ---- ---------- ------ ----- ------------------------------------------------------------
# |XYZ |2022-05-22|0.0898|0.909|{name -> XYZ, date -> 2022-05-22, 1Y -> 0.0898, 2Y -> 0.909}|
# ---- ---------- ------ ----- ------------------------------------------------------------