I have 2 DataFrames with hundreds of columns. Df1 looks like this:

id | col1 | col2 | col3 | .....
 1 |  .2  |  .3  |  .3
 2 |  .1  |  .4  |  .2
....

Df2 looks like this, and only has 1 row of values:

col1 | col2 | col3 | .....
 .2  |  .3  |  .3

I'd like to divide each row of Df1 by Df2, so I should end up with something like this:

id | col1  | col2  | col3  | .....
 1 | .2/.2 | .3/.3 | .3/.3
 2 | .1/.2 | .4/.3 | .2/.3
How can I do this without explicitly specifying column names in a join, given that I have hundreds of columns? Thanks in advance!
CodePudding user response:
I took the values from df2's single row, zipped them with df1's column names, and then iterated over the zipped pairs, dividing each column by its matching value. Hope this helps. Here is the code snippet and the output I got.
from pyspark.sql.functions import col

df1 = spark.createDataFrame([('A', 2, 4), ('B', 6, 8), ('C', 10, 12)], ['col1', 'col2', 'col3'])
df2 = spark.createDataFrame([(2, 2)], ['div1', 'div2'])
df1.show()
df2.show()

# Take df2's single row, then divide each numeric column of df1
# (skipping the first, string column) by the matching divisor.
lr = df2.rdd.take(1)
for c, v in zip(df1.columns[1:], lr[0]):
    df1 = df1.withColumn(c, col(c) / v)
df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   2|   4|
|   B|   6|   8|
|   C|  10|  12|
+----+----+----+

+----+----+
|div1|div2|
+----+----+
|   2|   2|
+----+----+

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A| 1.0| 2.0|
|   B| 3.0| 4.0|
|   C| 5.0| 6.0|
+----+----+----+
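For the original layout (an id column plus hundreds of value columns), the same idea can also be written as a single select instead of a per-column withColumn loop, which keeps the query plan small when there are hundreds of columns. This is only a sketch, assuming df2 uses exactly the same column names as df1's value columns and a SparkSession named spark already exists:

from pyspark.sql.functions import col

# Hypothetical frames mirroring the question's layout: df1 has an id column,
# df2 holds one row with the divisor for every value column.
df1 = spark.createDataFrame([(1, .2, .3, .3), (2, .1, .4, .2)], ['id', 'col1', 'col2', 'col3'])
df2 = spark.createDataFrame([(.2, .3, .3)], ['col1', 'col2', 'col3'])

# Pull df2's single row into a dict keyed by column name, then build every
# division in one select; no join and no column has to be named by hand.
divisors = df2.first().asDict()
result = df1.select(
    'id',
    *[(col(c) / divisors[c]).alias(c) for c in df1.columns if c != 'id']
)
result.show()

Because the column list is driven by df1.columns, the id column is passed through untouched and every other column is divided by its matching value from df2.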