Divide 2 PySpark DataFrames Based on Column Names


I have two DataFrames with hundreds of columns. Df1 looks like this:

id | col1 | col2 | col3 | ..... 
1     .2     .3     .3
2     .1     .4     .2
....

Df2 looks like this, and has only one row of values:

col1 | col2 | col3 | ..... 
.2     .3     .3

I'd like to divide each row of Df1 by Df2, so I should end up with something like this:

id | col1 | col2 | col3 | ..... 
1   .2/.2  .3/.3  .3/.3
2   .1/.2  .4/.3  .2/.3

How can I do this without explicitly listing the column names in a join, given that I have hundreds of columns? Thanks in advance!

CodePudding user response:

I took the single row of values from df2 and zipped it with df1's column names. Then I iterated over the pairs, dividing each column by its corresponding value. Hope this helps. Here is the code snippet and the output I got.

from pyspark.sql.functions import col

df1 = spark.createDataFrame([('A', 2, 4), ('B', 6, 8), ('C', 10, 12)], ['col1', 'col2', 'col3'])
df2 = spark.createDataFrame([(2, 2)], ['div1', 'div2'])
df1.show()
df2.show()

# Take df2's single row and pair each divisor with a df1 column,
# skipping the first column (the row label/id).
lr = df2.rdd.take(1)
for c, v in zip(df1.columns[1:], lr[0]):
    df1 = df1.withColumn(c, col(c) / v)
df1.show()

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   2|   4|
|   B|   6|   8|
|   C|  10|  12|
+----+----+----+

+----+----+
|div1|div2|
+----+----+
|   2|   2|
+----+----+

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A| 1.0| 2.0|
|   B| 3.0| 4.0|
|   C| 5.0| 6.0|
+----+----+----+