How to subtract all column values of two PySpark DataFrames?


Hi, I ran into a case where I need to subtract all the column values between two PySpark DataFrames, like this:

df1:

col1 col2 ... col100
 1    2   ...  100

df2:

col1 col2 ... col100
 5    4   ...   20

And I want to get the final DataFrame with df1 - df2:

new df:

col1 col2  ... col100
-4     -2  ...   80

I found that a possible solution is to subtract two columns like this:

new_df = df1.withColumn('col1', df1['col1'] - df2['col1'])

But I have 101 columns. How can I simply traverse them all and avoid writing 101 copies of the same logic? Any answers are much appreciated!

For 101 columns, how can I simply traverse every column and subtract its values?

CodePudding user response:

You can use a for loop to iterate over the columns and replace each column in the DataFrame with the subtracted values. Here's one way to do it in PySpark:

columns = df1.columns

for col in columns:
    df1 = df1.withColumn(col, df1[col] - df2[col])

This reassigns df1 to a new DataFrame with the subtracted values in each column.
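One caveat: Spark will normally refuse to mix columns from two unrelated DataFrames in a single expression (it raises an AnalysisException), so in practice you join the frames on a key first so the rows line up. Here is a minimal runnable sketch of the loop approach; the shared id key column is a hypothetical addition, since the original frames show none:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy frames with a hypothetical "id" key (an assumption; the original
# frames show no key) so the rows of df1 and df2 can be matched up.
df1 = spark.createDataFrame([(1, 1, 2, 100)], ["id", "col1", "col2", "col100"])
df2 = spark.createDataFrame([(1, 5, 4, 20)], ["id", "col1", "col2", "col100"])

value_cols = [c for c in df1.columns if c != "id"]

# Rename df2's value columns so the names don't collide after the join.
df2_r = df2.select("id", *[df2[c].alias(c + "_r") for c in value_cols])
joined = df1.join(df2_r, on="id")

# Overwrite each value column with the element-wise difference.
for c in value_cols:
    joined = joined.withColumn(c, joined[c] - joined[c + "_r"])

joined.select("id", *value_cols).show()
# +---+----+----+------+
# | id|col1|col2|col100|
# +---+----+----+------+
# |  1|  -4|  -2|    80|
# +---+----+----+------+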

CodePudding user response:

Alternatively, within a single select with a Python list comprehension:

columns = df1.columns

df1 = df1.select(*[(df1[col] - df2[col]).alias(col) for col in columns])
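The same row-alignment caveat applies here: the two frames need a join before their columns can meet in one select. A self-contained sketch, again assuming a hypothetical shared id key that the original frames don't show:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same hypothetical setup as above: an "id" key column (an assumption)
# lets the rows of the two frames be matched via a join.
df1 = spark.createDataFrame([(1, 1, 2, 100)], ["id", "col1", "col2", "col100"])
df2 = spark.createDataFrame([(1, 5, 4, 20)], ["id", "col1", "col2", "col100"])

value_cols = [c for c in df1.columns if c != "id"]
df2_r = df2.select("id", *[df2[c].alias(c + "_r") for c in value_cols])
joined = df1.join(df2_r, on="id")

# One-shot select: build every subtracted column in a single projection.
result = joined.select(
    "id",
    *[(joined[c] - joined[c + "_r"]).alias(c) for c in value_cols],
)
result.show()
# +---+----+----+------+
# | id|col1|col2|col100|
# +---+----+----+------+
# |  1|  -4|  -2|    80|
# +---+----+----+------+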