Creating a dataframe by multiplying the columns of other 2 dataframes on Pyspark


So, I'm having a hard time with a tricky task in PySpark. I need to create a new dataframe from the data in the two dataframes below. The first one is called app_daily_users:

DATE        APP_1   APP_2   APP_3   APP_4   APP_5
2020-01-01  105190  1000    100140  230380  167456
2020-01-02  91170   5000    102103  228988  171698
2020-01-03  79110   4000    412130  215554  214412
2020-01-04  130859  4000    61660   331125  335510

The second one is called correction:

DATE        CORRECTION_INDEX
2020-01-01  0.458
2020-01-02  0.589
2020-01-03  0.988
2020-01-04  0.477

I need to multiply each column of the app_daily_users dataframe by the correction index from the correction dataframe. So in the end I'd have something like this (just a quick example):

DATE        APP_1   APP_2   APP_3   APP_4   APP_5
2020-01-01  48177   458     45864   105514  76694

The values in the columns above are the values from the first dataframe on Jan 1st 2020 times 0.458, which is the correction index for that day. Can you guys help me with this?

Thank you!

CodePudding user response:

Simply join on DATE, then use a list comprehension inside a select expression to apply the multiplication:

from pyspark.sql import functions as F

# Attach each day's correction index to its row via the join, then
# multiply every app column by it, keeping the original column names.
result = app_daily_users.join(correction, ["DATE"], "left").select(
    "DATE",
    *[(F.col(c) * F.col("CORRECTION_INDEX")).alias(c)
      for c in app_daily_users.columns if c != "DATE"]
)
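
Two optional refinements, in case they help (a sketch assuming the same dataframe and column names as above): since correction has only one row per date it should be small enough to broadcast, which avoids shuffling the larger dataframe, and you can round the products back to whole user counts:

from pyspark.sql import functions as F

# Variant: broadcast the small correction table and round each
# corrected value back to an integer user count.
result = app_daily_users.join(
    F.broadcast(correction), ["DATE"], "left"
).select(
    "DATE",
    *[F.round(F.col(c) * F.col("CORRECTION_INDEX")).cast("long").alias(c)
      for c in app_daily_users.columns if c != "DATE"]
)

Note that with a left join, any date missing from correction ends up with a null CORRECTION_INDEX, so its products are null too; switch to an inner join if you only want dates that have a correction index.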