Creating a dataframe by multiplying the columns of other 2 dataframes on Pyspark


So, I'm having a hard time with a tricky task in PySpark. I need to create a new dataframe from the data in the two dataframes below. The first one is called app_daily_users:

DATE        APP_1   APP_2   APP_3   APP_4   APP_5
2020-01-01  105190  1000    100140  230380  167456
2020-01-02  91170   5000    102103  228988  171698
2020-01-03  79110   4000    412130  215554  214412
2020-01-04  130859  4000    61660   331125  335510

The second one is called correction:

DATE        CORRECTION_INDEX
2020-01-01  0.458
2020-01-02  0.589
2020-01-03  0.988
2020-01-04  0.477

I need to multiply each column of the app_daily_users dataframe by the correction index from the correction dataframe. So in the end I'd have something like this (just a quick example):

DATE        APP_1   APP_2   APP_3   APP_4   APP_5
2020-01-01  48177   458     45864   105514  76694

The values in the columns above are the values from the first dataframe on Jan 1st 2020 times 0.458, which is the correction index for that day. Can you guys help me with this?

Thank you!

CodePudding user response:

Simply join on DATE, then use a list comprehension inside a select expression to apply the multiplication:

from pyspark.sql import functions as F

# Attach each day's correction index to its row via the join, then
# multiply every app column by it, keeping the original column names.
result = app_daily_users.join(correction, ["DATE"], "left").select(
    "DATE",
    *[(F.col(c) * F.col("CORRECTION_INDEX")).alias(c)
      for c in app_daily_users.columns if c != "DATE"]
)
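
Two optional refinements, in case they help (a sketch assuming the same dataframe and column names as above): since correction has only one row per date it should be small enough to broadcast, which avoids shuffling the larger dataframe, and you can round the products back to whole user counts:

from pyspark.sql import functions as F

# Variant: broadcast the small correction table and round each
# corrected value back to an integer user count.
result = app_daily_users.join(
    F.broadcast(correction), ["DATE"], "left"
).select(
    "DATE",
    *[F.round(F.col(c) * F.col("CORRECTION_INDEX")).cast("long").alias(c)
      for c in app_daily_users.columns if c != "DATE"]
)

Note that with a left join, any date missing from correction ends up with a null CORRECTION_INDEX, so its products are null too; switch to an inner join if you only want dates that have a correction index.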