So, I'm having a hard time with a tricky task in PySpark. I need to create a new dataframe from the data in the two dataframes below. The first one is called app_daily_users:
DATE | APP_1 | APP_2 | APP_3 | APP_4 | APP_5 |
---|---|---|---|---|---|
2020-01-01 | 105190 | 1000 | 100140 | 230380 | 167456 |
2020-01-02 | 91170 | 5000 | 102103 | 228988 | 171698 |
2020-01-03 | 79110 | 4000 | 412130 | 215554 | 214412 |
2020-01-04 | 130859 | 4000 | 61660 | 331125 | 335510 |
The second one is called correction:
DATE | CORRECTION_INDEX |
---|---|
2020-01-01 | 0.458 |
2020-01-02 | 0.589 |
2020-01-03 | 0.988 |
2020-01-04 | 0.477 |
I need to multiply each column of the "app daily users" dataframe by the correction index from the "correction" dataframe for the same date. In the end I'd have something like this (just a quick example):
DATE | APP_1 | APP_2 | APP_3 | APP_4 | APP_5 |
---|---|---|---|---|---|
2020-01-01 | 48177 | 458 | 45864 | 105514 | 76695 |
The values in the columns above are the values from the first dataframe on Jan 1st 2020 multiplied by 0.458, which is the correction index for that day. Can you guys help me with this?
Thank you!
CodePudding user response:
Simply join the two dataframes on DATE, then use a list comprehension inside a select expression to apply the multiplication to every app column:
```python
from pyspark.sql import functions as F

# Left-join the correction index onto the daily users, then multiply
# every non-DATE column by CORRECTION_INDEX, keeping the column names.
result = app_daily_users.join(correction, ["DATE"], "left").select(
    "DATE",
    *[(F.col(c) * F.col("CORRECTION_INDEX")).alias(c)
      for c in app_daily_users.columns if c != "DATE"]
)
```