I have a DataFrame with two rows: the number of users in the first and their conversion rate in the second, like the table shown here:
df.show()
+-------+---------+---------+---------+
|  month|company_1|company_2|company_3|
+-------+---------+---------+---------+
|02-2022|     1000|     5000|      500|
|02-2022|     0.08|     0.13|     0.45|
+-------+---------+---------+---------+
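In case it helps, here is a minimal snippet to reproduce the DataFrame (I give all values as doubles so each column keeps a single numeric type):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Counts and rates share the same columns, so both rows are stored as doubles
df = spark.createDataFrame(
    [("02-2022", 1000.0, 5000.0, 500.0),
     ("02-2022", 0.08, 0.13, 0.45)],
    ["month", "company_1", "company_2", "company_3"],
)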
I need to multiply the users by the conversion rate, so that I get the number of clients who bought something from each company in each month, like the table shown here:
df.show()
+-------+---------+---------+---------+
|  month|company_1|company_2|company_3|
+-------+---------+---------+---------+
|02-2022|       80|      650|      225|
+-------+---------+---------+---------+
I don't know how to do that. Could you help me, please?
Thanks!
CodePudding user response:
Use groupBy with F.product (available from PySpark 3.1) here:
Docs: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.product.html
from pyspark.sql import functions as F

# Every column except the grouping key
cols = [col for col in df.columns if col != "month"]

df.groupBy("month").agg(*[F.product(col).alias(col) for col in cols]).show()
+-------+---------+---------+---------+
|  month|company_1|company_2|company_3|
+-------+---------+---------+---------+
|02-2022|     80.0|    650.0|    225.0|
+-------+---------+---------+---------+
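Since these are counts of people, you may want whole numbers instead of 80.0. A minimal variation, assuming a plain cast is acceptable for your data:
# Cast each product back to a whole number
df.groupBy("month").agg(
    *[F.product(col).cast("long").alias(col) for col in cols]
).show()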
For older versions of PySpark, you can get the same result with the higher-order functions available from Spark 2.4: collect_list gathers each column's values per month, and aggregate folds them into a product.
cols = [col for col in df.columns if col != "month"]

# aggregate starts from 1.0 and multiplies in each collected value;
# the final lambda `acc -> acc` just returns the accumulator unchanged
out = df.groupBy("month").agg(*[F.expr(f"""aggregate(collect_list({col}),
        cast(1 as double), (acc, value) -> acc * value,
        acc -> acc) as {col}""") for col in cols])
out.show()
+-------+---------+---------+---------+
|  month|company_1|company_2|company_3|
+-------+---------+---------+---------+
|02-2022|     80.0|    650.0|    225.0|
+-------+---------+---------+---------+
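One caveat for both approaches: the product runs over every row a month has, so they assume exactly two rows per month (the user count and the rate). If a month ever carries extra rows, those values get multiplied in as well.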