Home > Back-end >  How to multiply the rows on a same column in a pyspark dataframe based on date/month?
How to multiply the rows on a same column in a pyspark dataframe based on date/month?

Time:04-01

So basically I have a dataframe with two rows: the number of users on the first and the conversion rate for them on the second, like the table shown here:

df.show()
 --------- ---------- ---------- ----------                                        
| month   | company_1| company_2| company_3|
 --------- ---------- ---------- ---------- 
| 02-2022 | 1000     | 5000     | 500      |
| 02-2022 | 0.08     | 0.13     | 0.45     |
 --------- ---------- ---------- ---------- 

I need to multiply users by conversion rate, so I can have the number of clients who bought something in that company each month, like the table shown here:

df.show()
 --------- ---------- ---------- ----------                                        
| month   | company_1| company_2| company_3|
 --------- ---------- ---------- ---------- 
| 02-2022 | 80       | 650      | 225      |
 --------- ---------- ---------- ---------- 

I don't know how I can do that, could you guys help me please?

Thanks!

CodePudding user response:

Use groupby with product here:

Docs: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.product.html

df.groupBy("month").agg(*[F.product(col).alias(col) for col in cols]).show()

 ------- --------- --------- --------- 
|  month|company_1|company_2|company_3|
 ------- --------- --------- --------- 
|02-2022|     80.0|    650.0|    225.0|
 ------- --------- --------- --------- 

For older versions of pyspark, you can do something similar to this using higher order functions available from spark 2.4

cols = [col for col in df.columns if col!= "month"]
out = df.groupBy("month").agg(*[F.expr(f"""aggregate(collect_list({col}),
                                cast(1 as double),(value, acc) -> value * acc, 
                                acc -> acc) as {col}""") for col in cols])

out.show()
 ------- --------- --------- --------- 
|  month|company_1|company_2|company_3|
 ------- --------- --------- --------- 
|02-2022|     80.0|    650.0|    225.0|
 ------- --------- --------- --------- 
  • Related