How do I count based on different rows conditions in PySpark?


I have the following Dataframe:

ID  Payment      Value  Date
1   Cash         200    2020-01-01
1   Credit Card  500    2020-01-06
2   Cash         300    2020-02-01
3   Credit Card  400    2020-02-02
3   Credit Card  500    2020-01-03
3   Cash         200    2020-01-04
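
For reference, a minimal sketch of building this sample DataFrame (assuming an existing SparkSession; dates kept as plain strings for simplicity):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [
        (1, 'Cash', 200, '2020-01-01'),
        (1, 'Credit Card', 500, '2020-01-06'),
        (2, 'Cash', 300, '2020-02-01'),
        (3, 'Credit Card', 400, '2020-02-02'),
        (3, 'Credit Card', 500, '2020-01-03'),
        (3, 'Cash', 200, '2020-01-04'),
    ],
    ['ID', 'Payment', 'Value', 'Date'],
)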

What I'd like to do is count how many IDs have used both Cash and Credit Card.

For example, in this case there are two IDs (1 and 3) that used both Cash and Credit Card.

How would I do that in PySpark?

CodePudding user response:

You can use collect_set to gather each ID's distinct payment methods and then check the size of that set.

from pyspark.sql import functions as F

(df
    .groupBy('ID')
    .agg(F.collect_set('Payment').alias('methods'))
    .withColumn('methods_size', F.size('methods'))
    .show()
)

# +---+-------------------+------------+
# | ID|            methods|methods_size|
# +---+-------------------+------------+
# |  1|[Credit Card, Cash]|           2|
# |  3|[Credit Card, Cash]|           2|
# |  2|             [Cash]|           1|
# +---+-------------------+------------+
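
Building on that result, the actual count of IDs that used both methods can be obtained by filtering the grouped rows. A minimal sketch (assuming 'Cash' and 'Credit Card' are the exact Payment values, as in the sample data):

from pyspark.sql import functions as F

both_ids = (df
    .groupBy('ID')
    .agg(F.collect_set('Payment').alias('methods'))
    # keep only IDs whose set of payment methods contains both values
    .filter(F.array_contains('methods', 'Cash')
            & F.array_contains('methods', 'Credit Card'))
)

print(both_ids.count())   # 2 for the sample data above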