I have the following DataFrame:
| ID | Payment     | Value | Date       |
|----|-------------|-------|------------|
| 1  | Cash        | 200   | 2020-01-01 |
| 1  | Credit Card | 500   | 2020-01-06 |
| 2  | Cash        | 300   | 2020-02-01 |
| 3  | Credit Card | 400   | 2020-02-02 |
| 3  | Credit Card | 500   | 2020-01-03 |
| 3  | Cash        | 200   | 2020-01-04 |
What I'd like to do is count how many IDs have used both Cash and Credit Card.
In this example, there would be 2 IDs (1 and 3) that used both.
How would I do that in PySpark?
CodePudding user response:
You can use `collect_set` to gather the distinct payment methods each ID has used, then count them with `size`.
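For a self-contained run, here's a minimal sketch that builds the example DataFrame from the question's table (the `SparkSession` setup is assumed, not part of the original answer):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the question's table
df = spark.createDataFrame(
    [
        (1, 'Cash', 200, '2020-01-01'),
        (1, 'Credit Card', 500, '2020-01-06'),
        (2, 'Cash', 300, '2020-02-01'),
        (3, 'Credit Card', 400, '2020-02-02'),
        (3, 'Credit Card', 500, '2020-01-03'),
        (3, 'Cash', 200, '2020-01-04'),
    ],
    ['ID', 'Payment', 'Value', 'Date'],
)
```

With that in place, group by ID and collect the distinct methods: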
```python
from pyspark.sql import functions as F

(df
    .groupBy('ID')
    # collect_set gathers the distinct payment methods per ID
    .agg(F.collect_set('Payment').alias('methods'))
    # size counts the elements in that set
    .withColumn('methods_size', F.size('methods'))
    .show()
)
# +---+-------------------+------------+
# | ID|            methods|methods_size|
# +---+-------------------+------------+
# |  1|[Credit Card, Cash]|           2|
# |  3|[Credit Card, Cash]|           2|
# |  2|             [Cash]|           1|
# +---+-------------------+------------+
```
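
To get the actual number the question asks for, one sketch is to filter for IDs whose method set contains both values and count them (the string literals are assumed to match the question's data exactly):

```python
n_both = (df
    .groupBy('ID')
    .agg(F.collect_set('Payment').alias('methods'))
    # keep only IDs whose set contains both payment methods
    .filter(
        F.array_contains('methods', 'Cash')
        & F.array_contains('methods', 'Credit Card')
    )
    .count()
)
print(n_both)  # 2
```

Filtering on `array_contains` rather than `methods_size == 2` keeps the check correct even if more payment methods appear in the data later.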