Count of unique combinations of values in selected columns


I have a PySpark data frame that looks like this:

----------------------------
id    A    B    C 
id1   on   on   on
id1   on   off  on 
id1   on   on   on 
id1   on   on   on 
id1   on   on   off
-----------------------------

I am looking for a way to find all unique combinations of values in selected columns and show their counts. The expected output:

----------------------------
id    A    B    C    count
id1   on   on   on   3
id1   on   off  on   1
id1   on   on   off  1
-----------------------------

I see that there is a way to do a similar operation in Pandas, but I need PySpark.

UPD: Also, please note that a unique combination of columns A and B is not the same as a combination of A, B, C. I want counts for every possible combination of columns, not just one fixed set. Is there a way to achieve this other than grouping by one combination, counting, then repeating for the next combination, and so on? There are more than 10 columns.
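
For context, the per-combination approach I would like to avoid looks roughly like this (just a sketch, with the column subsets chosen as examples):

# One groupBy/count per fixed set of columns; with 10+ columns this means
# writing out every subset by hand.
df.groupBy('id', 'A', 'B', 'C').count()   # gives the expected output above
df.groupBy('id', 'A', 'B').count()        # a different combination needs its own call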

CodePudding user response:

cube can do it. However, it returns ALL grouping sets, including those where some columns were left out of the grouping (those columns show up as null), so you have to filter afterwards.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('id1', 'on', 'on', 'on'),
     ('id1', 'on', 'off', 'on'), 
     ('id1', 'on', 'on', 'on'), 
     ('id1', 'on', 'on', 'on'), 
     ('id1', 'on', 'on', 'off')],
    ['id', 'A', 'B', 'C'])

# cube over all columns: one row per grouping set, with its count
df = df.cube(df.columns).count()
# keep only the fully-grouped rows, i.e. those with no null in any column
df = df.filter(F.forall(F.array(df.columns), lambda x: x.isNotNull()))

df.show()
# +---+---+---+---+-----+
# | id|  A|  B|  C|count|
# +---+---+---+---+-----+
# |id1| on| on| on|    3|
# |id1| on|off| on|    1|
# |id1| on| on|off|    1|
# +---+---+---+---+-----+

This would count occurrences over just the specified columns (run against the original DataFrame, not the cubed result above):

cols = ['A', 'B']
df = df.cube(cols).count()
df = df.filter(F.forall(F.array(df.columns), lambda x: x.isNotNull()))

df.show()
# +---+---+-----+
# |  A|  B|count|
# +---+---+-----+
# | on|off|    1|
# | on| on|    4|
# +---+---+-----+
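
If you actually need the counts for every column subset in a single result (so that the A,B combinations and the A,B,C combinations both appear), one option, sketched below, is to keep the unfiltered cube output and tag each row with F.grouping_id(), which encodes which columns belong to each grouping set. This is only a sketch and assumes you start again from the original five-row df, before the reassignments above.

from pyspark.sql import functions as F

cols = ['A', 'B', 'C']

# cube keeps every grouping set; grouping_id() marks which columns were
# rolled up (bit = 1) versus actually grouped on (bit = 0), with the first
# column as the most significant bit.
all_combos = (df.cube(cols)
                .agg(F.count('*').alias('count'),
                     F.grouping_id().alias('grouping_set')))

# grouping_set == 0 -> combinations of A, B, C
# grouping_set == 1 -> combinations of A, B only (C rolled up), and so on.
all_combos.orderBy('grouping_set').show()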

CodePudding user response:

Solution:

df = spark.createDataFrame(
    [
        ("id1", "on","on","on"),  # create your data here, be consistent in the types.
        ("id1", "on","off","on"),
        ("id1", "on","on","on"),
        ("id1", "on","on","on"),
        ("id1", "on","on","off"),
    ],
    ["id", "A" , "B" , "C"]  # add your column names here
)

Apart from the cube function, Spark also provides the rollup function.

cube: takes a list of columns and applies aggregate expressions to every possible combination of the grouping columns.

rollup: similar to cube, but computes subtotals hierarchically from left to right; GROUP BY ROLLUP(...) only produces the grouping sets obtained by dropping columns from right to left, rather than every combination (see the sketch below).
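
A quick sketch of the difference on the A and B columns of this df (the comments list the grouping sets each call produces):

# cube('A', 'B')   -> grouping sets: (A, B), (A), (B), ()
# rollup('A', 'B') -> grouping sets: (A, B), (A), ()
df.cube('A', 'B').count().show()
df.rollup('A', 'B').count().show()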

from pyspark.sql import functions as F

# rollup over all columns, counting each grouping set
df = df.rollup(df.columns).count()

# drop the subtotal rows (those with nulls), keeping only the fully-grouped combinations
df1 = df.na.drop(subset=df.columns)

df1.show()

# +---+---+---+---+-----+
# | id|  A|  B|  C|count|
# +---+---+---+---+-----+
# |id1| on| on|off|    1|
# |id1| on| on| on|    3|
# |id1| on|off| on|    1|
# +---+---+---+---+-----+