How to aggregate on the distinct count of a column in Spark SQL and add it as a new column?


spark.sql(f""" 
          INSERT INTO {databaseName}.{tableName} 
          SELECT 
              '{runDate}'
            , client_id
            , COUNT(DISTINCT client_id) AS distinct_count_client_id
          FROM df """) 

So say I have a column client_id with duplicate values, and I'm trying to add a column with the aggregated distinct count of the client IDs. How would I achieve that in PySpark? The code above doesn't work.
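For reference, here is a minimal sketch of the setup assumed above (the sample values are illustrative). Note that spark.sql can only reference df as a table once it has been registered as a temp view:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data with duplicate client_id values.
df = spark.createDataFrame(
    [("c1",), ("c1",), ("c2",), ("c3",), ("c3",)],
    ["client_id"],
)

# Register the DataFrame so spark.sql can refer to it as `df`.
df.createOrReplaceTempView("df")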

CodePudding user response:

Try this code (the GROUP BY and HAVING clauses belong after FROM, and the count needs an alias that doesn't collide with client_id):

spark.sql(f""" 
      INSERT INTO {databaseName}.{tableName} 
      SELECT 
          '{runDate}'
        , client_id
        , COUNT(*) AS client_id
        group by client_id HAVING COUNT(client_id) > 1
      FROM df """) 

CodePudding user response:

You can use the size and collect_set functions over a window spanning the whole table to implement a distinct count:

spark.sql(f""" 
          insert into {databaseName}.{tableName} 
          select 
              '{runDate}'
              ,client_id
              ,size(collect_set(client_id) over (partition by null)) as distinct_count_client_id
          from df
          """
)
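The DataFrame API equivalent uses an empty Window.partitionBy(); a minimal sketch, again assuming the df DataFrame from the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An empty partitionBy() puts all rows in one partition, so every row
# gets the distinct count of client_id across the whole table.
w = Window.partitionBy()
result = df.withColumn(
    "distinct_count_client_id",
    F.size(F.collect_set("client_id").over(w)),
)
result.show()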