spark.sql(f"""
INSERT INTO {databaseName}.{tableName}
SELECT
'{runDate}'
, client_id
, COUNT(DISTINCT client_id) AS distinct_count_client_id
FROM df """)
Say I have a client_id column with duplicate values, and I want a column with the aggregated distinct count of the client ids. How would I achieve that in PySpark? The code above doesn't work.
CodePudding user response:
Try this code:
spark.sql(f"""
    INSERT INTO {databaseName}.{tableName}
    SELECT
        '{runDate}'
        , client_id
        , COUNT(*) AS client_id_count  -- number of rows per client_id
    FROM df
    GROUP BY client_id
    HAVING COUNT(client_id) > 1
""")
CodePudding user response:
You can use the size and collect_set functions to implement a distinct count.
spark.sql(f"""
    insert into {databaseName}.{tableName}
    select
        '{runDate}'
        , client_id
        -- window over the whole dataset: every row gets the total distinct count
        , size(collect_set(client_id) over (partition by null)) as distinct_count_client_id
    from df
""")
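For a quick cross-check, here is a minimal sketch (the SparkSession setup and the sample values are hypothetical, not from the question) that computes the same distinct count with countDistinct from the DataFrame API:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with duplicate client ids
df = spark.createDataFrame(
    [("a",), ("a",), ("b",), ("c",), ("c",)],
    ["client_id"],
)
df.createOrReplaceTempView("df")

# Distinct count over the whole dataset: 3 ("a", "b", "c")
df.select(
    F.countDistinct("client_id").alias("distinct_count_client_id")
).show()
```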