Of the various approaches you've tried, e.g. df.select('column').distinct(), df.groupby('column').count(), etc., which is the most efficient way to extract the distinct values from a column?
CodePudding user response:
For larger datasets, groupby is an efficient method.
CodePudding user response:
It does not matter, as you can see in this excellent reference: https://www.waitingforcode.com/apache-spark-sql/distinct-vs-group-by-key-difference/read
This is because Apache Spark has a logical optimization rule called ReplaceDistinctWithAggregate that rewrites an expression using the DISTINCT keyword as an aggregation.
In the simple case of selecting the unique values of a column, DISTINCT and GROUP BY execute the same way, i.e. as an aggregation.