How does Spark calculate the number of records in a DataFrame?


I know that df.count() triggers a Spark action and returns the number of records in a DataFrame, but I want to know how this works internally: does Spark go through the whole DataFrame to count the records, or is there any other optimised technique, like storing the value in the DataFrame's metadata?

I am using pyspark 3.2.1.

CodePudding user response:

It appears that, under the hood, df.count() uses the Count aggregation class. I am basing this on the definition of the count method in Dataset.scala:

  /**
   * Returns the number of rows in the Dataset.
   * @group action
   * @since 1.6.0
   */
  def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0)
  }
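
In PySpark terms (my own illustration, not something shown in the answer), that definition amounts to an empty groupBy followed by count(), with the single result row collected back to the driver:

df = spark.range(5)  # any DataFrame will do; `spark` is the active SparkSession

# df.count() is effectively a global aggregation whose one result row
# is collected to the driver, mirroring the Scala definition above
row_count = df.groupBy().count().collect()[0][0]
assert row_count == df.count()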

"is there any other optimised technique, like storing the value in the DataFrame's metadata?"

It employs the same optimization strategies Catalyst applies to any other query: it builds a directed graph of expressions, evaluates them, and rolls the results up, so Spark still runs a job over the data. It does not store the count value as metadata; doing so would go against Spark's lazy evaluation principle.
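
Because nothing is memoized, every call to df.count() launches a fresh job over the data. If the count is needed repeatedly, caching the DataFrame avoids the rescans; here is a minimal sketch of that (the local session, the spark.range data, and the cache() call are my own illustration, not part of the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(1_000_000)

# Each count() below runs a full job over the data.
df.count()
df.count()

# cache() marks the DataFrame for in-memory storage; the next action materializes it,
# so later counts aggregate over the cached blocks instead of recomputing from the source.
df.cache()
df.count()
df.count()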

I ran an experiment and verified that df.count() and df.groupBy().count() produce the same physical plan.

import pandas as pd

# `spark` is the active SparkSession (e.g. the one created by the pyspark shell)
df = spark.createDataFrame(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]}))

# count using the DataFrame method
df.count()

# count using the Count aggregator (collect() triggers the job)
cnt_agg = df.groupBy().count()
cnt_agg.collect()

Both jobs produced the same Physical Plan:

== Physical Plan ==
AdaptiveSparkPlan (9)
+- == Final Plan ==
   * HashAggregate (6)
   +- ShuffleQueryStage (5), Statistics(sizeInBytes=64.0 B, rowCount=4, isRuntime=true)
      +- Exchange (4)
         +- * HashAggregate (3)
            +- * Project (2)
               +- * Scan ExistingRDD (1)
+- == Initial Plan ==
   HashAggregate (8)
   +- Exchange (7)
      +- HashAggregate (3)
         +- Project (2)
            +- Scan ExistingRDD (1)
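
If you want to reproduce the comparison yourself, the plan for the aggregation DataFrame can be printed directly with explain (the formatted mode below is just one option); the plan for df.count() itself shows up in the SQL tab of the Spark UI after the job runs, since count() returns a plain integer rather than a DataFrame.

cnt_agg.explain(mode="formatted")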