Optimal method to check the length of a Parquet table in DBFS with PySpark?


I have a table in DBFS that I can read with PySpark, but all I need is its length (number of rows). I know I could just read the file and call table.count(), but that takes some time.
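For context, this is roughly what I do today (the path is a placeholder for the real location):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Full scan just to get the number of rows -- this is the slow step I'd like to avoid.
    # "dbfs:/path/to/table" stands in for the actual table location.
    df = spark.read.parquet("dbfs:/path/to/table")
    print(df.count())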

Is there a better way to solve this?

CodePudding user response:

I am afraid not.

Since you are using DBFS, I assume you are using the Delta format with Databricks. In theory you could check the metastore, but:

The metastore is not the source of truth about the latest information of a Delta table

https://docs.delta.io/latest/delta-batch.html#control-data-location
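If the table happens to be registered in the metastore, you can read a row count from the catalog statistics, but that figure only reflects the last time statistics were collected, which is exactly why the metastore cannot be trusted for the latest state. A rough sketch, assuming a registered table named my_db.my_table (placeholder):

    # Collecting statistics scans the data anyway, so this does not save time;
    # it only shows where a (possibly stale) row count would live.
    spark.sql("ANALYZE TABLE my_db.my_table COMPUTE STATISTICS")

    # The 'Statistics' row of DESCRIBE EXTENDED reports something like "1234 bytes, 56 rows".
    for row in spark.sql("DESCRIBE EXTENDED my_db.my_table").collect():
        if row.col_name == "Statistics":
            print(row.data_type)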
