How can I make my Spark Accumulator statistics reliable in Azure Databricks?


I am using a Spark accumulator to collect statistics for each pipeline.

In a typical pipeline I would read a DataFrame:

df = spark.read.format("csv").option("header", "true").load("/mnt/prepared/orders")
df.count()  # ==> 7 rows
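
For context, here is a minimal sketch of how such a row-counting accumulator might be wired up (the accumulator name rows_read and the counting step are my assumptions; the original post does not show this part):

# Hypothetical wiring of the statistics accumulator (not shown in the question).
rows_read = spark.sparkContext.accumulator(0)

def count_row(row):
    # Runs once per row, per evaluation of the plan -- this is the crux of the problem.
    rows_read.add(1)
    return row

# Re-create a DataFrame whose evaluation increments the accumulator.
df = spark.createDataFrame(df.rdd.map(count_row), df.schema)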

Then I would write it to two different locations:

df.write.format("delta").save("/mnt/prepared/orders")
df.write.format("delta").save("/mnt/reporting/orders_current/")

Unfortunately my accumulator statistics get updated on each write operation: they report 14 rows read, while I have only read the input DataFrame once.

How can I make my accumulator properly reflect the number of rows that I actually read?

I am a newbie in Spark. I have checked several threads around the issue but did not find my answer: "Statistical accumulator in Python", "Spark Accumulator reset", "When are accumulators truly reliable?"

CodePudding user response:

The first rule: accumulators aren't 100% reliable. They can be updated multiple times, for example if tasks are restarted or retried.
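
Here is a minimal, self-contained sketch of that multiple-update behavior, reproducing the doubling you see (runnable in any PySpark session; the numbers are illustrative):

acc = spark.sparkContext.accumulator(0)

# The accumulator is updated inside a transformation, so it is
# incremented every time the lineage is (re-)executed.
rdd = spark.sparkContext.parallelize(range(7)).map(lambda x: (acc.add(1), x)[1])

rdd.count()        # first action:  acc.value == 7
rdd.count()        # second action: acc.value == 14 -- same data, counted twice
print(acc.value)   # 14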

In your case, reading once doesn't mean the data won't be read again. Spark is lazy: the read operation only obtains metadata, such as the schema (it may scan the data if you use inferSchema), but it doesn't actually load the rows into memory. Each subsequent action, including each of your two writes, re-executes the plan from the source, which is why your accumulator reaches 14 (7 rows × 2 writes). You can cache the DataFrame after reading it, but that works only for smaller data sets, and even then caching doesn't guarantee the data won't be evicted and need to be re-read.
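
A minimal sketch of that caching approach, reusing the paths from the question and keeping the eviction caveat above in mind:

df = spark.read.format("csv").option("header", "true").load("/mnt/prepared/orders")

# Pin the data in memory and materialize it once; the writes below then
# read from the cache instead of re-scanning the source (unless evicted).
df = df.cache()
rows_read = df.count()  # triggers exactly one read of the source: 7

df.write.format("delta").save("/mnt/prepared/orders")
df.write.format("delta").save("/mnt/reporting/orders_current/")

Note that rows_read here comes from the count() action itself rather than an accumulator, so it reflects exactly one read regardless of how many writes follow.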
