Small files available in ADLS Gen2 location even after delta optimization

Time:04-19

I have a fairly large external Delta table whose data is stored in an ADLS Gen2 location. The table is partitioned by id and signal_date.

I run a Delta OPTIMIZE query on this table on a weekly basis. The query is shown below.

[Screenshot: the OPTIMIZE query and its per-partition run times; image not available]

For a few partitions, the optimization runs for more than 2 hours, as highlighted above.

For all of a week's partitions, the job runs for more than 48 hours, and I have to kill it partway through in order to proceed with the data load into this table (otherwise the load fails with a concurrent operation error on the partition).

Even after this optimization, I can still see many small files in the ADLS location for that partition, as shown below.

[Two screenshots: file listing of the partition folder in ADLS; images not available]

Is something going wrong during the optimization, given the very long execution time and the small files that are still present afterwards?

Any thoughts/points appreciated!

Thanks In Advance.

CodePudding user response:

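OPTIMIZE doesn't delete the small files when it finishes. It creates a new table version that refers to the new, bigger files and marks the old small files as removed in the transaction log. The files are physically deleted only when you run VACUUM, and only once they are older than the retention threshold (7 days by default).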
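A minimal sketch of that two-step cleanup, assuming a Spark session already configured for Delta Lake. The helper name `compact_and_clean`, the table name `my_db.signals`, and the date value are hypothetical; only the partition column `signal_date` comes from the question. Restricting OPTIMIZE to recent partitions with a partition predicate may also shorten the weekly run, though that depends on the data.

```python
def compact_and_clean(spark, table, since_date, retain_hours=168):
    """Compact recent partitions, then delete files the table no longer references.

    `spark` is any object with a `sql(str)` method (normally a SparkSession
    configured for Delta Lake). `table` and `since_date` are placeholders.
    """
    statements = [
        # Rewrite small files into larger ones, only in partitions
        # whose signal_date is on or after since_date.
        f"OPTIMIZE {table} WHERE signal_date >= '{since_date}'",
        # Physically delete unreferenced files older than the retention
        # window (168 hours = the 7-day default).
        f"VACUUM {table} RETAIN {int(retain_hours)} HOURS",
    ]
    for stmt in statements:
        spark.sql(stmt)
    return statements
```

For example, `compact_and_clean(spark, "my_db.signals", "2022-04-12")` would issue the two statements in order. Small files that OPTIMIZE marked as removed less than `retain_hours` ago stay in storage until a later VACUUM run.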

Here is an example of the transaction log entry written by an OPTIMIZE operation:

{"add":{"path":"part-00000-5046c283-8633-4fe2-8c99-ada26908e2d0-c000.snappy.parquet",
  "partitionValues":{},"size":1105,"modificationTime":1650281913487,
  "dataChange":false,"stats":"{\"numRecords\":200,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":99},\"nullCount\":{\"id\":0}}"}}
{"remove":{"path":"part-00002-57c4253a-5bdc-4d3e-9886-601fa793cdf6-c000.snappy.parquet",
  "deletionTimestamp":1650281912032,"dataChange":false,
  "extendedFileMetadata":true,"partitionValues":{},"size":536}}
...other remove entries...
{"commitInfo":{"timestamp":1650281913514,"operation":"OPTIMIZE",
  "operationParameters":{"predicate":"[\"true\"]"},"readVersion":1,
  "isolationLevel":"SnapshotIsolation","isBlindAppend":false,
  "operationMetrics":{"numRemovedFiles":"16","numRemovedBytes":"8640","p25FileSize":"1105","minFileSize":"1105","numAddedFiles":"1","maxFileSize":"1105","p75FileSize":"1105","p50FileSize":"1105","numAddedBytes":"1105"},
  "engineInfo":"Apache-Spark/3.2.1 Delta-Lake/1.2.0",
  "txnId":"31a68055-7932-4266-8290-9939af7e6a84"}}
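To check which files the table still references, as opposed to what is sitting in the storage folder, the add and remove actions in the log can be replayed. The following is a rough sketch, not Delta's own reader: the function name `live_files` is hypothetical, it assumes each action is one JSON object per line (the entries above are wrapped only for readability), and it ignores checkpoint files.

```python
import json

def live_files(log_lines):
    """Replay add/remove actions and return {path: size} for files
    the table still references after the last action."""
    live = {}
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "add" in action:
            # A file becomes part of the current table version.
            live[action["add"]["path"]] = action["add"]["size"]
        elif "remove" in action:
            # The file is dropped from the table version; it stays
            # on storage until VACUUM deletes it.
            live.pop(action["remove"]["path"], None)
        # Other actions (commitInfo, metaData, ...) do not affect the file set.
    return live
```

Feeding it the lines of every `_delta_log/*.json` commit in version order gives the set of live files. Any Parquet file in the partition folder that is not in that set is a leftover waiting for VACUUM.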