How to convert a Parquet file to a Delta file

I am trying to convert a Parquet file into a Delta file in the same directory.

When I change the directory, the files get created, but when I try to create the Delta file in the same directory it doesn't work.

The logs that get created include only commits.

{"commitInfo":{"timestamp":1639462569886,"userId":"1873721116118433","userName":"removed!!","operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[\"Buyer_Partner_Code\"]"},"notebook":{"notebookId":"3864076797603349"},"clusterId":"0713-055328-sonar10","readVersion":0,"isolationLevel":"SnapshotIsolation","isBlindAppend":true,"operationMetrics":{"numFiles":"0","numOutputBytes":"0","numOutputRows":"0"}}}

df1.write.format("delta").mode("append").save("/data/dbo/csm_currencyratetype/Buyer_Partner_Code=190935/")

CodePudding user response：

Delta uses the same .parquet files that you already have, but you first have to create a Delta table so that the Delta log and metadata exist. Once that is done, your directory is a Delta table and you can continue to append or update data using the delta format.

import io.delta.tables._

// Convert unpartitioned Parquet table at path '<path-to-table>'
val deltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`")

https://docs.delta.io/latest/delta-utility.html#convert-a-parquet-table-to-a-delta-table
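
The table in the question is partitioned by Buyer_Partner_Code, and for a partitioned table convertToDelta also needs the partition schema. Here is a minimal PySpark sketch of that variant. It assumes the table root is /data/dbo/csm_currencyratetype and that the partition column is an integer; the question confirms neither, so adjust both. parquet_identifier and convert_partitioned are hypothetical helper names, not part of Delta.

```python
def parquet_identifier(table_path: str) -> str:
    """Build the "parquet.`<path>`" identifier that convertToDelta expects."""
    # Drop a trailing slash so the identifier is stable; keep "/" itself intact.
    cleaned = table_path.rstrip("/") or "/"
    return f"parquet.`{cleaned}`"


def convert_partitioned(spark, table_path: str, partition_schema: str):
    """Convert a partitioned Parquet directory to a Delta table in place."""
    # Imported lazily so the helper above works without delta-spark installed.
    from delta.tables import DeltaTable

    # partition_schema is a DDL string such as "Buyer_Partner_Code INT".
    return DeltaTable.convertToDelta(
        spark, parquet_identifier(table_path), partition_schema
    )


# Assumed usage in a notebook where `spark` already exists:
# convert_partitioned(spark, "/data/dbo/csm_currencyratetype/",
#                     "Buyer_Partner_Code INT")
```

Back up the directory first, since the conversion writes a _delta_log into it.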

CodePudding user response：

I would register your Parquet data as a table. You can try to save it directly as Delta; if you register it as Parquet, you need to convert it in a second step. Please back up your data before that:

%sql
CREATE TABLE buyer USING [DELTA/PARQUET] OPTIONS (path
"/data/dbo/csm_currencyratetype/Buyer_Partner_Code=190935/");

and then use a simple SQL conversion:

%sql
CONVERT TO DELTA buyer;

Buyer_Partner_Code looks like a partition column, so I think the path to the table should rather be "/data/dbo/csm_currencyratetype/".
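
To illustrate that point, here is a small Python sketch that splits a Hive-style path into the table root and its partition values. It assumes every trailing directory of the form column=value is a partition directory; split_partition_path is a hypothetical helper name.

```python
def split_partition_path(path: str):
    """Return (table_root, {partition_column: value}) for a Hive-style path."""
    parts = path.rstrip("/").split("/")
    partitions = {}
    # Peel "column=value" directories off the end of the path.
    while parts and "=" in parts[-1]:
        column, value = parts.pop().split("=", 1)
        # Prepend so that outer partition columns come first.
        partitions = {column: value, **partitions}
    return "/".join(parts) or "/", partitions


root, partitions = split_partition_path(
    "/data/dbo/csm_currencyratetype/Buyer_Partner_Code=190935/"
)
print(root)        # /data/dbo/csm_currencyratetype
print(partitions)  # {'Buyer_Partner_Code': '190935'}
```

The first value is the path to register or convert; the second shows which columns belong in the partition schema.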

CodePudding user response：

To answer this question, we first need to understand the Delta file format in Databricks, so that it is clear why this issue happens.

When a user creates a Delta Lake table, that table’s transaction log is automatically created in the _delta_log subdirectory. As the user makes changes to that table, those changes are recorded as ordered, atomic commits in the transaction log. Each commit is written out as a JSON file, starting with 000000.json (on disk the names are zero-padded to 20 digits, so the first file is 00000000000000000000.json). Additional changes to the table generate subsequent JSON files in ascending numerical order, so that the next commit is written out as 000001.json, the following as 000002.json, and so on.
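
A minimal Python sketch of that naming scheme, assuming the 20-digit zero-padded file names that Delta Lake writes into _delta_log. commit_file_name and latest_version are hypothetical helper names, and the file list in the usage lines is made up for illustration.

```python
import re

# A commit file is a 20-digit version number followed by ".json".
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def commit_file_name(version: int) -> str:
    """Return the _delta_log file name for a given table version."""
    return f"{version:020d}.json"


def latest_version(log_file_names):
    """Return the highest committed version, or None if there are no commits."""
    versions = []
    for name in log_file_names:
        match = COMMIT_FILE.match(name)
        if match:  # Skip checkpoints, .crc files and anything else.
            versions.append(int(match.group(1)))
    return max(versions) if versions else None


names = [commit_file_name(0), commit_file_name(1), "_last_checkpoint"]
print(names[1])               # 00000000000000000001.json
print(latest_version(names))  # 1
```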

In your case, when you write back into the same directory, the existing Parquet files and the Delta write conflict. So write the Delta table to another directory, not to the same one.
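
A rough PySpark sketch of that approach: read the existing Parquet data and write it as Delta to a different location. The function name rewrite_as_delta and the target path in the usage comment are assumptions for illustration; `spark` is expected to be the session of the notebook.

```python
def rewrite_as_delta(spark, source_path: str, target_path: str, partition_column: str):
    """Read Parquet from source_path and write it as a Delta table to target_path."""
    # Refuse to write into the source directory, as advised above.
    if source_path.rstrip("/") == target_path.rstrip("/"):
        raise ValueError("target_path must differ from source_path")

    df = spark.read.parquet(source_path)
    # The default save mode fails if target_path already contains data.
    df.write.format("delta").partitionBy(partition_column).save(target_path)


# Assumed usage (the target directory name is hypothetical):
# rewrite_as_delta(spark,
#                  "/data/dbo/csm_currencyratetype/",
#                  "/data/dbo/csm_currencyratetype_delta/",
#                  "Buyer_Partner_Code")
```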