Saving a gzip file as a table in Databricks


I would like to save a gzip file as a Hive table in Databricks using the PySpark commands below:

df = spark.read.csv(".../Papers.txt.gz", sep="\t")
df.write.saveAsTable("...")

The gzip file Papers.txt.gz weighs about 60 GB when unzipped (it is a large .txt file, actually taken from here), and the Spark cluster is fairly large (850 GB, 112 cores).

The problem is that it takes a very long time for this to be saved as a table (more than 20 minutes), which made me abort the operation out of fear that I would bring the cluster down.

The task seems pretty standard, but is there something I should be careful about here?

Thank you in advance.

CodePudding user response:

The problem is that gzip files aren't splittable (by default), so all processing of this file happens on a single machine, and the cluster size won't help much here.
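
You can confirm this from the notebook by checking how many partitions the DataFrame gets when read straight from the gzip file. A minimal sketch, reusing the elided path from the question as a placeholder:

# Reading the gzip-compressed file directly: gzip is not a splittable codec,
# so Spark places the entire file into a single input partition.
df = spark.read.csv(".../Papers.txt.gz", sep="\t")
print(df.rdd.getNumPartitions())  # typically 1 for a single .gz file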

If you can un-gzip the file and put it uncompressed onto DBFS, Spark will be able to read it in chunks and parallelize the processing. The decompression can be done directly in a Databricks notebook (it may take a while, but you can use a single-node cluster to avoid paying for a big one):

%sh
gzip -d /dbfs/path_to/Papers.txt.gz
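
Once decompressed, the read and the table write can run in parallel across the cluster. A minimal sketch, assuming the uncompressed file now sits at dbfs:/path_to/Papers.txt and using a placeholder table name (both are assumptions, not from the original post):

# Read the uncompressed, tab-separated file; Spark can now split it
# across many tasks instead of a single one.
df = spark.read.csv("dbfs:/path_to/Papers.txt", sep="\t")

# Optionally repartition so the write is spread across the cluster
# (112 matches the core count mentioned in the question).
df.repartition(112).write.mode("overwrite").saveAsTable("papers")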

P.S. You can read more about this issue in the following answer
