How do I work around the 5GB S3 copy limit with PySpark/Hive?


I am trying to run a Spark SQL job against an EMR cluster. My create table operation selects many columns, and I'm getting an S3 error:

 The specified copy source is larger than the maximum allowable size for a copy source: 5368709120

Is there a Hive/Spark/PySpark setting that keeps the _temporary files under that 5GB threshold when writing to S3?

This is working: (only 1 column)

create table as select b.column1 from table a left outer join verysmalltable b on ...

This is not working: (many columns)

create table as select b.column1, a.* from table a left outer join verysmalltable b on ...

In both cases, select statements alone work. (see below)

Working:

select b.column1 from table a left outer join verysmalltable b on ...

select b.column1, a.* from table a left outer join verysmalltable b on ...

I'm wondering if this is memory related, but I'm not sure. If it were a memory issue, I would expect to hit a memory error before a copy error, and I would also expect the multi-column select statement on its own to fail.

Only when create table is called do I run into the S3 error. I don't have the option of not using S3 for saving tables, and I was wondering if there is a way around this issue. The 5GB limit seems to be a hard limit. If anyone has any information about what I can do on the Hive/Spark end, it would be greatly appreciated.

I'm wondering if there is a specific setting that can be included in the spark-defaults.conf file to limit the size of temporary files.

Extra information: the _temporary file is 4.5 GB after the error occurs.

CodePudding user response:

In the past few months, something changed in how S3 handles the parameter

fs.s3a.multipart.threshold

This setting needs to be under 5GB for queries of a certain size to work. I previously had it set to a large value in order to save larger files, but apparently the behavior has changed.
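
For reference, a minimal sketch of how this can be set cluster-wide in spark-defaults.conf, assuming the s3a connector is what writes the output (the spark.hadoop. prefix forwards the property into the Hadoop configuration); the byte values here are only illustrative:

# Keep single-request uploads/copies well under the 5GB limit (~2GB here).
spark.hadoop.fs.s3a.multipart.threshold 2147483647
# Size of each part once the threshold is crossed (128MB).
spark.hadoop.fs.s3a.multipart.size 134217728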

The default value for this setting is 2GB. In the Spark documentation, the documented default varies depending on the Hadoop version being used.
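
If you'd rather not edit spark-defaults.conf, the same properties can be supplied per job. A hedged PySpark sketch, where the app name, table names, and join keys are made up for illustration:

from pyspark.sql import SparkSession

# Forward the s3a settings through Spark's spark.hadoop.* prefix; values are illustrative.
spark = (
    SparkSession.builder
    .appName("ctas-to-s3")
    .config("spark.hadoop.fs.s3a.multipart.threshold", "2147483647")
    .config("spark.hadoop.fs.s3a.multipart.size", "134217728")
    .enableHiveSupport()
    .getOrCreate()
)

# A create-table-as-select similar in shape to the failing query above.
spark.sql("""
    create table result_table as
    select b.column1, a.*
    from table_a a
    left outer join verysmalltable b
      on a.id = b.id
""")

Depending on how the Hadoop filesystem client caches its configuration, these settings are most reliable when supplied at session creation rather than changed mid-job.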
