Our environment/versions: Hadoop 3.2.3, Hive 3.1.3, Spark 2.3.0.
Our internal table in Hive is defined as:
CREATE TABLE dw.CLIENT
(
    client_id          integer,
    client_abbrev      string,
    client_name        string,
    effective_start_ts timestamp,
    effective_end_ts   timestamp,
    active_flag        string,
    record_version     integer
)
STORED AS ORC TBLPROPERTIES ('transactional'='true');
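As a sanity check (a minimal sketch, assuming a standard Hive 3 ACID configuration; not part of the original report), the transactional setup can be confirmed from Hive itself:

DESCRIBE FORMATTED dw.CLIENT;  -- Table Parameters should include transactional=true
SHOW TBLPROPERTIES dw.CLIENT;  -- same properties in list form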
The external table is defined as:
CREATE EXTERNAL TABLE ClientProcess_21
(
    ClientId string, ClientDescription string, IsActive string, OldClientId string, NewClientId string, Description string,
    TinyName string, FinanceCode string, ParentClientId string, ClientStatus string, FSPortalClientId string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '.../client_extract_20220801.csv/'
TBLPROPERTIES ("skip.header.line.count"="1");
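To verify the CSV parses as expected (a hypothetical check, assuming the extract file is already at the table's LOCATION):

SELECT ClientId, ClientDescription, IsActive
FROM ClientProcess_21
LIMIT 5;  -- the header row should be skipped via skip.header.line.count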
I can select from both tables.
The internal table is empty. When I try joining them:
select
null, s.*
from ClientProcess_21 s
join dw.client t
on s.ClientId = t.client_id
Hive fails with:
SQL Error [3] [42000]: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Spark job failed during runtime. Please check stacktrace for the root cause.

Partial stack trace from the Hive log:

2022-08-01T18:53:39,012 INFO  [RPC-Handler-1] client.SparkClientImpl: Received result for 07a38056-5ba8-45e0-8783-397f25f398cb
2022-08-01T18:53:39,219 ERROR [HiveServer2-Background-Pool: Thread-1667] status.SparkJobMonitor: Job failed with java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
    at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.useUTCTimestamp(OrcFile.java:286)
    at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.<init>(OrcFile.java:113)
    at org.apache.hadoop.hive.ql.io.orc.OrcFile.writerOptions(OrcFile.java:317)
    at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getOptions(OrcOutputFormat.java:126)
    at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getHiveRecordWriter(OrcOutputFormat.java:184)
    at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getHiveRecordWriter(OrcOutputFormat.java:61)
    at org.apache.hadoop.hive.ql.exec.Utilities.createEmptyFile(Utilities.java:3458)
    at org.apache.hadoop.hive.ql.exec.Utilities.createDummyFileForEmptyPartition(Utilities.java:3489)
    at org.apache.hadoop.hive.ql.exec.Utilities.access$300(Utilities.java:222)
    at org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3433)
    at org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3370)
    at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:318)
    at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:241)
    at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:113)
    at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:359)
    at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:378)
    at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:343)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
    at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.useUTCTimestamp(OrcFile.java:286)
    at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.<init>(OrcFile.java:113)
    at org.apache.hadoop.hive.ql.io.orc.OrcFile.writerOptions(OrcFile.java:317)
    at org.apache.hadoop.hive.q
UPDATE: DML statements (e.g. UPDATE) on tables defined as ...stored as orc tblproperties ('transactional'='true'); are failing with:

2022-08-02 09:47:42 ERROR SparkJobMonitor:1250 - Job failed with java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
java.util.concurrent.ExecutionException: Exception thrown by job
..
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.222.108.202, executor 0): java.lang.RuntimeException: Error processing row: java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
    at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:149)
..
Caused by: java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
    at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.useUTCTimestamp(OrcFile.java:286)
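For illustration, this is the shape of statement that fails (the values are hypothetical, not from the original report; any write to the ACID ORC table goes through the same failing ORC writer path):

UPDATE dw.CLIENT
SET active_flag = 'N',
    record_version = record_version + 1  -- hypothetical values
WHERE client_id = 1;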
CodePudding user response:
I think this is related to data type conversion when joining: one join column is a string and the other is an int. Can you please try this:
select
null, s.*
from ClientProcess_21 s
join dw.client t
on s.ClientId = cast(t.client_id as string) -- cast it to string
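Alternatively (an equivalent sketch, not from the original answer), the string side can be cast instead, assuming ClientId only ever holds numeric values:

select
null, s.*
from ClientProcess_21 s
join dw.client t
on cast(s.ClientId as int) = t.client_id -- non-numeric ClientId values become NULL and drop out of the join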
CodePudding user response:
Resolved by copying the ORC jars from Hive to the Spark home. (Spark 2.3.0 ships older ORC jars that predate OrcFile$WriterOptions.useUTCTimestamp, which Hive 3.1.3 calls, hence the NoSuchMethodError.)

cp $HIVE_HOME/lib/orc* $SPARK_HOME/jars/
cp $HIVE_HOME/lib/hive-storage-api-2.7.0.jar $SPARK_HOME/jars/
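To confirm the copy took effect (paths assume a standard Hive 3.1.3 / Spark 2.3.0 layout; adjust for your install):

ls $HIVE_HOME/lib/orc*.jar                                             # the Hive-shipped ORC jars being copied
ls $SPARK_HOME/jars/orc*.jar $SPARK_HOME/jars/hive-storage-api*.jar    # should now show the Hive versions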