Empty Spark dataset reading Hive table


There are 2 Hive tables created using the same sources and same logic, but with slightly different queries:

Table 1 query is:

create table test.table1 stored as orc as
    select
        f1,
        mc.f2 as f2,
        mc.f3 as f3,
        f4
    from src.test_table lateral view explode(multiple_field) mcTable as mc
    union all
    select
        f1,
        f5 as f2,
        f6 as f3,
        f4
    from src.test_table
    where multiple_field is null or size(multiple_field) < 1
;

Next, the Table 2 query - the same logic, shortened using LATERAL VIEW OUTER EXPLODE:

create table test.table2 stored as orc as
    select
        f1,
        if(mc is null, f5, mc.f2) as f2,
        if(mc is null, f6, mc.f3) as f3,
        f4
    from src.test_table lateral view outer explode(multiple_field) mcTable as mc
;

Both tables were created successfully and contain the same number of rows with identical data (checked via the Hive Beeline client). Then I try to read the tables' data with Spark:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hive-enabled Spark session
SparkSession sparkSession = SparkSession
                .builder().config("hive.execution.engine","mr")
                .appName("OrcExportJob")
                .enableHiveSupport()
                .getOrCreate();

String hql = "select * from test.table1"; // or test.table2
Dataset<Row> sqlDF = sparkSession.sql(hql);

With test.table2 it's OK - sqlDF contains all the data. Reading test.table1 leads to a different result - sqlDF contains no data at all (0 rows). The Spark logs show no errors - just as if the table were really empty.

I've heard Spark has some problems reading transactional or partitioned Hive tables, but neither is the case here.

Digging around, I found that Hive stores the ORC files for my tables in different ways:

/
├─ user/
│  ├─ hive/
│  │  ├─ warehouse/
│  │  │  ├─ test.db/
│  │  │  │  ├─ table1/
│  │  │  │  │  ├─ 1/
│  │  │  │  │  │  ├─ 1/
│  │  │  │  │  │  │  ├─ 000000_0
│  │  │  │  │  ├─ 2/
│  │  │  │  │  │  ├─ 000000_0
│  │  │  │  │  │  ├─ 000001_0
│  │  │  │  │  │  ├─ 000002_0
│  │  │  │  │  │  ├─ 000003_0
│  │  │  │  ├─ table2/
│  │  │  │  │  ├─ 000000_0
│  │  │  │  │  ├─ 000001_0
│  │  │  │  │  ├─ 000002_0
│  │  │  │  │  ├─ 000003_0

Could someone help me figure out the reason Spark doesn't see Table 1 data?

Why does Hive keep 5 files with a complex directory structure for Table 1, but only 4 files with a flat structure for Table 2?

Could it somehow affect the Spark reading process?

P.S. Hive version is 2.3.3, Spark version is 2.4.4

CodePudding user response:

Normally, data files are located directly inside the table location, without subdirectories.

UNION ALL is being optimized (most probably you are using Tez), and each branch runs in parallel and independently as separate mapper tasks. This requires a separate subdirectory to be created for each query in the UNION ALL so that the results of each branch can be written simultaneously - this is why you have two subdirectories.

These settings allow Hive to read subdirectories:

set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;

There is an issue, SPARK-26663 - Cannot query a Hive table with subdirectories - which was closed as Cannot Reproduce because the steps to reproduce were executed on MR instead of Tez.
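
One way to confirm that the subdirectory layout (and not the files themselves) is the problem is to point Spark at the nested directories directly, bypassing the metastore. This is not part of the original answer - just a hypothetical check, reusing the sparkSession from the question and the paths from the listing above; the column names in the result may not match the Hive schema, depending on how Hive wrote the ORC files.

// Hypothetical check: read the ORC files directly by path, bypassing the
// Hive metastore. If this returns rows while "select * from test.table1"
// returns none, the nested directory layout is the culprit.
Dataset<Row> direct = sparkSession.read().orc(
        "/user/hive/warehouse/test.db/table1/1/1",
        "/user/hive/warehouse/test.db/table1/2");
System.out.println("rows read directly: " + direct.count());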

If you need to read such tables, try using HiveContext and setting the above properties.
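
A minimal sketch of that suggestion, assuming the two properties are honored when passed through the Hive-enabled SparkSession (which plays the role of HiveContext in Spark 2.x) - something to verify on Hive 2.3.3 / Spark 2.4.4 rather than a guaranteed fix:

// Sketch only: apply the subdirectory-related properties to the
// Hive-enabled session before querying the table.
SparkSession sparkSession = SparkSession
        .builder()
        .config("hive.execution.engine", "mr")
        .config("hive.input.dir.recursive", "true")             // assumption: honored by Spark's Hive read path
        .config("hive.mapred.supports.subdirectories", "true")  // assumption: honored by Spark's Hive read path
        .appName("OrcExportJob")
        .enableHiveSupport()
        .getOrCreate();

// The same properties can also be set per session via SQL:
sparkSession.sql("SET hive.input.dir.recursive=true");
sparkSession.sql("SET hive.mapred.supports.subdirectories=true");

Dataset<Row> sqlDF = sparkSession.sql("select * from test.table1");

If these settings have no effect, creating the table without subdirectories (see the points below) is the more reliable route.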

BTW, your second query is more efficient because it reads the source table only once, and it does not create subdirectories.

Also, you can try running your CREATE TABLE on MR - it will not create subdirectories (set hive.execution.engine=mr; before the CTAS).

Also, wrapping the UNION ALL into a subquery and adding something like DISTRIBUTE BY or ORDER BY will force an additional reduce step, so the final files are written by the reducers directly into the table location instead of into per-branch subdirectories; see https://stackoverflow.com/a/54998031/2700344
