Why am I not getting all columns from this Union?-CodePudding

I am trying to unify two distincts notebooks into a single one by adapting the logic of construction of two different tables.

The first one reads:

spark.sql(''' SELECT CD_CLI, 
                 MAX(VL_RPTD_UTZO) AS MAX_VL_RPTD_UTZO,
                 '2017-01-31' AS DT_MVTC
          FROM vl_rptd_utzo
          WHERE DT_EXTC BETWEEN '2016-07-31' AND '2016-12-31'
          GROUP BY CD_CLI                                                                                           
      ''').createOrReplaceTempView('vl_rptd_max_utzo_2017_01_31')

And the second one:

spark.sql('''SELECT CD_CLI,
                CASE WHEN SUM(in_lim_crt) > 0 
                    THEN ROUND(SUM(SUM_VL_TTL_FAT)/SUM(in_lim_crt), 4)
                    ELSE -99999999999
                    END AS VL_MED_FAT,
                '2017-01-31' as DT_MVTC
         FROM in_lim_fat
         WHERE DT_MVTC BETWEEN '2016-07-31' AND '2016-12-31'
         GROUP BY CD_CLI
      ''').createOrReplaceTempView('media_vl_fatura_2017_01_31')

My, perhaps naive?, approach was to union the two selects since they are calling the same fields from the same origin:

spark.sql('''SELECT CD_CLI,
                CASE WHEN SUM(in_lim_crt) > 0 
                    THEN ROUND(SUM(SUM_VL_TTL_FAT)/SUM(in_lim_crt), 4)
                    ELSE -99999999999
                    END AS VL_MED_FAT,
                '2017-01-31' as DT_MVTC
         FROM in_lim_fat
         WHERE DT_MVTC BETWEEN '2016-07-31' AND '2016-12-31'
         GROUP BY CD_CLI
         UNION
         SELECT CD_CLI,
                 MAX(VL_RPTD_UTZO) AS MAX_VL_RPTD_UTZO,
                 '2017-01-31' AS DT_MVTC
         FROM vl_rptd_utzo
         WHERE DT_EXTC BETWEEN '2016-07-31' AND '2016-12-31'
         GROUP BY CD_CLI
      ''').createOrReplaceTempView('new_table')

But when I ask for a describe:

spark.sql('describe new_table').show(10, False)

The output is:

 ---------- ------------- ------- 
|col_name  |data_type    |comment|
 ---------- ------------- ------- 
|CD_CLI    |int          |null   |
|VL_MED_FAT|decimal(38,4)|null   |
|DT_MVTC   |string       |null   |
 ---------- ------------- -------

Why MAX_VL_RPTD_UTZO is not showing up in the new table? I am new to sql and maybe this is very naive and simple but I cannot solve it.

CodePudding user response：

Your first select has CD_CLI, VL_MED_FAT and DT_MVTC

Your second select has CD_CLI, MAX_VL_RPTD_UTZO and DT_MVTC

Spark will use the column name of the first query as the schema and apply it to other subsequence queries in the union, therefore values of MAX_VL_RPTD_UTZO will appear in VL_MED_FAT.

Edit #1: If you want to have 4 columns then it has to be consistence between 2 queries, so something like this

select CD_CLI, VL_MED_FAT, null as MAX_VL_RPTD_UTZO, DT_MVTC from ...
union
select CD_CLI, null as VL_MED_FAT, MAX_VL_RPTD_UTZO, DT_MVTC from ...