I am trying to unify two distincts notebooks into a single one by adapting the logic of construction of two different tables.
The first one reads:
spark.sql(''' SELECT CD_CLI,
MAX(VL_RPTD_UTZO) AS MAX_VL_RPTD_UTZO,
'2017-01-31' AS DT_MVTC
FROM vl_rptd_utzo
WHERE DT_EXTC BETWEEN '2016-07-31' AND '2016-12-31'
GROUP BY CD_CLI
''').createOrReplaceTempView('vl_rptd_max_utzo_2017_01_31')
And the second one:
spark.sql('''SELECT CD_CLI,
CASE WHEN SUM(in_lim_crt) > 0
THEN ROUND(SUM(SUM_VL_TTL_FAT)/SUM(in_lim_crt), 4)
ELSE -99999999999
END AS VL_MED_FAT,
'2017-01-31' as DT_MVTC
FROM in_lim_fat
WHERE DT_MVTC BETWEEN '2016-07-31' AND '2016-12-31'
GROUP BY CD_CLI
''').createOrReplaceTempView('media_vl_fatura_2017_01_31')
My, perhaps naive?, approach was to union the two selects since they are calling the same fields from the same origin:
spark.sql('''SELECT CD_CLI,
CASE WHEN SUM(in_lim_crt) > 0
THEN ROUND(SUM(SUM_VL_TTL_FAT)/SUM(in_lim_crt), 4)
ELSE -99999999999
END AS VL_MED_FAT,
'2017-01-31' as DT_MVTC
FROM in_lim_fat
WHERE DT_MVTC BETWEEN '2016-07-31' AND '2016-12-31'
GROUP BY CD_CLI
UNION
SELECT CD_CLI,
MAX(VL_RPTD_UTZO) AS MAX_VL_RPTD_UTZO,
'2017-01-31' AS DT_MVTC
FROM vl_rptd_utzo
WHERE DT_EXTC BETWEEN '2016-07-31' AND '2016-12-31'
GROUP BY CD_CLI
''').createOrReplaceTempView('new_table')
But when I ask for a describe:
spark.sql('describe new_table').show(10, False)
The output is:
---------- ------------- -------
|col_name |data_type |comment|
---------- ------------- -------
|CD_CLI |int |null |
|VL_MED_FAT|decimal(38,4)|null |
|DT_MVTC |string |null |
---------- ------------- -------
Why MAX_VL_RPTD_UTZO is not showing up in the new table? I am new to sql and maybe this is very naive and simple but I cannot solve it.
CodePudding user response:
Your first select has CD_CLI
, VL_MED_FAT
and DT_MVTC
Your second select has CD_CLI
, MAX_VL_RPTD_UTZO
and DT_MVTC
Spark will use the column name of the first query as the schema and apply it to other subsequence queries in the union, therefore values of MAX_VL_RPTD_UTZO
will appear in VL_MED_FAT
.
Edit #1: If you want to have 4 columns then it has to be consistence between 2 queries, so something like this
select CD_CLI, VL_MED_FAT, null as MAX_VL_RPTD_UTZO, DT_MVTC from ...
union
select CD_CLI, null as VL_MED_FAT, MAX_VL_RPTD_UTZO, DT_MVTC from ...