I have a table that looks like this:
+---------+-------+------+------+----------+
|cd_cli   |vl_ren |max_t0|max_12|dt_mvtc   |
+---------+-------+------+------+----------+
|514208   |1040.00|0     |0     |2017-01-31|
|30230361 |3720.00|0     |0     |2017-01-31|
|201188220|2742.00|0     |0     |2017-01-31|
|204080612|2968.00|0     |0     |2017-01-31|
|209727665|860.00 |0     |0     |2017-01-31|
|212491854|792.00 |0     |0     |2017-01-31|
|300597652|1663.00|0     |0     |2017-01-31|
|300836378|2366.00|0     |0     |2017-01-31|
|301040450|3394.00|0     |0     |2017-01-31|
|302394154|2218.00|0     |0     |2017-01-31|
+---------+-------+------+------+----------+
And I want to select:
vlren = spark.sql('''select dt_mvtc,
                            vl_ren,
                            max_t0,
                            max_12,
                            count(cd_cli) as count_cd_cli
                     from table_xx
                     group by dt_mvtc, vl_ren, max_t0, max_12
                     order by dt_mvtc''')
But this group by does not work well here, because the vl_ren values are often very close to one another (they can differ by as little as 0.01). So I am trying to group them by ranges instead, but I am not sure how to put the ranges inside the select clause:
%%time
%%spark
vlren = spark.sql('''select dt_mvtc,
                            vl_ren,
                            max_t0,
                            max_12,
                            count(cd_cli) as count_cd_cli
                            CASE
                                WHEN vl_ren >= 0 AND vl_ren < 1000 THEN 0
                                WHEN vl_ren >= 1000 AND vl_ren < 2000 THEN 1
                                WHEN vl_ren >= 2000 AND vl_ren < 3000 THEN 2
                                WHEN vl_ren >= 3000 THEN 3
                            END AS values
                     FROM
                         vl_ren
                     ) AS vl_ren_range
                     GROUP BY dt_mvtc,vl_ren_range.values,max_12
                     from sbx_d4n0cbf.renda_presumida
                     order by dt_mvtc''')
The expected output is the count of cd_cli per vl_ren range. Is this right? Is there any other, better approach?
CodePudding user response:
In SQL, GROUP BY can't use column aliases defined in the SELECT at the same level, because conceptually GROUP BY happens before SELECT.
So you project the new grouping column in a subquery, a CTE, or a separate dataframe, e.g.
select dt_mvtc, values, max_12, count(cd_cli) count_cd_cli
from (
    select dt_mvtc,
           max_t0,
           max_12,
           cd_cli,
           CASE
               WHEN vl_ren >= 0 AND vl_ren < 1000 THEN 0
               WHEN vl_ren >= 1000 AND vl_ren < 2000 THEN 1
               WHEN vl_ren >= 2000 AND vl_ren < 3000 THEN 2
               WHEN vl_ren >= 3000 THEN 3
           END AS values
    from sbx_d4n0cbf.renda_presumida
) AS vl_ren_range
GROUP BY dt_mvtc, values, max_12
order by dt_mvtc
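The same thing can also be written as a CTE. A minimal sketch of that form, with the table name and bucket boundaries copied from the question; the alias vl_ren_bucket is just an illustrative name chosen here instead of values:
vlren = spark.sql('''
    WITH vl_ren_range AS (
        SELECT dt_mvtc,
               max_t0,
               max_12,
               cd_cli,
               CASE
                   WHEN vl_ren >= 0 AND vl_ren < 1000 THEN 0
                   WHEN vl_ren >= 1000 AND vl_ren < 2000 THEN 1
                   WHEN vl_ren >= 2000 AND vl_ren < 3000 THEN 2
                   WHEN vl_ren >= 3000 THEN 3
               END AS vl_ren_bucket  -- bucket column projected here so the outer GROUP BY can reference it
        FROM sbx_d4n0cbf.renda_presumida
    )
    SELECT dt_mvtc, vl_ren_bucket, max_12, count(cd_cli) AS count_cd_cli
    FROM vl_ren_range
    GROUP BY dt_mvtc, vl_ren_bucket, max_12
    ORDER BY dt_mvtc''')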
CodePudding user response:
You can round the vl_ren values down to the nearest 100 (a floor at the 100s) and then group on that. That way you don't have to write CASE statements that depend on the data distribution.
data = [(514208, 1040.00, 0, 0, "2017-01-31"),
(30230361, 3720.00, 0, 0, "2017-01-31"),
(201188220, 2742.00, 0, 0, "2017-01-31"),
(204080612, 2968.00, 0, 0, "2017-01-31"),
(209727665, 860.00, 0, 0, "2017-01-31"),
(212491854, 792.00, 0, 0, "2017-01-31"),
(300597652, 1663.00, 0, 0, "2017-01-31"),
(300597652, 1643.00, 0, 0, "2017-01-31"),
(300836378, 2366.00, 0, 0, "2017-01-31"),
(301040450, 3394.00, 0, 0, "2017-01-31"),
(302394154, 2218.00, 0, 0, "2017-01-31"),]
df = spark.createDataFrame(data, ("cd_cli", "vl_ren" ,"max_t0", "max_12", "dt_mvtc"))
df.createOrReplaceTempView("df_table")
spark.sql('''select dt_mvtc,
                    max_t0,
                    max_12,
                    CASE
                        WHEN (vl_ren - round(vl_ren, -2)) <= 0 THEN round(vl_ren, -2) - 100
                        ELSE round(vl_ren, -2)
                    END AS vl_ren,
                    count(cd_cli) as count_cd_cli
             FROM
                 df_table
             GROUP BY 1, 2, 3, 4
''').show()
Output
+----------+------+------+------+------------+
|   dt_mvtc|max_t0|max_12|vl_ren|count_cd_cli|
+----------+------+------+------+------------+
|2017-01-31|     0|     0| 800.0|           1|
|2017-01-31|     0|     0|1000.0|           1|
|2017-01-31|     0|     0| 700.0|           1|
|2017-01-31|     0|     0|2300.0|           1|
|2017-01-31|     0|     0|1600.0|           2|
|2017-01-31|     0|     0|2700.0|           1|
|2017-01-31|     0|     0|3300.0|           1|
|2017-01-31|     0|     0|2900.0|           1|
|2017-01-31|     0|     0|2200.0|           1|
|2017-01-31|     0|     0|3700.0|           1|
+----------+------+------+------+------------+
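If you prefer the DataFrame API, the same flooring to the nearest 100 can be expressed with floor(vl_ren / 100) * 100 instead of the CASE. A minimal sketch reusing the df created above; note that, unlike the CASE version, exact multiples of 100 stay in their own bucket rather than dropping down one:
from pyspark.sql import functions as F

# Floor vl_ren to the 100 below it, then count clients per bucket.
result = (df
          .withColumn("vl_ren_bucket", F.floor(F.col("vl_ren") / 100) * 100)
          .groupBy("dt_mvtc", "max_t0", "max_12", "vl_ren_bucket")
          .agg(F.count("cd_cli").alias("count_cd_cli"))
          .orderBy("dt_mvtc", "vl_ren_bucket"))

result.show()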