I am using Spark SQL and I have some outliers that have incredibly high transaction counts in comparison to the rest. I only want the lowest 98% of the values and to cut off the top 2% outliers. How do I go about doing that? The TOP function is not being recognized in Spark SQL. This is a sample of the table but it is a very large table.
Date | ID | Name | Transactions |
---|---|---|---|
02/02/2022 | ABC123 | Bob | 107 |
01/05/2022 | ACD232 | Emma | 34 |
12/03/2022 | HH254 | Kirsten | 23 |
12/11/2022 | HH254 | Kirsten | 47 |
CodePudding user response:
You need a couple of window functions to compute the relative rank; the row_number() will give absolute rank, but you won't know where to draw the cutoff line without a full record count to compute the percentile.
In an inner query,
Select t.*,
row_number() Over (Order By Transactions, Date desc) * 100
/ count(*) Over (Rows unbounded preceeding to rows unbounded following) as percentile
From myTable t
Then in an outer query just
Select * from (*inner query*)
Where percentile <= 98
You might be able to omit the Over clause on the Count(*), I don't know.
CodePudding user response:
You can calculate the 98th percentile value for the Transactions column and then filter the rows where the value of Transactions is below the 98th percentile. You can use the following query to accomplish that:
WITH base_data AS (
SELECT Date, ID, Name, Transactions
FROM your_table
),
percentiles AS (
SELECT percentiles_approx(Transactions, array(0.98)) AS p
FROM base_data
)
SELECT Date, ID, Name, Transactions
FROM base_data
JOIN percentiles
ON Transactions <= p
The percentiles_approx method is used on the baseData DataFrame to obtain the 98th percentile value