I was solving a practice question where I wanted to get the top 5 percentile of frauds for each state. I was able to solve it in SQL, but pandas gives me a different answer than SQL does.
Full Question
Top Percentile Fraud
ABC Corp is a mid-sized insurer in the US, and in the recent past their fraudulent claims have increased significantly in their personal auto insurance portfolio. They have developed an ML-based predictive model to identify the propensity of fraudulent claims. Now they assign highly experienced claim adjusters to the top 5 percentile of claims identified by the model.
Your objective is to identify the top 5 percentile of claims from each state. Your output should be policy number, state, claim cost, and fraud score.
Question: How do I get the same answer in pandas that I obtained from SQL?
My attempt
- I broke the fraud score into 100 equal parts using pandas cut, took the categorical codes for each bin, and then kept values greater than or equal to 95, but this gives a different result.
- I am trying to get the same answer that I got from my SQL query.
import numpy as np
import pandas as pd
url = "https://raw.githubusercontent.com/bpPrg/Share/master/data/fraud_score.tsv"
df = pd.read_csv(url,delimiter='\t')
print(df.shape) # (400, 4)
df.head(2)
policy_num state claim_cost fraud_score
0 ABCD1001 CA 4113 0.613
1 ABCD1002 CA 3946 0.156
Problem
- Group by each state, and find top 5 percentile fraud scores.
My attempt
df['state_ntile'] = df.groupby('state')['fraud_score']\
                      .apply(lambda ser: pd.cut(ser, 100).cat.codes + 1)  # +1 shifts codes to 1..100 inclusive
df.query('state_ntile >=95')\
.sort_values(['state','fraud_score'],ascending=[True,False]).reset_index(drop=True)
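I suspect the discrepancy is that `pd.cut` makes equal-width bins over the *value* range, while SQL's `ntile` makes equal-count buckets by *rank*. A toy check on made-up scores (hypothetical data, just for illustration) shows the two binning behaviours diverging on skewed values:

```python
import pandas as pd

# Skewed scores: three low values, one high.
s = pd.Series([0.1, 0.2, 0.3, 0.9])

# pd.cut makes 4 equal-WIDTH bins over the range 0.1..0.9, so the three
# low values crowd into bin 0 and only 0.9 reaches bin 3.
print(pd.cut(s, 4).cat.codes.tolist())   # [0, 0, 0, 3]

# ntile(4) would put one row in each bucket (equal-COUNT bins by rank),
# which pd.qcut approximates via quantiles.
print(pd.qcut(s, 4).cat.codes.tolist())  # [0, 1, 2, 3]
```

So filtering `cat.codes >= 95` keeps rows by score magnitude, not by their rank position, which is why it disagrees with the SQL output.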
Postgres SQL code (I know SQL; I want the answer in pandas)
SELECT policy_num,
state,
claim_cost,
fraud_score,
a.percentile
FROM
(SELECT *,
ntile(100) over(PARTITION BY state
ORDER BY fraud_score DESC) AS percentile
FROM fraud_score)a
WHERE percentile <=5
The output I want
policy_num state claim_cost fraud_score percentile
0 ABCD1027 CA 2663 0.988 1
1 ABCD1016 CA 1639 0.964 2
2 ABCD1079 CA 4224 0.963 3
3 ABCD1081 CA 1080 0.951 4
4 ABCD1069 CA 1426 0.948 5
5 ABCD1222 FL 2392 0.988 1
6 ABCD1218 FL 1419 0.961 2
7 ABCD1291 FL 2581 0.939 3
8 ABCD1230 FL 2560 0.923 4
9 ABCD1277 FL 2057 0.923 5
10 ABCD1189 NY 3577 0.982 1
11 ABCD1117 NY 4903 0.978 2
12 ABCD1187 NY 3722 0.976 3
13 ABCD1196 NY 2994 0.973 4
14 ABCD1121 NY 4009 0.969 5
15 ABCD1361 TX 4950 0.999 1
16 ABCD1304 TX 1407 0.996 1
17 ABCD1398 TX 3191 0.978 2
18 ABCD1366 TX 2453 0.968 3
19 ABCD1386 TX 4311 0.963 4
20 ABCD1363 TX 4103 0.960 5
CodePudding user response:
Having spent over a decade with PostgreSQL (and the late, wonderful Greenplum), I have grown quite fond of duckdb. It is very fast, can operate straight on (from/to) Parquet files, etc. Definitely a space to watch.
Here is how it looks on your data:
import duckdb

duckdb.query_df(df, 'df', """
SELECT policy_num,
state,
claim_cost,
fraud_score,
a.percentile
FROM
(SELECT *,
ntile(100) over(PARTITION BY state
ORDER BY fraud_score DESC) AS percentile
FROM df) as a
WHERE percentile <=5
""").df()
And the result:
policy_num state claim_cost fraud_score percentile
0 ABCD1222 FL 2392 0.988 1
1 ABCD1218 FL 1419 0.961 2
2 ABCD1291 FL 2581 0.939 3
3 ABCD1230 FL 2560 0.923 4
4 ABCD1277 FL 2057 0.923 5
5 ABCD1361 TX 4950 0.999 1
6 ABCD1304 TX 1407 0.996 1
7 ABCD1398 TX 3191 0.978 2
8 ABCD1366 TX 2453 0.968 3
9 ABCD1386 TX 4311 0.963 4
10 ABCD1363 TX 4103 0.960 5
11 ABCD1027 CA 2663 0.988 1
12 ABCD1016 CA 1639 0.964 2
13 ABCD1079 CA 4224 0.963 3
14 ABCD1081 CA 1080 0.951 4
15 ABCD1069 CA 1426 0.948 5
16 ABCD1189 NY 3577 0.982 1
17 ABCD1117 NY 4903 0.978 2
18 ABCD1187 NY 3722 0.976 3
19 ABCD1196 NY 2994 0.973 4
20 ABCD1121 NY 4009 0.969 5
CodePudding user response:
Thanks to Emma, I got a partial solution. I could not get the ranks like 1, 2, 3, ..., 100, but the resulting table is at least the same as the SQL output. I am still learning how to use pandas.
Logic:
- To get the top 5 percentile, we can keep the rows whose fraud score is at or above the state's 0.95 quantile, as shown below:
import numpy as np
import pandas as pd
url = "https://raw.githubusercontent.com/bpPrg/Share/master/data/fraud_score.tsv"
df = pd.read_csv(url,delimiter='\t')
print(df.shape)
df['state_quantile'] = df.groupby('state')['fraud_score'].transform(lambda x: x.quantile(0.95))
dfx = df.query("fraud_score >= state_quantile").reset_index(drop=True)\
.sort_values(['state','fraud_score'],ascending=[True,False])
dfx
Result
policy_num state claim_cost fraud_score state_quantile
1 ABCD1027 CA 2663 0.988 0.94710
0 ABCD1016 CA 1639 0.964 0.94710
3 ABCD1079 CA 4224 0.963 0.94710
4 ABCD1081 CA 1080 0.951 0.94710
2 ABCD1069 CA 1426 0.948 0.94710
11 ABCD1222 FL 2392 0.988 0.91920
10 ABCD1218 FL 1419 0.961 0.91920
14 ABCD1291 FL 2581 0.939 0.91920
12 ABCD1230 FL 2560 0.923 0.91920
13 ABCD1277 FL 2057 0.923 0.91920
8 ABCD1189 NY 3577 0.982 0.96615
5 ABCD1117 NY 4903 0.978 0.96615
7 ABCD1187 NY 3722 0.976 0.96615
9 ABCD1196 NY 2994 0.973 0.96615
6 ABCD1121 NY 4009 0.969 0.96615
16 ABCD1361 TX 4950 0.999 0.96000
15 ABCD1304 TX 1407 0.996 0.96000
20 ABCD1398 TX 3191 0.978 0.96000
18 ABCD1366 TX 2453 0.968 0.96000
19 ABCD1386 TX 4311 0.963 0.96000
17 ABCD1363 TX 4103 0.960 0.96000