I have a data frame as below and calculate percentiles based on provided inputs. I'd like to get a count for each column that matches a certain condition — for example, the count of rows where a1 > value1, similarly a2 > value2, and so on for the other columns.
import pandas as pd

df = pd.DataFrame([[10, 11, 20], [580, 11, 20], [500, 11, 20],
                   [110, 111, 420], [11, 11, 20], [80, 91, 90],
                   [80, 91, 'NA'],  # 'NA' is a string here, which makes a3 an object column
                   [10, 11, 13], [0, 14, 1111],
                   [20, 104, 111], [220, 314, 1000], [200, 30, 2000],
                   [61, 31, 10], [516, 71, 20], [10, 30, 330]],
                  columns=['a1', 'a2', 'a3'])
Calculate and describe the columns of interest based on the input percentiles, dropping NAs:
print(df[["a1", "a2", "a3"]].dropna().describe(
    percentiles=[0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99]))
I face certain issues:

Column a3 is removed entirely. How do I keep it from being thrown away, and instead drop only the offending row, or ignore the NA?

I can get a value for each column as:
print(len(df[df['a1'] > 200]))
print(len(df[df['a2'] > 100]))
However, this gets tricky and unreadable when the data frame has ~10 columns. How do I get the counts, in a data frame manner, for a per-column condition (a1 > 100, a2 > 90, a3 > 56)?
Thank you.
CodePudding user response:
If you compare with a dictionary whose keys are the column names and whose values are the thresholds via DataFrame.gt, you get a boolean DataFrame; then count the Trues with sum (True is processed as 1):
df = df.apply(pd.to_numeric, errors='coerce')
s = df.gt({'a1': 100, 'a2': 90, 'a3': 56}).sum()
print (s)
a1 6
a2 5
a3 7
dtype: int64
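Since you asked for the counts "in a data frame manner": one way, sketched below using the same frame and thresholds, is to reshape the resulting Series into a one-row DataFrame with `to_frame` and `T`:

```python
import pandas as pd

df = pd.DataFrame([[10, 11, 20], [580, 11, 20], [500, 11, 20],
                   [110, 111, 420], [11, 11, 20], [80, 91, 90],
                   [80, 91, 'NA'], [10, 11, 13], [0, 14, 1111],
                   [20, 104, 111], [220, 314, 1000], [200, 30, 2000],
                   [61, 31, 10], [516, 71, 20], [10, 30, 330]],
                  columns=['a1', 'a2', 'a3'])
df = df.apply(pd.to_numeric, errors='coerce')  # 'NA' string -> NaN

# Per-column boolean comparison, summed to counts per column
s = df.gt({'a1': 100, 'a2': 90, 'a3': 56}).sum()

# Reshape the Series into a one-row DataFrame
out = s.to_frame('count').T
print(out)
#        a1  a2  a3
# count   6   5   7
```

The NaN in a3 compares as False in `gt`, so it is never counted toward a threshold.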
Details:
print(df.gt({'a1': 100, 'a2': 90, 'a3': 56}))
a1 a2 a3
0 False False False
1 True False False
2 True False False
3 True True True
4 False False False
5 False True True
6 False True False
7 False False False
8 False False True
9 False True True
10 True True True
11 True False True
12 False False False
13 True False False
14 False False True
Your solution works well for me once dropna is removed:
df = df.apply(pd.to_numeric, errors='coerce')
L = [0.90,0.91, 0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99]
print( df[["a1","a2","a3"]].describe(percentiles=L))
a1 a2 a3
count 15.000000 15.000000 14.000000
mean 160.533333 62.800000 370.357143
std 204.229166 79.165469 596.271054
min 0.000000 11.000000 10.000000
50% 80.000000 30.000000 55.000000
90% 509.600000 108.200000 1077.700000
91% 511.840000 109.180000 1092.130000
92% 514.080000 110.160000 1106.560000
93% 517.280000 115.060000 1191.010000
94% 526.240000 143.480000 1306.580000
95% 535.200000 171.900000 1422.150000
96% 544.160000 200.320000 1537.720000
97% 553.120000 228.740000 1653.290000
98% 562.080000 257.160000 1768.860000
99% 571.040000 285.580000 1884.430000
max 580.000000 314.000000 2000.000000
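A quick check (a sketch on the same frame) shows why removing `dropna` is enough: after `pd.to_numeric` coerces the `'NA'` string to NaN, only a3 contains a missing value, and `describe` already skips NaN per column — hence a3's count of 14 versus 15 for the others:

```python
import pandas as pd

df = pd.DataFrame([[10, 11, 20], [580, 11, 20], [500, 11, 20],
                   [110, 111, 420], [11, 11, 20], [80, 91, 90],
                   [80, 91, 'NA'], [10, 11, 13], [0, 14, 1111],
                   [20, 104, 111], [220, 314, 1000], [200, 30, 2000],
                   [61, 31, 10], [516, 71, 20], [10, 30, 330]],
                  columns=['a1', 'a2', 'a3'])
df = df.apply(pd.to_numeric, errors='coerce')

# Only a3 has a missing value after coercion...
print(df.isna().sum())
# ...so describe keeps every column, with a3's count reduced by one
print(df.describe().loc['count'])
```

Calling `dropna()` on the whole frame, by contrast, would discard the entire row containing the NaN, shifting the statistics of a1 and a2 as well.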
EDIT1: If you need to compare a list of columns against their own quantiles, use:
df = df.apply(pd.to_numeric, errors='coerce')
cols = ['a1','a2','a3']
print (df[cols].quantile(0.5))
a1 80.0
a2 30.0
a3 55.0
Name: 0.5, dtype: float64
print (df[cols].gt(df[cols].quantile(0.5)))
a1 a2 a3
0 False False False
1 True False False
2 True False False
3 True True True
4 False False False
5 False True True
6 False True False
7 False False False
8 False False True
9 False True True
10 True True True
11 True False True
12 False True False
13 True True False
14 False False True
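To turn this quantile comparison back into per-column counts, the same `sum` chain from above applies (a sketch on the same frame):

```python
import pandas as pd

df = pd.DataFrame([[10, 11, 20], [580, 11, 20], [500, 11, 20],
                   [110, 111, 420], [11, 11, 20], [80, 91, 90],
                   [80, 91, 'NA'], [10, 11, 13], [0, 14, 1111],
                   [20, 104, 111], [220, 314, 1000], [200, 30, 2000],
                   [61, 31, 10], [516, 71, 20], [10, 30, 330]],
                  columns=['a1', 'a2', 'a3'])
df = df.apply(pd.to_numeric, errors='coerce')
cols = ['a1', 'a2', 'a3']

# Count how many values in each column exceed that column's median;
# gt aligns the quantile Series against the columns before comparing
above_median = df[cols].gt(df[cols].quantile(0.5)).sum()
print(above_median)
# a1    6
# a2    7
# a3    7
```

Any other quantile works the same way, e.g. `quantile(0.9)` for values above the 90th percentile.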