Home > Back-end >  pandas: get count within each column based on different arithmetic condition
pandas: get count within each column based on different arithmetic condition

Time:09-17

I've data frame as below. I calculate percentile based on inputs provided. I'd like to get count for each column that matches certain condition. For example, get count in a1 >value1, similarly a2 > value2 and other column.

import pandas as pd 

    df = pd.DataFrame([[10,11,20],[580,11,20],
        [500,11,20],
        [110,111,420],[11,11,20],[80,91,90],
        [80,91,'NA'],
        [10,11,13],[0,14,1111],
        [20,104,111],[220,314,1000],[200,30,2000],
        [61,31,10],[516,71,20],[10,30,330]],
         columns=['a1','a2','a3'])

calculate and describe column based on input percentile, for columns interested. drop NAs

print( (df[["a1","a2","a3"]].dropna()).describe(percentiles =[0.90,0.91,
    0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99] ))

I face certain issues:

  1. Column a3 is removed. How do I save it from being thrown away, but simply throw away that row, or ignore NA?

  2. I can get value for each column as:

print(len(df[(df['a1']>200) ]))
print(len(df[(df['a2']>100) ]))

However, this gets tricky and unreadable when data frame has ~10 columns. How do I get counts in a data frame manner for columns for a condition (a1 > 100, a2>90, a3>56 )?

Thank you.

CodePudding user response:

If compare by dictionary with keys by all columns names and values for threshold in DataFrame.gt get boolean DataFrame, then for count Trues use sum (because processing like 1):

df = df.apply(pd.to_numeric, errors='coerce')

s = df.gt({'a1': 100, 'a2': 90, 'a3': 56}).sum()
print (s)
a1    6
a2    5
a3    7
dtype: int64

Details:

print(df.gt({'a1': 100, 'a2': 90, 'a3': 56}))


       a1     a2     a3
0   False  False  False
1    True  False  False
2    True  False  False
3    True   True   True
4   False  False  False
5   False   True   True
6   False   True  False
7   False  False  False
8   False  False   True
9   False   True   True
10   True   True   True
11   True  False   True
12  False  False  False
13   True  False  False
14  False  False   True

Your solution working well for me if removed dropna:

df = df.apply(pd.to_numeric, errors='coerce')

L = [0.90,0.91, 0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99]
print( df[["a1","a2","a3"]].describe(percentiles=L))
               a1          a2           a3
count   15.000000   15.000000    14.000000
mean   160.533333   62.800000   370.357143
std    204.229166   79.165469   596.271054
min      0.000000   11.000000    10.000000
50%     80.000000   30.000000    55.000000
90%    509.600000  108.200000  1077.700000
91%    511.840000  109.180000  1092.130000
92%    514.080000  110.160000  1106.560000
93%    517.280000  115.060000  1191.010000
94%    526.240000  143.480000  1306.580000
95%    535.200000  171.900000  1422.150000
96%    544.160000  200.320000  1537.720000
97%    553.120000  228.740000  1653.290000
98%    562.080000  257.160000  1768.860000
99%    571.040000  285.580000  1884.430000
max    580.000000  314.000000  2000.000000

EDIT1: If need comapre quantiles by columns from list use:

df = df.apply(pd.to_numeric, errors='coerce')

cols = ['a1','a2','a3']
print (df[cols].quantile(0.5))
a1    80.0
a2    30.0
a3    55.0
Name: 0.5, dtype: float64

print (df[cols].gt(df[cols].quantile(0.5)))
       a1     a2     a3
0   False  False  False
1    True  False  False
2    True  False  False
3    True   True   True
4   False  False  False
5   False   True   True
6   False   True  False
7   False  False  False
8   False  False   True
9   False   True   True
10   True   True   True
11   True  False   True
12  False   True  False
13   True   True  False
14  False  False   True
  • Related