Pandas counting and summing specific conditions returns only NaN


I am trying to follow the otherwise excellent solution provided in the thread pandas-counting-and-summing-specific-conditions, but the code only ever outputs NaN values and, with sum (not count), raises a FutureWarning.

Basically, for each row in my df, I want to count how many dates in a column are within a range of +/- 1 day of other dates in the same column.

If I were doing it in Excel, the following multi-conditional SUMPRODUCT or COUNTIFS would work:

=SUMPRODUCT(--(AN2>=$AN$2:$AN$35000-1),--(AN2<=$AN$2:$AN$35000+1))

or

=COUNTIFS($AN$2:$AN$35000,">="&AN2-1,$AN$2:$AN$35000,"<="&AN2+1)

In Python, trying the approach in the linked thread, I believe the code would be:

import pandas as pd
import datetime

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
                             pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
                             pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})

df["caseIntensity"] = df[(df['datet'] <= df['datet']   datetime.timedelta(days=1)) &\
                             (df['datet'] >= df['datet'] - datetime.timedelta(days=1))].sum()

The output should be: 2, 2, 2, 3, 3, 2. Instead it is all NaN!

Is it correct to assume that because I'm testing conditions, it doesn't matter whether I sum or count? If I sum, I get a FutureWarning about invalid columns (the columns are valid), which I don't understand. But mostly, my question is: why am I only getting NaN?

CodePudding user response:

I think what you're trying to sum doesn't match the logic you're trying to apply.
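A minimal sketch (same data as the question) of why the original attempt produces only NaN rather than counts:

import pandas as pd
import datetime

df = pd.DataFrame({'datet': pd.to_datetime(
    ["2020-03-04", "2020-03-05", "2020-03-09",
     "2020-03-10", "2020-03-11", "2020-03-12"])})

# The comparison is element-wise: each date is only compared with itself
# shifted by +/- 1 day, so the condition is True for every row and
# df[mask] is simply the whole frame.
mask = (df['datet'] <= df['datet'] + datetime.timedelta(days=1)) & \
       (df['datet'] >= df['datet'] - datetime.timedelta(days=1))
print(mask.tolist())      # [True, True, True, True, True, True]

# df[mask].sum() then reduces over the columns, not over other rows:
# depending on the pandas version the datetime column is either dropped
# with a FutureWarning ("invalid columns") or rejected outright, and the
# result is a Series indexed by column labels rather than by rows 0..5,
# so assigning it back as a new column aligns on nothing and fills NaN.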

Use the following code:

Create a function that counts the number of dates in that range, then call that function for each row and save the result as the value of the new column:

import pandas as pd
import datetime

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
                             pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
                             pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})

def get_dates_in_range(df_copy, row):
    # Count how many dates in the frame fall within +/- 1 day of this row's date.
    return df_copy[(df_copy['datet'] <= row['datet'] + datetime.timedelta(days=1)) &\
                     (df_copy['datet'] >= row['datet'] - datetime.timedelta(days=1))].shape[0]
    
    
df["caseIntensity"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)


        datet  caseIntensity
0  2020-03-04              2
1  2020-03-05              2
2  2020-03-09              2
3  2020-03-10              3
4  2020-03-11              3
5  2020-03-12              2

CodePudding user response:

Instead of loops in apply, it is possible to use a vectorized solution: first create numpy arrays for the shifted bounds, compare them against each other with broadcasting chained by &, and then count the Trues with sum:

a = df['datet']
b = a + pd.Timedelta(days=1)
c = a - pd.Timedelta(days=1)
    
mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])

df["caseIntensity"]  = mask.sum(axis=1)
print (df)
       datet  caseIntensity
0 2020-03-04              2
1 2020-03-05              2
2 2020-03-09              2
3 2020-03-10              3
4 2020-03-11              3
5 2020-03-12              2

Here is the performance for 6k rows:

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
                         pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
                         pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})
df = pd.concat([df] * 1000, ignore_index=True)


In [140]: %%timeit
     ...: a = df['datet']
     ...: b = a + pd.Timedelta(days=1)
     ...: c = a - pd.Timedelta(days=1)
     ...:     
     ...: mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])
     ...: 
     ...: df["caseIntensity"]  = mask.sum(axis=1)
     ...: 
469 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [141]: %%timeit
     ...: df["caseIntensity1"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)
     ...: 
     ...: 
6.2 s ± 368 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
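For frames closer to the ~35,000 rows in the question's Excel ranges, the all-pairs broadcast mask grows quadratically (roughly a gigabyte of booleans at that size). As a hedged alternative, not taken from either answer, a numpy.searchsorted sketch keeps time and memory close to linear:

import pandas as pd
import numpy as np

df = pd.DataFrame({'datet': pd.to_datetime(
    ["2020-03-04", "2020-03-05", "2020-03-09",
     "2020-03-10", "2020-03-11", "2020-03-12"])})

# Sort the dates once, then count how many fall inside
# [date - 1 day, date + 1 day] for every row with two binary searches.
dates_sorted = np.sort(df['datet'].to_numpy())
lo = df['datet'].to_numpy() - np.timedelta64(1, 'D')   # lower bound per row
hi = df['datet'].to_numpy() + np.timedelta64(1, 'D')   # upper bound per row

df['caseIntensity'] = (np.searchsorted(dates_sorted, hi, side='right')
                       - np.searchsorted(dates_sorted, lo, side='left'))
print(df)   # same counts as above: 2, 2, 2, 3, 3, 2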