Home > Mobile >  Groupby Pandas , calculate multiple columns based on date difference
Groupby Pandas , calculate multiple columns based on date difference

Time:10-14

I have a pandas dataframe shown below:

CID RefID   Date        Group   MID 
100     1   1/01/2021       A                       
100     2   3/01/2021       A                       
100     3   4/01/2021       A   101             
100     4   15/01/2021      A                           
100     5   18/01/2021      A                   
200     6   3/03/2021       B                       
200     7   4/04/2021       B                       
200     8   9/04/2021       B   102             
200     9   25/04/2021      B                       
300     10  26/04/2021      C                       
300     11  27/05/2021      C           
300     12  28/05/2021      C   103 

I want to create three columns:

days_diff:

  1. This has to be created in a way that if the difference b/w the first Date and corresponding rows is greater than 30 belonging to the same CID then assign 'NAT' or 0 to the next row (reset) and then subtract the date with this row for the following values

  2. If MIDis not null and belong to same CID group assign 'NAT' or 0 to the next row (reset) and then subtract the date with this row for the following values

Otherwise just fetch the date difference b/w the first row belonging to the same CID for the corresponding rows

A: This depends on the days_diff column , this column is like a counter it will only change/increment when there's another NAT occurrence for the same CID and reset itself for every CID.

B: This column depends on the column A , if the value in A remains same it won't change otherwise increments

It's a bit complicated to explain please refer to the output below for reference. I have used .groupby() .diff() and .shift() methods to create multiple dummy columns in order to calculate this and still working on it, please let me know the best way to go about this, thanks

My expected output :

CID RefID   Date        Group   MID     days_diff       A   B
100     1   1/01/2021       A           NAT             1   1
100     2   3/01/2021       A           2 days          1   1
100     3   4/01/2021       A   101     3 days          1   1
100     4   15/01/2021      A           NAT             2   4
100     5   18/01/2021      A           3 days          2   4
200     6   3/03/2021       B           NAT             1   6
200     7   4/04/2021       B           NAT             2   7
200     8   9/04/2021       B   102     5 days          2   7
200     9   25/04/2021      B           NAT             3   9
300     10  26/04/2021      C           NAT             1   10
300     11  27/05/2021      C           NAT             2   11
300     12  28/05/2021      C   103     1 day           2   11

CodePudding user response:

You could do something like this:

def days_diff(sdf):
    result = pd.DataFrame(
        {"days_diff": pd.NaT, "A": None}, index=sdf.index
    )
    start = sdf.at[sdf.index[0], "Date"]
    for index, day, next_MID_is_na in zip(
        sdf.index[1:], sdf.Date[1:], sdf.MID.shift(1).isna()[1:]
    ):
        diff = (day - start).days
        if diff <= 30 and next_MID_is_na:
            result.at[index, "days_diff"] = diff
        else:
            start = day
    result.A = result.days_diff.isna().cumsum()
    return result

df[["days_diff", "A"]] = df[["CID", "Date", "MID"]].groupby("CID").apply(days_diff)
df["B"] = df.RefID.where(df.A != df.A.shift(1)).ffill()

Result for df created by

from io import StringIO
data = StringIO(
'''
CID RefID   Date        Group   MID 
100     1   1/01/2021       A                       
100     2   3/01/2021       A                       
100     3   4/01/2021       A   101             
100     4   15/01/2021      A                           
100     5   18/01/2021      A                   
200     6   3/03/2021       B                       
200     7   4/04/2021       B                       
200     8   9/04/2021       B   102             
200     9   25/04/2021      B                       
300     10  26/04/2021      C                       
300     11  27/05/2021      C           
300     12  28/05/2021      C   103
''')
df = pd.read_csv(data, delim_whitespace=True)
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")

is

    CID  RefID       Date Group    MID days_diff  A     B
0   100      1 2021-01-01     A    NaN       NaT  1   1.0
1   100      2 2021-01-03     A    NaN         2  1   1.0
2   100      3 2021-01-04     A  101.0         3  1   1.0
3   100      4 2021-01-15     A    NaN       NaT  2   4.0
4   100      5 2021-01-18     A    NaN         3  2   4.0
5   200      6 2021-03-03     B    NaN       NaT  1   6.0
6   200      7 2021-04-04     B    NaN       NaT  2   7.0
7   200      8 2021-04-09     B  102.0         5  2   7.0
8   200      9 2021-04-25     B    NaN       NaT  3   9.0
9   300     10 2021-04-26     C    NaN       NaT  1  10.0
10  300     11 2021-05-27     C    NaN       NaT  2  11.0
11  300     12 2021-05-28     C  103.0         1  2  11.0
  • Related