Pandas Dataframe Column Coding Before After Occurrence Time Series-CodePudding

I have a DataFrame that looks like this:

import pandas as pd

df = pd.DataFrame(
    {'ID': ['1', '1', '1', '1', '1',
            '2' , '2', '3', '3'], 
     'Year': ["2012", "2013", "2014", "2015",
              "2016", "2012", "2013", "2012", "2013"], 
     'Event': ['0', '0', '0', '1','0', '0',
               '0', '1', '0']})

I want to create a new column whose values are centered on the event: rows before the event count up toward it with negative values (..., -2, -1), the event row is 0, and rows after the event count up from 0 (1, 2, ...). The count should be computed separately for each ID. Some IDs have no event, so all of their rows stay 0, and an event can occur at most once per ID.

I would like for the result to look like this:

out = pd.DataFrame(
    {'ID': ['1', '1', '1', '1', '1', 
            '2', '2', '3', '3'], 
     'Year': ["2012", "2013", "2014", "2015",
              "2016", "2012", "2013", "2012",
              "2013"], 
     'Event': ['0', '0', '0', '1','0', '0',
               '0', '1', '0'], 
     'Period': ['-3', '-2', '-1', '0',
                '1', '0', '0', '0', '1']})

Any thoughts? Thank you in advance!

CodePudding user response:

What you can do is write a custom function called get_period that takes in a pd.Series where a specific event value occurs once (in your case, the string '1'), and returns a pd.Series integer range where 0 is at the same index where the event occurs.

For example, get_period(pd.Series(['0','0','0','1','0'])) determines that the length of the series is 5, locates '1' at index=3, then creates np.arange(5) = [0,1,2,3,4], subtracts 3 from each value, and returns pd.Series([-3,-2,-1,0,1]).
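The arithmetic in that example can be checked on its own with NumPy. This is only a sketch of the single-group case; the names s, event_pos and period are illustrative, and it assumes the event occurs exactly once in the series.

```python
import numpy as np
import pandas as pd

s = pd.Series(['0', '0', '0', '1', '0'])

# 0-based position of the event within the series
# (assumes the event value '1' occurs exactly once).
event_pos = int(np.flatnonzero((s == '1').to_numpy())[0])

# Offsets relative to the event: negative before it, 0 at the event, positive after.
period = np.arange(len(s)) - event_pos
print(period.tolist())  # [-3, -2, -1, 0, 1]
```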

Then we can perform a groupby ID on your DataFrame, and apply the function get_period to the Event column.

import numpy as np
import pandas as pd

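def get_period(s, event_value='1'):
    # Positions (0-based, within this group) where the event occurs.
    event_idx = np.where(s == event_value)[0]
    if len(event_idx) == 0:
        # No event for this ID: every row gets 0.
        return pd.Series([0] * len(s))
    else:
        # Offset of each row from the (single) event position.
        return pd.Series(np.arange(len(s)) - event_idx[0])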

df = pd.DataFrame({'ID': ['1', '1', '1', '1', '1', '2' , '2', '3', '3'], 'Year': ["2012", "2013", "2014", "2015", "2016", "2012", "2013", "2012", "2013"], 'Event': ['0', '0', '0', '1','0', '0', '0', '1', '0']})
df['Period'] = df.groupby("ID").Event.apply(lambda x: get_period(x)).values

Result:

  ID  Year Event  Period
0  1  2012     0      -3
1  1  2013     0      -2
2  1  2014     0      -1
3  1  2015     1       0
4  1  2016     0       1
5  2  2012     0       0
6  2  2013     0       0
7  3  2012     1       0
8  3  2013     0       1

CodePudding user response:

This is not so different from Derek's solution, but it uses only pandas.

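def get_period(x):
    # x is the sub-DataFrame for one ID. The subtraction below assumes
    # the index labels are consecutive integers (the default RangeIndex).
    x = x.copy()  # avoid modifying the group in place
    if "1" in x["Event"].values:
        # Distance of each row's index label from the event row's label.
        out = (x.index - 
               x[x["Event"].eq("1")].index[0]).values
    else:
        # No event for this ID: every row gets 0.
        out = [0] * len(x)
    x["Period"] = out
    return x

# Recent pandas versions may warn that apply operates on the grouping column.
df = df.groupby("ID").apply(get_period)\
       .reset_index(drop=True)
df["Period"] = df["Period"].astype("int")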
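For comparison, the same column can probably be built without apply at all, using cumcount and transform. This is a different technique from the two answers above, offered as a minimal sketch: it assumes the rows are already in time order within each ID, that each row is one consecutive period, and that the event is the string '1' occurring at most once per ID. The names pos, is_event and event_pos are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '1', '1', '1', '1', '2', '2', '3', '3'],
    'Year': ["2012", "2013", "2014", "2015", "2016",
             "2012", "2013", "2012", "2013"],
    'Event': ['0', '0', '0', '1', '0', '0', '0', '1', '0'],
})

# Row position within each ID group: 0, 1, 2, ...
pos = df.groupby('ID').cumcount()

# Keep the position only on the event row, then broadcast it to the
# whole group. Groups without an event end up with NaN here.
is_event = df['Event'].eq('1')
event_pos = pos.where(is_event).groupby(df['ID']).transform('max')

# Offset from the event; IDs without an event are filled with 0.
df['Period'] = (pos - event_pos).fillna(0).astype(int)
print(df)
```

On the sample data this gives the Period values -3, -2, -1, 0, 1 for ID 1, zeros for ID 2, and 0, 1 for ID 3, matching the requested output (as integers rather than strings).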