I have a dataframe that looks like this:
import pandas as pd
df = pd.DataFrame(
    {'ID': ['1', '1', '1', '1', '1',
            '2', '2', '3', '3'],
     'Year': ["2012", "2013", "2014", "2015",
              "2016", "2012", "2013", "2012", "2013"],
     'Event': ['0', '0', '0', '1', '0', '0',
               '0', '1', '0']})
I want to create a new column whose values are centered on the event: periods before the event count down to it as negative numbers (-3, -2, -1), the period of the event is 0, and periods after the event count up from 0 (1, 2, and so on). The count is done separately for each ID. Some IDs have no event, so their values stay 0, and an event can occur at most once per ID.
I would like for the result to look like this:
out = pd.DataFrame(
    {'ID': ['1', '1', '1', '1', '1',
            '2', '2', '3', '3'],
     'Year': ["2012", "2013", "2014", "2015",
              "2016", "2012", "2013", "2012",
              "2013"],
     'Event': ['0', '0', '0', '1', '0', '0',
               '0', '1', '0'],
     'Period': ['-3', '-2', '-1', '0',
                '1', '0', '0', '0', '1']})
Any thoughts? Thank you in advance!
CodePudding user response:
What you can do is write a custom function called get_period that takes in a pd.Series where a specific event value occurs once (in your case, the string '1'), and returns a pd.Series integer range where 0 is at the same position where the event occurs.
For example, get_period(pd.Series(['0','0','0','1','0'])) determines that the length of the series is 5, locates '1' at index=3, then creates np.arange(5) = [0,1,2,3,4], subtracts 3 from each value, and returns pd.Series([-3,-2,-1,0,1]).
Then we can group your DataFrame by ID and apply the function get_period to the Event column.
import numpy as np
import pandas as pd

def get_period(s, event_value='1'):
    # Positions (within the group) where the event occurs
    event_idx = np.where(s == event_value)[0]
    if len(event_idx) == 0:
        # No event for this ID: every period is 0
        return pd.Series([0] * len(s))
    # Shift 0..len(s)-1 so that the event row becomes 0
    return pd.Series(np.arange(len(s)) - event_idx[0])

df = pd.DataFrame({'ID': ['1', '1', '1', '1', '1', '2', '2', '3', '3'], 'Year': ["2012", "2013", "2014", "2015", "2016", "2012", "2013", "2012", "2013"], 'Event': ['0', '0', '0', '1', '0', '0', '0', '1', '0']})
# .values relies on df already being sorted by ID, as it is here
df['Period'] = df.groupby("ID").Event.apply(get_period).values
Result:
ID Year Event Period
0 1 2012 0 -3
1 1 2013 0 -2
2 1 2014 0 -1
3 1 2015 1 0
4 1 2016 0 1
5 2 2012 0 0
6 2 2013 0 0
7 3 2012 1 0
8 3 2013 0 1
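If you would rather avoid apply, the same idea (row position within the group minus the position of the event row) can be sketched with cumcount and transform. This is only a sketch under the same assumptions as above: rows are already in chronological order within each ID and the event is the string '1'. The helper names pos and event_pos are just for illustration. Note that Period comes out as integers here, not strings as in your expected output.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '1', '1', '1', '1', '2', '2', '3', '3'],
    'Year': ["2012", "2013", "2014", "2015", "2016",
             "2012", "2013", "2012", "2013"],
    'Event': ['0', '0', '0', '1', '0', '0', '0', '1', '0'],
})

# Position of each row within its ID group: 0, 1, 2, ...
pos = df.groupby('ID').cumcount()

# Keep the position only on the event row, then broadcast it to the
# whole group. Groups without an event end up with NaN.
event_pos = pos.where(df['Event'].eq('1')).groupby(df['ID']).transform('max')

# Groups without an event (NaN after the subtraction) fall back to 0.
df['Period'] = (pos - event_pos).fillna(0).astype(int)
print(df)
```

Because nothing is reassembled from per-group results, this version does not depend on df being sorted by ID.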
CodePudding user response:
This is not so different from Derek's solution, but it uses pandas only.
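def get_period(x):
    if "1" in x["Event"].values:
        # Index label of the event row; assumes a consecutive integer index
        event_label = x[x["Event"].eq("1")].index[0]
        out = (x.index - event_label).values
    else:
        # No event for this ID: every period is 0
        out = [0] * len(x)
    x["Period"] = out
    return x

df = df.groupby("ID").apply(get_period)\
    .reset_index(drop=True)
df["Period"] = df["Period"].astype("int")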
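One caveat: the approaches here count rows (or index labels), so they implicitly assume one row per consecutive year for each ID. If that might not hold in your real data, a possible variant is to subtract the event year directly. This is a sketch that assumes Year can be cast to int; year and event_year are illustrative names, not part of your original code.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '1', '1', '1', '1', '2', '2', '3', '3'],
    'Year': ["2012", "2013", "2014", "2015", "2016",
             "2012", "2013", "2012", "2013"],
    'Event': ['0', '0', '0', '1', '0', '0', '0', '1', '0'],
})

# Year is stored as strings in the example, so convert it first.
year = df['Year'].astype(int)

# Year of the event for each ID, broadcast to every row of that ID.
# IDs without an event get NaN.
event_year = year.where(df['Event'].eq('1')).groupby(df['ID']).transform('max')

# Distance in years from the event; IDs without an event fall back to 0.
df['Period'] = (year - event_year).fillna(0).astype(int)
print(df)
```

On the sample data this gives the same Period column as the row-based versions, because the years are consecutive there.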