I have a pandas dataframe shown below:
CID  RefID  Date        Group  MID
100  1      1/01/2021   A
100  2      3/01/2021   A
100  3      4/01/2021   A      101
100  4      15/01/2021  A
100  5      18/01/2021  A
200  6      3/03/2021   B
200  7      4/04/2021   B
200  8      9/04/2021   B      102
200  9      25/04/2021  B
300  10     26/04/2021  C
300  11     27/05/2021  C
300  12     28/05/2021  C      103
I want to create three columns:
days_diff:
If the difference between the start Date of a CID and the current row's Date is greater than 30 days, assign 'NAT' or 0 to that row (reset) and measure the following rows from this row's Date.
If MID is not null, assign 'NAT' or 0 to the next row of the same CID (reset) and measure the following rows from that row's Date.
Otherwise, just take the difference between the current row's Date and the start Date (the first row of the same CID, or the most recent reset row).
A: This depends on the days_diff column. It works like a counter: it increments only when there is another NAT occurrence for the same CID, and it restarts for every CID.
B: This depends on column A. It stays the same while the value in A is unchanged and changes whenever A changes (in the expected output below it takes the RefID of the row where A changes).
It's a bit complicated to explain, so please refer to the expected output below. I have used the .groupby(), .diff() and .shift() methods to create multiple dummy columns in order to calculate this and am still working on it. Please let me know the best way to go about this, thanks.
My expected output:
CID  RefID  Date        Group  MID  days_diff  A  B
100  1      1/01/2021   A           NAT        1  1
100  2      3/01/2021   A           2 days     1  1
100  3      4/01/2021   A      101  3 days     1  1
100  4      15/01/2021  A           NAT        2  4
100  5      18/01/2021  A           3 days     2  4
200  6      3/03/2021   B           NAT        1  6
200  7      4/04/2021   B           NAT        2  7
200  8      9/04/2021   B      102  5 days     2  7
200  9      25/04/2021  B           NAT        3  9
300  10     26/04/2021  C           NAT        1  10
300  11     27/05/2021  C           NAT        2  11
300  12     28/05/2021  C      103  1 day      2  11
Answer:
You could do something like this:
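import pandas as pd

def days_diff(sdf):
    # Object dtype lets NaT and integer day counts share one column.
    result = pd.DataFrame(
        {"days_diff": pd.Series(pd.NaT, index=sdf.index, dtype=object), "A": None},
        index=sdf.index,
    )
    start = sdf["Date"].iloc[0]  # the first row of each CID is a start (NaT) row
    # Despite its name, next_MID_is_na tells whether the *previous* row has no MID.
    for index, day, next_MID_is_na in zip(
        sdf.index[1:], sdf["Date"].iloc[1:], sdf["MID"].shift(1).isna().iloc[1:]
    ):
        diff = (day - start).days
        if diff <= 30 and next_MID_is_na:
            result.at[index, "days_diff"] = diff
        else:
            start = day  # reset: this row stays NaT and becomes the new start
    # A counts the NaT (start) rows seen so far within the CID.
    result["A"] = result["days_diff"].isna().cumsum()
    return result

# group_keys=False keeps the original row index so the assignment aligns on recent pandas.
df[["days_diff", "A"]] = (
    df[["CID", "Date", "MID"]].groupby("CID", group_keys=False).apply(days_diff)
)
# A new block starts when A changes or a new CID begins; B carries that row's RefID forward.
new_block = df["A"].ne(df["A"].shift(1)) | df["CID"].ne(df["CID"].shift(1))
df["B"] = df["RefID"].where(new_block).ffill()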
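To see the reset rule on its own, here is a minimal plain-Python sketch of the same logic for a single CID, so it can be stepped through without pandas. The names reset_diffs and limit are made up for this illustration, and None stands in for NaT; it assumes the rows are already sorted by Date.

```python
from datetime import date

def reset_diffs(dates, mids, limit=30):
    """Day differences from the current start date; None marks a reset row."""
    diffs = [None]                 # the first row of a CID is always a start row
    start = dates[0]
    for i in range(1, len(dates)):
        diff = (dates[i] - start).days
        prev_has_mid = mids[i - 1] is not None
        if diff > limit or prev_has_mid:
            start = dates[i]       # reset: this row becomes the new start
            diffs.append(None)
        else:
            diffs.append(diff)
    return diffs

# CID 100 from the question
dates_100 = [date(2021, 1, 1), date(2021, 1, 3), date(2021, 1, 4),
             date(2021, 1, 15), date(2021, 1, 18)]
mids_100 = [None, None, 101, None, None]
print(reset_diffs(dates_100, mids_100))   # [None, 2, 3, None, 3]

# CID 200 from the question: row 7 resets because 4/04 is 32 days after 3/03
dates_200 = [date(2021, 3, 3), date(2021, 4, 4), date(2021, 4, 9), date(2021, 4, 25)]
mids_200 = [None, None, 102, None]
print(reset_diffs(dates_200, mids_200))   # [None, None, 5, None]
```

A row resets either because it is more than 30 days after the current start or because the row before it carries a MID, which is the same condition the loop in days_diff checks.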
The result for df, created by
import pandas as pd
from io import StringIO
data = StringIO(
'''
CID RefID Date Group MID
100 1 1/01/2021 A
100 2 3/01/2021 A
100 3 4/01/2021 A 101
100 4 15/01/2021 A
100 5 18/01/2021 A
200 6 3/03/2021 B
200 7 4/04/2021 B
200 8 9/04/2021 B 102
200 9 25/04/2021 B
300 10 26/04/2021 C
300 11 27/05/2021 C
300 12 28/05/2021 C 103
''')
df = pd.read_csv(data, sep=r"\s+")  # delim_whitespace=True is deprecated in newer pandas
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")
is:
    CID  RefID        Date  Group    MID  days_diff  A     B
 0  100      1  2021-01-01      A    NaN        NaT  1   1.0
 1  100      2  2021-01-03      A    NaN          2  1   1.0
 2  100      3  2021-01-04      A  101.0          3  1   1.0
 3  100      4  2021-01-15      A    NaN        NaT  2   4.0
 4  100      5  2021-01-18      A    NaN          3  2   4.0
 5  200      6  2021-03-03      B    NaN        NaT  1   6.0
 6  200      7  2021-04-04      B    NaN        NaT  2   7.0
 7  200      8  2021-04-09      B  102.0          5  2   7.0
 8  200      9  2021-04-25      B    NaN        NaT  3   9.0
 9  300     10  2021-04-26      C    NaN        NaT  1  10.0
10  300     11  2021-05-27      C    NaN        NaT  2  11.0
11  300     12  2021-05-28      C  103.0          1  2  11.0
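Columns A and B in this result follow from days_diff alone. As a rough plain-Python sketch of that relationship (the function name counters is made up for illustration, None stands in for NaT, and it assumes the first row of every CID is a reset row):

```python
def counters(cids, ref_ids, diffs):
    """Derive A (reset counter per CID) and B (RefID of the latest reset row)."""
    a_col, b_col = [], []
    prev_cid = None
    a, b = 0, None
    for cid, ref_id, diff in zip(cids, ref_ids, diffs):
        if cid != prev_cid:
            a = 0                  # the counter restarts for every CID
            prev_cid = cid
        if diff is None:           # a reset row opens a new block
            a += 1
            b = ref_id
        a_col.append(a)
        b_col.append(b)
    return a_col, b_col

# CIDs 100 and 200 from the table, with None where days_diff is NaT
cids = [100] * 5 + [200] * 4
ref_ids = list(range(1, 10))
diffs = [None, 2, 3, None, 3, None, None, 5, None]
a_col, b_col = counters(cids, ref_ids, diffs)
print(a_col)   # [1, 1, 1, 2, 2, 1, 2, 2, 3]
print(b_col)   # [1, 1, 1, 4, 4, 6, 7, 7, 9]
```

B comes out as integers here; in the pandas version it is float because where() introduces NaN before ffill(), and df["B"].astype(int) converts it back if that matters.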