My table:
df = pd.DataFrame({'cid': ['A','B','C','D','E','F'], 'mid': ['1','1','2','2','2','3'], 'date_start': ['2019-05-17', '2020-01-20', '2018-02-10','2019-03-10','2020-03-10', '2018-04-11'], 'date_end': ['2020-05-17', '2021-01-20', '2019-02-10', '2020-03-10', '2021-03-15', '2021-04-11']})
df['date_start'] = pd.to_datetime(testdf['date_start']).dt.date
df['date_end'] = pd.to_datetime(testdf['date_end']).dt.date
span = []
for i in range(df.shape[0]):
span.append((df['date_start'][i], df['date_end'][i]))
df['span'] = pd.Series(span)
cid mid span
A 1 (2019-05-17, 2020-05-17)
B 1 (2020-01-20, 2021-01-20)
C 2 (2018-02-10, 2019-02-10)
D 2 (2019-03-10, 2020-03-10)
E 2 (2020-03-10, 2021-03-15)
F 3 (2018-04-11, 2021-04-11)
What I want: (I want a tuple/list of all the spans of the previous rows for the window in each row. Which should be roughly equivalent to the pgsql query array_agg(span) over (partition by mid rows between unbounded preceding and 1 preceding)
. How can I do this in python using pandas?)
cid mid span prev_dates
A 1 (2019-05-17, 2020-05-17) Null
B 1 (2020-01-20, 2021-01-20) ((2019-05-17, 2020-05-17))
C 2 (2018-02-10, 2019-02-10) Null
D 2 (2019-03-10, 2020-03-10) ((2018-02-10, 2019-02-10))
E 2 (2020-03-10, 2021-03-15) ((2018-02-10, 2019-02-10), (2019-03-10, 2020-03-10))
F 3 (2018-04-11, 2021-04-11) Null
I tried this: df.groupby('mid')['span'].transform(lambda x: tuple(x))
, not working as intended. I think there should be a rolling() function added, but this itself is not working where I thought it should at least give the tuple of all the spans for each group.
df.groupby('mid')['span'].agg(lambda x: tuple(x))
does somewhat like the array_agg function from pgsql, but in this way I'm not getting all the rows. If I use transform, I'm not getting what I want.
CodePudding user response:
Try this
df["prev_dates"] = [
window.to_list()[:-1]
for window in df.groupby("mid")["span"].expanding()
]
The groupby(...).expanding(...)
does almost the same thing as the original PARTITION BY ... BETWEEN ... AND ...
in pgsql. An expanding window always contains the current row so we simply exclude the last row from the result set.
CodePudding user response:
You can use an accumulation function:
def acc(s):
l = []
return pd.Series([pd.NA] [tuple(l:=l [x]) for x in s[:-1]],
index=s.index)
df['prev_dates'] = df.groupby('mid')['span'].apply(acc)
Output:
cid mid date_start date_end span prev_dates
0 A 1 2019-05-17 2020-05-17 (2019-05-17, 2020-05-17) <NA>
1 B 1 2020-01-20 2021-01-20 (2020-01-20, 2021-01-20) ((2019-05-17, 2020-05-17),)
2 C 2 2018-02-10 2019-02-10 (2018-02-10, 2019-02-10) <NA>
3 D 2 2019-03-10 2020-03-10 (2019-03-10, 2020-03-10) ((2018-02-10, 2019-02-10),)
4 E 2 2020-03-10 2021-03-15 (2020-03-10, 2021-03-15) ((2018-02-10, 2019-02-10), (2019-03-10, 2020-03-10))
5 F 3 2018-04-11 2021-04-11 (2018-04-11, 2021-04-11) <NA>