Alternative for array_agg while keeping all the rows, excluding current row from window-CodePudding

My table:

df = pd.DataFrame({'cid': ['A','B','C','D','E','F'], 'mid': ['1','1','2','2','2','3'], 'date_start': ['2019-05-17', '2020-01-20', '2018-02-10','2019-03-10','2020-03-10', '2018-04-11'], 'date_end': ['2020-05-17', '2021-01-20', '2019-02-10', '2020-03-10', '2021-03-15', '2021-04-11']})
df['date_start'] = pd.to_datetime(testdf['date_start']).dt.date
df['date_end'] = pd.to_datetime(testdf['date_end']).dt.date

span = []

for i in range(df.shape[0]):
    span.append((df['date_start'][i], df['date_end'][i]))

df['span'] = pd.Series(span)

cid mid span
A   1   (2019-05-17, 2020-05-17)
B   1   (2020-01-20, 2021-01-20)
C   2   (2018-02-10, 2019-02-10)
D   2   (2019-03-10, 2020-03-10)
E   2   (2020-03-10, 2021-03-15)
F   3   (2018-04-11, 2021-04-11)

What I want: (I want a tuple/list of all the spans of the previous rows for the window in each row. Which should be roughly equivalent to the pgsql query array_agg(span) over (partition by mid rows between unbounded preceding and 1 preceding). How can I do this in python using pandas?)

cid mid span                      prev_dates
A   1   (2019-05-17, 2020-05-17)  Null
B   1   (2020-01-20, 2021-01-20)  ((2019-05-17, 2020-05-17))
C   2   (2018-02-10, 2019-02-10)  Null
D   2   (2019-03-10, 2020-03-10)  ((2018-02-10, 2019-02-10))
E   2   (2020-03-10, 2021-03-15)  ((2018-02-10, 2019-02-10), (2019-03-10, 2020-03-10))
F   3   (2018-04-11, 2021-04-11)  Null

I tried this: df.groupby('mid')['span'].transform(lambda x: tuple(x)), not working as intended. I think there should be a rolling() function added, but this itself is not working where I thought it should at least give the tuple of all the spans for each group.

df.groupby('mid')['span'].agg(lambda x: tuple(x)) does somewhat like the array_agg function from pgsql, but in this way I'm not getting all the rows. If I use transform, I'm not getting what I want.

CodePudding user response：

Try this

df["prev_dates"] = [
    window.to_list()[:-1]
    for window in df.groupby("mid")["span"].expanding()
]

The groupby(...).expanding(...) does almost the same thing as the original PARTITION BY ... BETWEEN ... AND ... in pgsql. An expanding window always contains the current row so we simply exclude the last row from the result set.

CodePudding user response：

You can use an accumulation function:

def acc(s):
    l = []
    return pd.Series([pd.NA] [tuple(l:=l [x]) for x in s[:-1]],
                          index=s.index)

df['prev_dates'] = df.groupby('mid')['span'].apply(acc)

Output:

  cid mid  date_start    date_end                      span                                            prev_dates
0   A   1  2019-05-17  2020-05-17  (2019-05-17, 2020-05-17)                                                  <NA>
1   B   1  2020-01-20  2021-01-20  (2020-01-20, 2021-01-20)                           ((2019-05-17, 2020-05-17),)
2   C   2  2018-02-10  2019-02-10  (2018-02-10, 2019-02-10)                                                  <NA>
3   D   2  2019-03-10  2020-03-10  (2019-03-10, 2020-03-10)                           ((2018-02-10, 2019-02-10),)
4   E   2  2020-03-10  2021-03-15  (2020-03-10, 2021-03-15)  ((2018-02-10, 2019-02-10), (2019-03-10, 2020-03-10))
5   F   3  2018-04-11  2021-04-11  (2018-04-11, 2021-04-11)                                                  <NA>