Efficiently counting records with a date between two columns


Say I have this DataFrame:

      user            sub_date          unsub_date group
0    alice 2021-01-01 00:00:00 2021-02-09 00:00:00     A
1      bob 2021-02-03 00:00:00 2021-04-05 00:00:00     B
2  charlie 2021-02-03 00:00:00                 NaT     A
3     dave 2021-01-29 00:00:00 2021-09-01 00:00:00     B

What is the most efficient way to count the subbed users per date and per group? In other words, to get this DataFrame:

      date  group  subbed
2021-01-01      A       1
2021-01-01      B       0
2021-01-02      A       1
2021-01-02      B       0
       ...    ...     ...
2021-02-03      A       2
2021-02-03      B       2
       ...    ...     ...
2021-02-10      A       1
2021-02-10      B       2
       ...    ...     ...

Here's a snippet to init the example df:

import pandas as pd
import datetime as dt

users = pd.DataFrame(
    [
        ["alice", "2021-01-01", "2021-02-09", "A"],
        ["bob", "2021-02-03", "2021-04-05", "B"],
        ["charlie", "2021-02-03", None, "A"],
        ["dave", "2021-01-29", "2021-09-01", "B"],
    ],
    columns=["user", "sub_date", "unsub_date", "group"],
)

users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
    pd.to_datetime
)

CodePudding user response:

Using a smaller date range for convenience.

Note: my users df is different from the OP's; I've changed a few dates around to keep the outputs small.

In [26]: import pandas as pd
    ...: import numpy as np  # needed for np.where further down
    ...: import datetime as dt
    ...:
    ...: users = pd.DataFrame(
    ...:     [
    ...:         ["alice", "2021-01-01", "2021-01-05", "A"],
    ...:         ["bob", "2021-01-03", "2021-01-07", "B"],
    ...:         ["charlie", "2021-01-03", None, "A"],
    ...:         ["dave", "2021-01-09", "2021-01-11", "B"],
    ...:     ],
    ...:     columns=["user", "sub_date", "unsub_date", "group"],
    ...: )
    ...:
    ...: users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
    ...:     pd.to_datetime
    ...: )

In [81]: users
Out[81]:
      user   sub_date unsub_date group
0    alice 2021-01-01 2021-01-05     A
1      bob 2021-01-03 2021-01-07     B
2  charlie 2021-01-03        NaT     A
3     dave 2021-01-09 2021-01-11     B

In [82]: users.melt(id_vars=['user', 'group'])
Out[82]:
      user group    variable      value
0    alice     A    sub_date 2021-01-01
1      bob     B    sub_date 2021-01-03
2  charlie     A    sub_date 2021-01-03
3     dave     B    sub_date 2021-01-09
4    alice     A  unsub_date 2021-01-05
5      bob     B  unsub_date 2021-01-07
6  charlie     A  unsub_date        NaT
7     dave     B  unsub_date 2021-01-11

# dropna to remove rows with no unsub_date (e.g. charlie never unsubscribed)
# sort_values to order events by date
# sub_date -> +1, unsub_date -> -1, then a cumsum per group gives the number of subbed users as of that date

In [85]: melted = users.melt(id_vars=['user', 'group']).dropna().sort_values('value')
    ...: melted['sub_value'] = np.where(melted['variable'] == 'sub_date', 1, -1) # or melted['variable'].map({'sub_date': 1, 'unsub_date': -1})
    ...: melted['sub_cumsum_group'] = melted.groupby('group')['sub_value'].cumsum()
    ...: melted
Out[85]:
      user group    variable      value  sub_value  sub_cumsum_group
0    alice     A    sub_date 2021-01-01          1                 1
1      bob     B    sub_date 2021-01-03          1                 1
2  charlie     A    sub_date 2021-01-03          1                 2
4    alice     A  unsub_date 2021-01-05         -1                 1
5      bob     B  unsub_date 2021-01-07         -1                 0
3     dave     B    sub_date 2021-01-09          1                 1
7     dave     B  unsub_date 2021-01-11         -1                 0

In [93]: idx = pd.date_range(melted['value'].min(), melted['value'].max(), freq='1D')
    ...: idx
Out[93]:
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
               '2021-01-09', '2021-01-10', '2021-01-11'],
              dtype='datetime64[ns]', freq='D')

In [94]: melted.set_index('value').groupby('group')['sub_cumsum_group'].apply(lambda x: x.reindex(idx).ffill().fillna(0))
Out[94]:
group
A      2021-01-01    1.0
       2021-01-02    1.0
       2021-01-03    2.0
       2021-01-04    2.0
       2021-01-05    1.0
       2021-01-06    1.0
       2021-01-07    1.0
       2021-01-08    1.0
       2021-01-09    1.0
       2021-01-10    1.0
       2021-01-11    1.0
B      2021-01-01    0.0
       2021-01-02    0.0
       2021-01-03    1.0
       2021-01-04    1.0
       2021-01-05    1.0
       2021-01-06    1.0
       2021-01-07    0.0
       2021-01-08    0.0
       2021-01-09    1.0
       2021-01-10    1.0
       2021-01-11    0.0
Name: sub_cumsum_group, dtype: float64
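
To reshape this into the exact table the OP asked for (date, group, subbed), the grouped Series can be flattened back into columns. A minimal sketch, assuming the In [94] result is assigned to a variable named counts (a name introduced here for illustration):

counts = melted.set_index('value').groupby('group')['sub_cumsum_group'].apply(
    lambda x: x.reindex(idx).ffill().fillna(0)
)

# flatten the (group, date) MultiIndex back into columns; the date level is
# unnamed, so reset_index calls it 'level_1'
result = (
    counts.rename('subbed')
    .reset_index()
    .rename(columns={'level_1': 'date'})
    [['date', 'group', 'subbed']]
    .sort_values(['date', 'group'])
    .reset_index(drop=True)
)
result['subbed'] = result['subbed'].astype(int)  # fillna(0) left floats behind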

CodePudding user response:

Try this?

>>> users.groupby(['sub_date','group'])[['user']].count()
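
For reference, with the OP's example data this counts new subscriptions per sub_date and group rather than a running total of active subscribers; the output looks roughly like:

                  user
sub_date   group
2021-01-01 A         1
2021-01-29 B         1
2021-02-03 A         1
           B         1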

CodePudding user response:

The data is described by step functions. (The original answer includes a plot of group A's subscriber count as a step function.)
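
The answer references sc and stepfunctions without showing how they are built. A minimal sketch of the missing setup, assuming sc is the staircase package and that its Stairs constructor accepts a frame plus start/end column names (check the staircase docs for the exact signature):

import staircase as sc  # assumption: the answer uses the staircase package

# one step function per group: +1 at sub_date, -1 at unsub_date
# (a NaT unsub_date leaves the step open-ended, as for charlie)
stepfunctions = users.groupby('group').apply(sc.Stairs, 'sub_date', 'unsub_date')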

The next step is to sample the step functions at whatever dates you want, e.g. every day of January:

sc.sample(stepfunctions, pd.date_range("2021-01-01", "2021-02-01")).melt(ignore_index=False).reset_index()

The result is this:

   group   variable  value
0      A 2021-01-01      1
1      B 2021-01-01      0
2      A 2021-01-02      1
3      B 2021-01-02      0
4      A 2021-01-03      1
..   ...        ...    ...
59     B 2021-01-30      1
60     A 2021-01-31      1
61     B 2021-01-31      1
62     A 2021-02-01      1
63     B 2021-02-01      1
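
Renaming the columns then matches the layout the OP asked for; a sketch, with sampled as a name introduced here for the frame produced above:

sampled = sc.sample(stepfunctions, pd.date_range("2021-01-01", "2021-02-01")).melt(
    ignore_index=False
).reset_index()

# 'variable' holds the sampled date and 'value' the subscriber count
result = sampled.rename(columns={'variable': 'date', 'value': 'subbed'})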