This post follows on from another one I posted which can be found here: use groupby() and for loop to count column values with conditions
I am working with the same data again:
import pandas as pd
import numpy as np
from datetime import timedelta
random.seed(365)
#some data
start_date = pd.date_range(start = "2015-01-09", end = "2022-09-11", freq = "6D")
end_date = [start_date timedelta(days = np.random.exponential(scale = 100)) for start_date in start_date]
df = pd.DataFrame(
{"start_date":start_date,
"end_date":end_date}
)
#randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac = 0.7).reset_index(drop = True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
Like in the previous post, I first created a pd.Series
with the 1st day of every month in the entire history of the data
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
What I now want to do is count the number of rows in the data-frame where the df["start_date"]
values are less than the 1st day of each month in the series and where the df["end_date"]
values are greater than the 1st day of each month in the series
I would think that I would apply a lambda
function or use np.logical_and
on the dates
series to obtain the output I am after - the logic of which would look something like this:
#only obtain those rows with end dates
inactives = df[df["end_date"].isnull() == False]
dates.apply(
lambda x: (inactives[inactives["start_date"] < x] & inactives[inactives["cancel_date"] > x]).count()
)
or like this:
dates.apply(
lambda x: np.logical_and(
inactives[inactives["start_date"] < x,
inactives[inactives["cancel_date"] > x]]
).sum())
The resulting output would look like this:
month_first | count |
---|---|
2015-01-01 | 10 |
2015-02-01 | 25 |
2015-03-01 | 45 |
CodePudding user response:
Correct, we can use apply lambda for this. So, first, we create our list of first days in each month. Here we use freq
"MS"
to create start of month inside our defined interval.
new_df = pd.DataFrame({"month_first": pd.date_range(start="2015-01-01", end="2022-10-01", freq = "MS")})
This will result in this table:
month_first
0 2015-01-01
1 2015-02-01
2 2015-03-01
3 2015-04-01
4 2015-05-01
.. ...
89 2022-06-01
90 2022-07-01
91 2022-08-01
92 2022-09-01
93 2022-10-01
[94 rows x 1 columns]
Then we apply the lambda function below. So for each of the rows in our date range, we take from inactives
which the start_date
is less and end_date
is greater. We use &
operator to perform and
operation to each row of our resulting comparisons. Then, we use sum
to sum all the boolean values.
new_df["count"] = new_df["month_first"].apply(
lambda x: ((inactives["start_date"] < x) & (inactives["end_date"] > x)).sum())
This will result in this table:
month_first count
0 2015-01-01 0
1 2015-02-01 4
2 2015-03-01 9
3 2015-04-01 14
4 2015-05-01 19
.. ... ...
89 2022-06-01 25
90 2022-07-01 22
91 2022-08-01 19
92 2022-09-01 13
93 2022-10-01 13
[94 rows x 2 columns]