pandas.Series.apply() lambda function to count data-frame column values with conditions

This post follows on from an earlier one of mine: use groupby() and for loop to count column values with conditions

I am working with the same data again:

import pandas as pd
import numpy as np
from datetime import timedelta

np.random.seed(365)

#some data
start_date = pd.date_range(start = "2015-01-09", end = "2022-09-11", freq = "6D")
end_date = [d + timedelta(days = np.random.exponential(scale = 100)) for d in start_date]
df = pd.DataFrame(
    {"start_date":start_date,
    "end_date":end_date}
)
#randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac = 0.7).reset_index(drop = True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")

As in the previous post, I first create a pd.Series holding the first day of every month covered by the data:

dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time

What I now want to do is, for each month start in that series, count the rows of the data-frame whose df["start_date"] value is earlier than that day and whose df["end_date"] value is later than it.

I think I should apply a lambda function or use np.logical_and on the dates series to get the output I am after. The logic would look something like this:

#only obtain those rows with end dates
inactives = df[df["end_date"].notna()]

dates.apply(
    lambda x: (inactives[inactives["start_date"] < x] & inactives[inactives["end_date"] > x]).count()
)

or like this:

dates.apply(
    lambda x: np.logical_and(
        inactives[inactives["start_date"] < x,
        inactives[inactives["end_date"] > x]]
    ).sum())

The resulting output would look like this:

month_first	count
2015-01-01	10
2015-02-01	25
2015-03-01	45

CodePudding user response:

Yes, apply with a lambda works for this. First, we build the list of the first day of each month. Here freq "MS" (month start) generates the first day of every month inside the interval we define.

new_df = pd.DataFrame({"month_first": pd.date_range(start="2015-01-01", end="2022-10-01", freq = "MS")})

This will result in this table:

   month_first
0   2015-01-01
1   2015-02-01
2   2015-03-01
3   2015-04-01
4   2015-05-01
..         ...
89  2022-06-01
90  2022-07-01
91  2022-08-01
92  2022-09-01
93  2022-10-01

[94 rows x 1 columns]
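The start and end dates above are hard-coded. As a minimal sketch, the same range could instead be derived from the data, assuming the range should run from the month of the earliest start_date to the month after the latest one. The toy df below is a hypothetical stand-in for the question's data-frame, and first_month and last_month are names introduced only for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the question's df; only start_date matters here.
df = pd.DataFrame(
    {"start_date": pd.to_datetime(["2015-01-09", "2015-03-04", "2015-06-20"])}
)

# First day of the month containing the earliest start_date.
first_month = df["start_date"].min().to_period("M").start_time
# First day of the month after the one containing the latest start_date.
last_month = (df["start_date"].max().to_period("M") + 1).start_time

new_df = pd.DataFrame(
    {"month_first": pd.date_range(start=first_month, end=last_month, freq="MS")}
)
print(new_df)
```

With the toy data this yields the seven month starts from 2015-01-01 to 2015-07-01.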

Then we apply the lambda function below. For each date x in our date range, we compare x against every row of inactives: start_date must be earlier than x and end_date must be later than x. Each comparison yields a boolean Series, and the & operator combines the two element-wise. Finally, sum adds up the boolean values, which gives the number of rows that satisfy both conditions.

new_df["count"] = new_df["month_first"].apply(
    lambda x: ((inactives["start_date"] < x) & (inactives["end_date"] > x)).sum())

This will result in this table:

   month_first  count
0   2015-01-01      0
1   2015-02-01      4
2   2015-03-01      9
3   2015-04-01     14
4   2015-05-01     19
..         ...    ...
89  2022-06-01     25
90  2022-07-01     22
91  2022-08-01     19
92  2022-09-01     13
93  2022-10-01     13

[94 rows x 2 columns]
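To check the logic by hand, here is a small sketch on hypothetical toy data (three rows, five month starts; none of these values come from the question). It also shows one possible alternative to apply: NumPy broadcasting compares every month against every row in a single step. The count_vectorized column name is made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three rows that all have an end date.
inactives = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-01-09", "2015-01-21", "2015-02-14"]),
    "end_date": pd.to_datetime(["2015-02-20", "2015-04-05", "2015-03-10"]),
})
new_df = pd.DataFrame(
    {"month_first": pd.date_range("2015-01-01", "2015-05-01", freq="MS")}
)

# apply-based count, same expression as in the answer above.
new_df["count"] = new_df["month_first"].apply(
    lambda x: ((inactives["start_date"] < x) & (inactives["end_date"] > x)).sum()
)

# Broadcasting: months as a column of shape (n_months, 1) against
# row vectors of shape (n_rows,) gives a boolean (n_months, n_rows) matrix.
months = new_df["month_first"].to_numpy()[:, None]
starts = inactives["start_date"].to_numpy()
ends = inactives["end_date"].to_numpy()
new_df["count_vectorized"] = ((starts < months) & (ends > months)).sum(axis=1)

print(new_df)
```

Both columns come out as 0, 2, 2, 1, 0 for January to May. The broadcast version builds a full months-by-rows boolean matrix, so it trades memory for speed; for data of the size in the question either approach should be fine.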