Bin datetime.date objects to create groups-CodePudding

I have some data that I would like to split into four groups based upon particular points in time - the points in time being given by particular dates.

The data I have is this (assume that df has already been created):

df["date"] = pd.to_datetime(df["date"], format = "%Y-%m-%d")
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

df.groupby(by = "year", as_index = False).agg({"month":pd.Series.nunique})

year	month
2015	3
2016	12
2017	12
2018	12
2019	12
2020	12
2021	12
2022	9

Notice that with this data, 2015 and 2022 are not full years.

My thinking was that because I have 84 months worth of data in total (3 (6*12) 9 = 84), I could split the data into four groups so that each group would have approximately 21 months worth of data in total 84 / 4 = 21.

To do this, I would first begin with the earliest date in my data set which is 2015-10-02. With this earliest data I would add on 21 months:

from dateutil.relativedelta import relativedelta

min_date = df["date"].min().date()
print([min_date, min_date   relativedelta(months = 21)]

#output
[datetime.date(2015, 10, 2), datetime.date(2017, 7, 2)]

This date range would constitute the first bin which the first group would fall into

The second group would fall into a date range where the minimum date would be one day more than the maximum date of the previous group's date range:

"2017-07-02"   relativedelta(days = 1) = "2017-07-03"

This would ensure that the bins of the different groups do not overlap.

The last group would have a bit less data in it as it would include data up till the latest date in the entire dataset which is 2022-09-30

Overall, the date range bins for the different groups would look something like this

Group	Date Range
A	"2015-10-02", "2017-07-02"
B	"2017-07-03", "2019-04-03"
C	"2019-04-04", "2021-01-04"
D	"2021-01-05", "2022-9-30"

I know that I could find these date ranges manually and use them to filter the data set to produce the groups with np.select but this isn't very efficient.

df["Group"] = np.select(
    condlist = [
        (df["date"] >= "2015-10-02") & (df["date"] <= "2017-07-02"),
        (df["date"] >= "2017-07-03") & (df["date"] <= "2019-04-03"),
        (df["date"] >= "2019-04-04") & (df["date"] <= "2021-01-04"),
        (df["date"] >= "2021-01-05") & (df["date"] <= "2022-09-30")
    ],
    choicelist = ["A", "B", "C", "D"]
)

Surely there must be a way to find these values (in the way that I want them) without having to find them manually

CodePudding user response：

You may want to take a look at pd.cut.

# toy data
df = pd.DataFrame(pd.date_range('2020-01-01', '2022-01-01'), columns = ['date'])

          date
0   2020-01-01
1   2020-01-02
2   2020-01-03
3   2020-01-04
4   2020-01-05
..         ...

You can generate the labels and boundaries for the bins.

from numpy import datetime64
bin_labels = [1, 2, 3, 4]
cut_bins = [datetime64('2019-12-31'), datetime64('2020-04-01'), datetime64('2020-12-31'), datetime64('2021-09-01'), datetime64('2022-01-01')]

And save the bins into a new column.

df['cut'] = pd.cut(df['date'], bins = cut_bins, labels = bin_labels]

          date cut
0   2020-01-01   1
1   2020-01-02   1
2   2020-01-03   1
3   2020-01-04   1
4   2020-01-05   1
..         ...  ..
727 2021-12-28   4
728 2021-12-29   4
729 2021-12-30   4
730 2021-12-31   4
731 2022-01-01   4

Hope it helps.

CodePudding user response：

I have found a way which I think works (for those who may be interested in binning date-time values in the future) - assume the data is the same as given in the question description:

from dateutil.relativedelta import relativedelta
import numpy as np

dates = []
start = df["date"].min().date()
dates.append(np.datetime64(start))
while start <= df["date"].max().date():
    start = start   relativedetla(months = 21)
    dates.append(np.datetime64(start))

df["Group"] = pd.cut(
    df["date"], bins = dates,
    labels = ["A", "B", "C", "D"],
    right = False #right = False ensures no group overlap in date values
)