I have some data that I would like to split into four groups based upon particular points in time - the points in time being given by particular dates.
The data I have is this (assume that df
has already been created):
df["date"] = pd.to_datetime(df["date"], format = "%Y-%m-%d")
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df.groupby(by = "year", as_index = False).agg({"month":pd.Series.nunique})
year | month |
---|---|
2015 | 3 |
2016 | 12 |
2017 | 12 |
2018 | 12 |
2019 | 12 |
2020 | 12 |
2021 | 12 |
2022 | 9 |
Notice that with this data, 2015 and 2022 are not full years.
My thinking was that because I have 84 months worth of data in total (3 (6*12) 9 = 84)
, I could split the data into four groups so that each group would have approximately 21 months worth of data in total 84 / 4 = 21
.
To do this, I would first begin with the earliest date in my data set which is 2015-10-02
. With this earliest data I would add on 21 months:
from dateutil.relativedelta import relativedelta
min_date = df["date"].min().date()
print([min_date, min_date relativedelta(months = 21)]
#output
[datetime.date(2015, 10, 2), datetime.date(2017, 7, 2)]
This date range would constitute the first bin which the first group would fall into
The second group would fall into a date range where the minimum date would be one day more than the maximum date of the previous group's date range:
"2017-07-02" relativedelta(days = 1) = "2017-07-03"
This would ensure that the bins of the different groups do not overlap.
The last group would have a bit less data in it as it would include data up till the latest date in the entire dataset which is 2022-09-30
Overall, the date range bins for the different groups would look something like this
Group | Date Range |
---|---|
A | "2015-10-02", "2017-07-02" |
B | "2017-07-03", "2019-04-03" |
C | "2019-04-04", "2021-01-04" |
D | "2021-01-05", "2022-9-30" |
I know that I could find these date ranges manually and use them to filter the data set to produce the groups with np.select
but this isn't very efficient.
df["Group"] = np.select(
condlist = [
(df["date"] >= "2015-10-02") & (df["date"] <= "2017-07-02"),
(df["date"] >= "2017-07-03") & (df["date"] <= "2019-04-03"),
(df["date"] >= "2019-04-04") & (df["date"] <= "2021-01-04"),
(df["date"] >= "2021-01-05") & (df["date"] <= "2022-09-30")
],
choicelist = ["A", "B", "C", "D"]
)
Surely there must be a way to find these values (in the way that I want them) without having to find them manually
CodePudding user response:
You may want to take a look at pd.cut
.
# toy data
df = pd.DataFrame(pd.date_range('2020-01-01', '2022-01-01'), columns = ['date'])
date
0 2020-01-01
1 2020-01-02
2 2020-01-03
3 2020-01-04
4 2020-01-05
.. ...
You can generate the labels and boundaries for the bins.
from numpy import datetime64
bin_labels = [1, 2, 3, 4]
cut_bins = [datetime64('2019-12-31'), datetime64('2020-04-01'), datetime64('2020-12-31'), datetime64('2021-09-01'), datetime64('2022-01-01')]
And save the bins into a new column.
df['cut'] = pd.cut(df['date'], bins = cut_bins, labels = bin_labels]
date cut
0 2020-01-01 1
1 2020-01-02 1
2 2020-01-03 1
3 2020-01-04 1
4 2020-01-05 1
.. ... ..
727 2021-12-28 4
728 2021-12-29 4
729 2021-12-30 4
730 2021-12-31 4
731 2022-01-01 4
Hope it helps.
CodePudding user response:
I have found a way which I think works (for those who may be interested in binning date-time values in the future) - assume the data is the same as given in the question description:
from dateutil.relativedelta import relativedelta
import numpy as np
dates = []
start = df["date"].min().date()
dates.append(np.datetime64(start))
while start <= df["date"].max().date():
start = start relativedetla(months = 21)
dates.append(np.datetime64(start))
df["Group"] = pd.cut(
df["date"], bins = dates,
labels = ["A", "B", "C", "D"],
right = False #right = False ensures no group overlap in date values
)