How can I bucket/bin a dataframe in Python based on the year?

Let's say I have a very simple dataframe with one column, year.

There are 14 distinct years, from 2010 to 2023.

I need to bin/bucket these years into three categories, 'old', 'medium', and 'new', where 'new' is the 3 most recent years (2021, 2022, 2023), 'medium' is 2015-2020, and 'old' is 2010-2014.

How would I do this?

CodePudding user response：

You're looking for pandas.cut.

Assuming df is your dataframe, you can use:

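import pandas as pd

# Bin edges give the intervals [2010, 2014], (2014, 2020], (2020, 2023]
bins = [2010, 2014, 2020, 2023]
labels = ["old", "medium", "new"]

# right=True closes each bin on its right edge; include_lowest=True keeps 2010 in the first bin
df["cat"] = pd.cut(df["year"], bins=bins, labels=labels, include_lowest=True, right=True)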
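To see why include_lowest=True matters with these edges, it may help to compare the result with and without it. This is only a small sketch: the series `years` and the variable names `without_lowest` and `with_lowest` are made up for illustration and only contain years sitting on or next to a bin edge.

```python
import pandas as pd

# Hypothetical sample: only the years on or next to a bin edge
years = pd.Series([2010, 2014, 2015, 2020, 2021, 2023])
bins = [2010, 2014, 2020, 2023]
labels = ["old", "medium", "new"]

# Default behaviour: bins are (2010, 2014], (2014, 2020], (2020, 2023],
# so 2010 itself falls outside the first bin and becomes NaN
without_lowest = pd.cut(years, bins=bins, labels=labels)

# include_lowest=True closes the first bin on the left, so 2010 is labelled "old"
with_lowest = pd.cut(years, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"year": years,
                    "default": without_lowest,
                    "include_lowest": with_lowest}))
```

Only the 2010 row should differ between the two columns, since it is the one value that sits exactly on the leftmost edge.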

Here is an example showing the output:

(
    # "YS" (year start) is used here because the "Y" alias is deprecated as of pandas 2.2
    pd.DataFrame(pd.date_range("2010", periods=14, freq="YS").year, columns=["year"])
        .assign(cat = lambda df_: pd.cut(df_["year"],
                                         bins=[2010, 2014, 2020, 2023],
                                         labels=["old", "medium", "new"],
                                         include_lowest=True, right=True))
)

Output:

    year     cat
0   2010     old
1   2011     old
2   2012     old
3   2013     old
4   2014     old
5   2015  medium
6   2016  medium
7   2017  medium
8   2018  medium
9   2019  medium
10  2020  medium
11  2021     new
12  2022     new
13  2023     new
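If "new" should always mean the 3 most recent years in the data rather than the fixed years 2021-2023, the upper edges could be derived from the maximum year. This is a sketch under that assumption only; the variable `latest` is made up for illustration, the 2014 cut-off for "old" is kept fixed, and the data is assumed to contain years on both sides of 2014 so the edges stay in increasing order.

```python
import pandas as pd

df = pd.DataFrame({"year": range(2010, 2024)})

# Assumption: "most recent" is measured against the largest year in the column
latest = int(df["year"].max())
lowest = int(df["year"].min())

# Right-closed bins: (lowest - 1, 2014], (2014, latest - 3], (latest - 3, latest]
# Starting one below the minimum means include_lowest is not needed
bins = [lowest - 1, 2014, latest - 3, latest]

df["cat"] = pd.cut(df["year"], bins=bins, labels=["old", "medium", "new"])
print(df["cat"].value_counts())
```

With the 2010-2023 data this reproduces the same three groups as the fixed bins, and the "new" group would shift forward on its own once later years are added.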

CodePudding user response：

You could simply create a dictionary like the following and use the year as the key to look up its bin.

bins = {'2023' : 'new', 
        '2022' : 'new',
        '2021' : 'new',
        '2020' : 'medium',
        '2019' : 'medium',
        '2018' : 'medium',
        '2017' : 'medium',
        '2016' : 'medium',
        '2015' : 'medium',
        '2014' : 'old',
        '2013' : 'old',
        '2012' : 'old',
        '2011' : 'old',
        '2010' : 'old'
        }
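
One way to apply such a dictionary to a dataframe column is Series.map. The sketch below assumes the year column holds integers, so it is cast to strings first to match the string keys used above; the dictionary is rebuilt with comprehensions only to keep the example short, and the dataframe `df` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"year": range(2010, 2024)})

# Same mapping as the dictionary above, built compactly with string keys
bins = {str(y): "old" for y in range(2010, 2015)}
bins.update({str(y): "medium" for y in range(2015, 2021)})
bins.update({str(y): "new" for y in range(2021, 2024)})

# Cast the integer years to strings so they match the keys;
# any year missing from the dictionary would map to NaN
df["cat"] = df["year"].astype(str).map(bins)
print(df)
```

If the keys were written as integers instead, the astype(str) step could be dropped.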