How can I get nearest dates to year end in dataframe column?

Time:10-12

I have a dataframe column from 1995 to today and I would like to create a boolean array that has True for the nearest date BEFORE the next year starts.

So for

["2000-12-22","2000-12-26","2000-12-29","2001-01-02"]

I expect:

[False, False, True, False]

How can I do that? Thanks!

CodePudding user response:

Group by year, use tail to get the latest row in each year, then use .isin to check whether a given date is the latest one for its year.

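import pandas as pd
df = pd.DataFrame({"date":["2000-12-22","2000-12-26","2000-12-29","2001-01-02"]})
# Pull the four-digit year out of each date string (raw string keeps \d intact)
df["year"] = df["date"].str.extract(r"(\d\d\d\d)", expand=False)
# Last row of each year group; relies on the rows being in ascending date order
latests = df.groupby("year").tail(1)["date"]
df["islast"] = df["date"].isin(latests)
print(df)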

output

         date  year  islast
0  2000-12-22  2000   False
1  2000-12-26  2000   False
2  2000-12-29  2000    True
3  2001-01-02  2001    True

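Disclaimer: this solution assumes the dates are already sorted in ascending order. Note that 2001-01-02 is flagged True here because it is the last (and only) 2001 date in the sample.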
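If the dates might not be sorted, one possible variation is to parse them with pd.to_datetime and compare each date against the maximum of its year. This is only a sketch: the final step, which clears the flag for the most recent year, rests on the assumption that the last year in the data is still incomplete, which is what the expected output in the question suggests.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2000-12-22", "2000-12-26", "2000-12-29", "2001-01-02"]})
dates = pd.to_datetime(df["date"])

# Latest date within each calendar year, broadcast back onto every row.
last_in_year = dates.groupby(dates.dt.year).transform("max")
df["islast"] = dates.eq(last_in_year)

# Assumption: the most recent year is incomplete, so none of its dates count yet.
df.loc[dates.dt.year == dates.dt.year.max(), "islast"] = False

print(df["islast"].tolist())  # [False, False, True, False]
```

Because the comparison is against a per-year maximum rather than a row position, the result does not depend on the row order.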

CodePudding user response:

Use the built-in datetime library, https://docs.python.org/3/library/datetime.html, to achieve this without external dependencies such as pandas or numpy.

from datetime import date

Convert the supplied list of date strings, assumed to be in ISO format, to date objects:

list_of_date_strings = ["2000-12-22","2000-12-26","2000-12-29","2001-01-02"]
datetime_list = [date.fromisoformat(d) for d in list_of_date_strings]

Now convert to day of year:

datetime_days_of_year = [d.timetuple().tm_yday for d in datetime_list]

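and create the mask (it compares against the single largest day of year in the whole list, so it only suits a list that contains one year end):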

mask = [d == max(datetime_days_of_year) for d in datetime_days_of_year]

print(mask) # -> [False, False, True, False]

In one go:

    from datetime import date
    list_of_date_strings = ["2000-12-22","2000-12-26","2000-12-29","2001-01-02"]
    datetime_list = [date.fromisoformat(d) for d in list_of_date_strings]
    datetime_days_of_year = [d.timetuple().tm_yday for d in datetime_list]
    mask = [d == max(datetime_days_of_year) for d in datetime_days_of_year]
    print(mask) # -> [False, False, True, False]

Hey presto!
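Since the mask above compares every entry against one overall maximum, a list that covers several year ends would need the maximum taken per year instead. A minimal sketch of that idea, still using only the standard library; the 1999 dates are made up purely for illustration, and latest_by_year is just a helper name chosen here:

```python
from datetime import date

# Illustrative input spanning three calendar years (the 1999 entries are invented).
date_strings = ["1999-12-30", "1999-12-31", "2000-12-22", "2000-12-29", "2001-01-02"]
dates = [date.fromisoformat(s) for s in date_strings]

# Remember the latest date seen for each calendar year.
latest_by_year = {}
for d in dates:
    if d.year not in latest_by_year or d > latest_by_year[d.year]:
        latest_by_year[d.year] = d

# A date is flagged when it is the latest one within its own year.
mask = [d == latest_by_year[d.year] for d in dates]
print(mask)  # [False, True, False, True, True]
```

As with the pandas answer, the last entry comes out True because it is the only date of its year in the list.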

TIP: There is in fact no need to sort. If you want the date string with the highest day of the year, as per your comment, just do:

index_at_max = datetime_days_of_year.index(max(datetime_days_of_year))
max_in_date_string_list = list_of_date_strings[index_at_max]

There is a neater way to do this using sort plus a lambda and taking the last one in the list, so the method above is a bit more pedestrian but works at least.
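For reference, the sort-plus-lambda variant could look roughly like the sketch below. The helper name day_of_year is introduced here for readability only; max with the same key gives the same answer without sorting the whole list.

```python
from datetime import date

list_of_date_strings = ["2000-12-22", "2000-12-26", "2000-12-29", "2001-01-02"]

# Key function: day of the year (1..366) for an ISO date string.
day_of_year = lambda s: date.fromisoformat(s).timetuple().tm_yday

# Sort by day of year and take the last element...
latest_by_sort = sorted(list_of_date_strings, key=day_of_year)[-1]
# ...or ask max for it directly with the same key.
latest_by_max = max(list_of_date_strings, key=day_of_year)

print(latest_by_sort)  # 2000-12-29
print(latest_by_max)   # 2000-12-29
```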
