I have a dataframe column of dates from 1995 to today, and I would like to create a boolean array that is True for the last date before each new year starts.
So for
["2000-12-22","2000-12-26","2000-12-29","2001-01-02"]
I expect:
[False, False, True, False]
How can I do that? Thanks!
CodePudding user response:
Use groupby by year, then tail to get the latest date in each year, then use .isin to check whether a given date is the latest for its year:
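import pandas as pd
df = pd.DataFrame({"date":["2000-12-22","2000-12-26","2000-12-29","2001-01-02"]})
# Extract the four-digit year from each ISO date string
df["year"] = df["date"].str.extract(r"(\d\d\d\d)")
# Last row of each year group (relies on ascending order)
latests = df.groupby("year").tail(1)["date"]
# Flag the rows whose date is the last one in its year
df["islast"] = df["date"].isin(latests)
print(df)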
Output:
         date  year  islast
0  2000-12-22  2000   False
1  2000-12-26  2000   False
2  2000-12-29  2000    True
3  2001-01-02  2001    True
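Disclaimer: this solution assumes the dates are already sorted in ascending order. Note that the final row is also marked True, because it is the latest date in the data for its year.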
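If the last row should only be flagged when a later year actually follows it, one possible variant compares each row's year with the year of the next row. This is a minimal sketch rather than part of the answer above; it assumes the dates are sorted in ascending order, and the names dates, years, next_years and mask are chosen purely for illustration.

```python
import pandas as pd

# Sample data from the question; assumed to be sorted in ascending order
dates = pd.Series(pd.to_datetime(["2000-12-22", "2000-12-26", "2000-12-29", "2001-01-02"]))
years = dates.dt.year

# Year of the following row; the final row gets NaN because nothing follows it
next_years = years.shift(-1)

# True only where the following row falls in a later year
# (any comparison with NaN evaluates to False, so the final row stays False)
mask = next_years > years
print(mask.tolist())  # -> [False, False, True, False]
```

Because the comparison is done row by row, the same code should also flag the last date of every year in a longer column.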
CodePudding user response:
Use the builtin datetime library, https://docs.python.org/3/library/datetime.html, to achieve this without external dependencies like pandas, numpy etc.
from datetime import date
Convert the supplied list of date strings, assumed to be in isoformat, to date objects:
list_of_date_strings = ["2000-12-22","2000-12-26","2000-12-29","2001-01-02"]
datetime_list = [date.fromisoformat(d) for d in list_of_date_strings]
Now convert to day of year:
datetime_days_of_year = [d.timetuple().tm_yday for d in datetime_list]
and create the mask:
mask = [d == max(datetime_days_of_year) for d in datetime_days_of_year]
print(mask) # -> [False, False, True, False]
In one go:
from datetime import date
list_of_date_strings = ["2000-12-22","2000-12-26","2000-12-29","2001-01-02"]
datetime_list = [date.fromisoformat(d) for d in list_of_date_strings]
datetime_days_of_year = [d.timetuple().tm_yday for d in datetime_list]
mask = [d == max(datetime_days_of_year) for d in datetime_days_of_year]
print(mask) # -> [False, False, True, False]
Hey presto!
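One caveat: max(datetime_days_of_year) is taken over the whole list, so the mask above marks only the single highest day of the year and would miss the year-end dates of other years in data spanning 1995 to today. A minimal stdlib sketch for the multi-year case follows; it assumes the strings are sorted in ascending order, and last_before_new_year is a hypothetical helper name, not something from the datetime library.

```python
from datetime import date


def last_before_new_year(date_strings):
    """Hypothetical helper: flag each date that is directly followed by a date in a later year.

    Assumes date_strings are ISO-formatted and sorted in ascending order.
    """
    dates = [date.fromisoformat(s) for s in date_strings]
    if not dates:
        return []
    # Compare every date with its successor
    mask = [current.year < following.year for current, following in zip(dates, dates[1:])]
    # The final date has no successor, so it is never flagged
    mask.append(False)
    return mask


print(last_before_new_year(["2000-12-22", "2000-12-26", "2000-12-29", "2001-01-02"]))
# -> [False, False, True, False]
```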
Tip: there is in fact no need to sort. But if you want the date string with the highest day of the year, as per your comment, just do:
index_at_max = datetime_days_of_year.index(max(datetime_days_of_year))
max_in_date_string_list = list_of_date_strings[index_at_max]
There is a neater way to do this by sorting with a lambda as the key and taking the last item in the list; the method above is a bit more pedestrian, but at least it works.
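The sort-plus-lambda approach mentioned above might look something like this. It is only a sketch: the variable name by_day_of_year is made up for illustration, and the strings are again assumed to be in isoformat.

```python
from datetime import date

list_of_date_strings = ["2000-12-22", "2000-12-26", "2000-12-29", "2001-01-02"]

# Sort the strings by their day of the year (1..366), smallest first
by_day_of_year = sorted(
    list_of_date_strings,
    key=lambda s: date.fromisoformat(s).timetuple().tm_yday,
)

# The last item is the one with the highest day of the year
max_in_date_string_list = by_day_of_year[-1]
print(max_in_date_string_list)  # -> 2000-12-29
```

Like the index-based version, this picks one string for the whole list, so it suits a single year boundary rather than many years at once.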