Remove entire month data if last day is less than 28th of that month-CodePudding

this might be a trivial problem but I can't figure out a way. I have a time series data spanning over multiple months (same or consecutive years). The data is irregular, for eg. in a given month there can be multiple entries for certain days and no entries for other days. I would like to resample the data by month, but only take those months that has entries beyond the 25th day.

Replicating the data will be a bit messy, but lets say I have the following dataframe:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_today = datetime.now()
days = pd.date_range(date_today, date_today   timedelta(115), freq='D')

np.random.seed(seed=20)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'date': days, 'value': data})

The month of July 2022 contains data upto 8th, so I need to either a) remove the entire month and then resample monthly for the remaining data, or b) resample monthly and then remove the month of July 2022.

For a uniform month with one observation each day I could do something like sampleMin = 0.80 * df.date[0].days_in_month to remove months with 80% missing days (or use min_count=n in resample aggregation), but as mentioned earlier the data is non-uniform and this might leave out legit months.

CodePudding user response：

Here's a way to do what your question asks:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_today = datetime.now()
days = pd.date_range(date_today, date_today   timedelta(115), freq='D')

np.random.seed(seed=20)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'date': days, 'value': data})

print(df.shape)
df2 = pd.DataFrame(index=df.index)
df2['yearmonth'] = df['date'].apply(lambda x:x.strftime("%Y%m"))
lastDayByYM = {ym : max(df['date'][df2['yearmonth'] == ym].apply(lambda x:int(x.strftime("%d")))) for ym in df2['yearmonth'].unique()}
[print(k, v) for k, v in lastDayByYM.items()]
df = df[df2['yearmonth'].apply(lambda x:lastDayByYM[x] >= 25)]
print(df.shape)

Output:

Row count has dropped from 116 to 109 because the 7 rows in July have been detected as belonging to a month with no entries beyond the 25th of the month and have been removed accordingly.

CodePudding user response：

... but only take those months that has entries beyond the 25th day.

You could build a mask like

mask = (
    df.date
      .groupby(df.date.dt.strftime("%Y-%m")).transform("max")
      .dt.day.gt(25)
)

to filter out months that don't have entries beyond the 25th:

df[mask] =
                          date  value
0   2022-03-14 22:46:09.536916     91
1   2022-03-15 22:46:09.536916     16
2   2022-03-16 22:46:09.536916     96
3   2022-03-17 22:46:09.536916     29
4   2022-03-18 22:46:09.536916     91
..                         ...    ...
104 2022-06-26 22:46:09.536916     77
105 2022-06-27 22:46:09.536916     79
106 2022-06-28 22:46:09.536916     96
107 2022-06-29 22:46:09.536916     55
108 2022-06-30 22:46:09.536916     82