Home > Net >  Remove entire month data if last day is less than 28th of that month
Remove entire month data if last day is less than 28th of that month

Time:03-15

this might be a trivial problem but I can't figure out a way. I have a time series data spanning over multiple months (same or consecutive years). The data is irregular, for eg. in a given month there can be multiple entries for certain days and no entries for other days. I would like to resample the data by month, but only take those months that has entries beyond the 25th day.

Replicating the data will be a bit messy, but lets say I have the following dataframe:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_today = datetime.now()
days = pd.date_range(date_today, date_today   timedelta(115), freq='D')

np.random.seed(seed=20)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'date': days, 'value': data})

The month of July 2022 contains data upto 8th, so I need to either a) remove the entire month and then resample monthly for the remaining data, or b) resample monthly and then remove the month of July 2022.

For a uniform month with one observation each day I could do something like sampleMin = 0.80 * df.date[0].days_in_month to remove months with 80% missing days (or use min_count=n in resample aggregation), but as mentioned earlier the data is non-uniform and this might leave out legit months.

CodePudding user response:

Here's a way to do what your question asks:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_today = datetime.now()
days = pd.date_range(date_today, date_today   timedelta(115), freq='D')

np.random.seed(seed=20)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'date': days, 'value': data})

print(df.shape)
df2 = pd.DataFrame(index=df.index)
df2['yearmonth'] = df['date'].apply(lambda x:x.strftime("%Y%m"))
lastDayByYM = {ym : max(df['date'][df2['yearmonth'] == ym].apply(lambda x:int(x.strftime("%d")))) for ym in df2['yearmonth'].unique()}
[print(k, v) for k, v in lastDayByYM.items()]
df = df[df2['yearmonth'].apply(lambda x:lastDayByYM[x] >= 25)]
print(df.shape)

Output:

(116, 2)
202203 31
202204 30
202205 31
202206 30
202207 7
(109, 2)

Row count has dropped from 116 to 109 because the 7 rows in July have been detected as belonging to a month with no entries beyond the 25th of the month and have been removed accordingly.

CodePudding user response:

... but only take those months that has entries beyond the 25th day.

You could build a mask like

mask = (
    df.date
      .groupby(df.date.dt.strftime("%Y-%m")).transform("max")
      .dt.day.gt(25)
)

to filter out months that don't have entries beyond the 25th:

df[mask] =
                          date  value
0   2022-03-14 22:46:09.536916     91
1   2022-03-15 22:46:09.536916     16
2   2022-03-16 22:46:09.536916     96
3   2022-03-17 22:46:09.536916     29
4   2022-03-18 22:46:09.536916     91
..                         ...    ...
104 2022-06-26 22:46:09.536916     77
105 2022-06-27 22:46:09.536916     79
106 2022-06-28 22:46:09.536916     96
107 2022-06-29 22:46:09.536916     55
108 2022-06-30 22:46:09.536916     82
  • Related