this might be a trivial problem but I can't figure out a way. I have a time series data spanning over multiple months (same or consecutive years). The data is irregular, for eg. in a given month there can be multiple entries for certain days and no entries for other days. I would like to resample the data by month, but only take those months that has entries beyond the 25th day.
Replicating the data will be a bit messy, but lets say I have the following dataframe:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today timedelta(115), freq='D')
np.random.seed(seed=20)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'date': days, 'value': data})
The month of July 2022 contains data upto 8th, so I need to either a) remove the entire month and then resample monthly for the remaining data, or b) resample monthly and then remove the month of July 2022.
For a uniform month with one observation each day I could do something like sampleMin = 0.80 * df.date[0].days_in_month
to remove months with 80% missing days (or use min_count=n
in resample aggregation), but as mentioned earlier the data is non-uniform and this might leave out legit months.
CodePudding user response:
Here's a way to do what your question asks:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today timedelta(115), freq='D')
np.random.seed(seed=20)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'date': days, 'value': data})
print(df.shape)
df2 = pd.DataFrame(index=df.index)
df2['yearmonth'] = df['date'].apply(lambda x:x.strftime("%Y%m"))
lastDayByYM = {ym : max(df['date'][df2['yearmonth'] == ym].apply(lambda x:int(x.strftime("%d")))) for ym in df2['yearmonth'].unique()}
[print(k, v) for k, v in lastDayByYM.items()]
df = df[df2['yearmonth'].apply(lambda x:lastDayByYM[x] >= 25)]
print(df.shape)
Output:
(116, 2)
202203 31
202204 30
202205 31
202206 30
202207 7
(109, 2)
Row count has dropped from 116 to 109 because the 7 rows in July have been detected as belonging to a month with no entries beyond the 25th of the month and have been removed accordingly.
CodePudding user response:
... but only take those months that has entries beyond the 25th day.
You could build a mask
like
mask = (
df.date
.groupby(df.date.dt.strftime("%Y-%m")).transform("max")
.dt.day.gt(25)
)
to filter out months that don't have entries beyond the 25th:
df[mask] =
date value
0 2022-03-14 22:46:09.536916 91
1 2022-03-15 22:46:09.536916 16
2 2022-03-16 22:46:09.536916 96
3 2022-03-17 22:46:09.536916 29
4 2022-03-18 22:46:09.536916 91
.. ... ...
104 2022-06-26 22:46:09.536916 77
105 2022-06-27 22:46:09.536916 79
106 2022-06-28 22:46:09.536916 96
107 2022-06-29 22:46:09.536916 55
108 2022-06-30 22:46:09.536916 82