Python pandas select rows based on datetime condition

Here is the code for sample simulated data. Actual data can have varying start and end dates.

import pandas as pd
import numpy as np  

dates = pd.date_range("20100121", periods=3653)   
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))    
dfb = df.resample('B').apply(lambda x: x.iloc[-1])  # .iloc makes the positional lookup explicit

From dfb, I want to select only the months for which data is available for every day of the month. In dfb, January 2010 and January 2020 are incomplete, so I would like the data from February 2010 through December 2019.

For this particular dataset, I could do

df_out=dfb['2010-02':'2019-12']

But I am looking for a more general solution that does not hard-code the dates.

Edit: There seems to be some confusion about the question. I want to omit the rows of any month whose data does not begin on the first day of the month, and the rows of any month whose data does not end on the last day of the month. I hope that's clear.

CodePudding user response：

When you say "better" solution - I assume you mean make the range dynamic based on input data.

OK, since you mention that your data is continuous after the start date - it is a safe assumption that dates are sorted in increasing order. With this in mind, consider the code:

import pandas as pd
import numpy as np  
from datetime import date, timedelta

dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
print(df)
dfb = df.resample('B').apply(lambda x: x.iloc[-1])

# fd is the first index in your dataframe
fd = df.index[0]
# checks if the first month data is incomplete, i.e. does not start with date = 1
if fd.day != 1:
    if fd.month == 12:
        # December rolls over to January of the following year
        first_day_of_next_month = fd.replace(year=fd.year + 1, month=1, day=1)
    else:
        first_day_of_next_month = fd.replace(month=fd.month + 1, day=1)
else:
    first_day_of_next_month = fd

# ld is the last index in your dataframe
ld = df.index[-1]
# computes the next day
next_day = ld + timedelta(days=1)
# != rather than > so that 31 December (next month is 1) is handled too
if next_day.month != ld.month:
    last_day_of_prev_month = ld  # keeps the index if month is changed
else:
    last_day_of_prev_month = ld.replace(day=1) - timedelta(days=1)


df_out=dfb[first_day_of_next_month:last_day_of_prev_month]

Another way is to use dateutil.relativedelta from the python-dateutil package, which is normally already installed because pandas depends on it. The solution above sticks to the standard library's datetime module.
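
A third option is to let pandas compute the month boundaries with its own date offsets. The following is only a sketch under the same assumptions as above (sorted, gap-free daily data); it uses pd.offsets.MonthBegin and pd.offsets.MonthEnd rather than relativedelta, and the names start and end are illustrative.

```python
import numpy as np
import pandas as pd

# Same simulated data as in the question (seeded for reproducibility)
dates = pd.date_range("20100121", periods=3653)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((3653, 1)), index=dates, columns=["A"])
dfb = df.resample("B").last()

fd, ld = df.index[0], df.index[-1]

# Roll forward to the next month start unless the data already starts on one
start = fd if fd.is_month_start else fd + pd.offsets.MonthBegin(1)
# Roll back to the previous month end unless the data already ends on one
end = ld if ld.is_month_end else ld - pd.offsets.MonthEnd(1)

df_out = dfb.loc[start:end]
print(df_out.index[0], df_out.index[-1])
```

The is_month_start / is_month_end guards matter: adding MonthBegin(1) to a date that is already the first of a month would skip ahead a whole month.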

CodePudding user response：

I assume that in the general case the table is chronologically ordered (if not, use .sort_index). The idea is to extract the year and month from the date and keep only the rows whose (year, month) differs from that of both the first row and the last row. Note that this always drops the first and last month, even if they happen to be complete.

# take year and month straight from the index, so no helper columns are needed
# and the result does not depend on how many columns dfb has
year = dfb.index.year
month = dfb.index.month

first_month = (year == year[0]) & (month == month[0])
last_month = (year == year[-1]) & (month == month[-1])

dfb = dfb.loc[~first_month & ~last_month]
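
If an edge month should be dropped only when it really is incomplete, the same idea can be written with monthly periods. This is a rough sketch, assuming the original daily frame df is still available to check where the data starts and ends; months and keep are illustrative names.

```python
import numpy as np
import pandas as pd

# Same simulated data as in the question (seeded for reproducibility)
dates = pd.date_range("20100121", periods=3653)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((3653, 1)), index=dates, columns=["A"])
dfb = df.resample("B").last()

# Monthly period (e.g. 2010-01) of every row in the business-day frame
months = dfb.index.to_period("M")
keep = np.ones(len(dfb), dtype=bool)

# Drop the first month only if the daily data does not start on day 1
if not df.index[0].is_month_start:
    keep &= months != df.index[0].to_period("M")
# Drop the last month only if the daily data does not end on the month's last day
if not df.index[-1].is_month_end:
    keep &= months != df.index[-1].to_period("M")

df_out = dfb[keep]
print(df_out.index[0], df_out.index[-1])
```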