Home > OS >  I am trying to filter an excel data set to just show the data for the current week in pandas
I am trying to filter an excel data set to just show the data for the current week in pandas

Time:01-21

Currently this is how my dataset looks

  Num Name Type Status Current %  Next Milestone\nDate 
   1   sam  -    Open     54%     2023-01-16 00:00:00
   2   Dave -    Open     54%     2023-01-20 00:00:00
   3   Jake -    Open     45%     2023-01-13 00:00:00
   4   Oli  -    Open     30%     2023-01-31 00:00:00

I would like to filter this data to only show the next milestone dates due this week. Also i want the format to not include the time. Todays date 20/01/2023. i only want data from 16th-20th and currently think maybe i should filter for index upto 5 however, im really unsure.

  Num Name Type Status Current %  Next Milestone\nDate 
   1   sam   -    Open     54%     2023-01-16 
   2   Dave  -    Open     54%     2023-01-10 
   3   Jack  -    Open     45%     2023-01-17 
   4   Dean  -    Open     30%     2023-01-19 

i tried using

for loop returns list 'days' with days of the week starting with current day i.e. Wednesday will give [2,3,4,0,1]

days = []
for i in range(7):                        
day = datetime.weekdays(datetime.today()) i
if day%7<5:
    days.append(day%7)

## for loop iterates until the date reaches friday (stops increasing), 
then the else appends the rest by going backwards 
filters = [0]
 for i in range(4):                         
    if days[i 1]>days[i]:                         
        filters.append(1 i)
    else:
        for j in range(4-i):
            filters.append(-(j 1))
        break
## 'filters' contains dates relative to 'today' i.e. on tuesday, filters  
= [1,2,3,-1] <==> [We,Th,Fr,Mo]' i.e. on tuesday, weekdays = [1,2,3,-1] 
<==> [We,Th,Fr,Mo]

## 'weekdays' will contain all the weekdays needed for the filters to be 
used on the dataset

weekdays = []  

## for loop appends the dates needed for the filter, by adding the 
relative dates seen in weekdays using timedelta

for i in range(5):                                                       
    other_days = datetime.today() timedelta(days=filters[i])      
    o_days = pd.to_datetime(other_days).date()      ## convert to datetime 
    weekdays.append(o_days)

if statement adds the next week for thursday and friday

if datetime.weekday(datetime.today()) == 3 or 
datetime.weekday(datetime.today()) == 4:
   for i in range(5):                                                     
    other_days = datetime.today() timedelta(days=filters[i] 7)     ## same method as before this adds the days 7 days after each day in the current week
    o_days = pd.to_datetime(other_days).date()
    weekdays.append(o_days)

## converts weekdays to datetime64 data type so it can be filtered
weekdays = np.array(weekdays, dtype='datetime64')

## applies filter
dataset_modified = dataset_modified[(dataset_modified['Next Milestone\nDate'] > min(weekdays)) & (dataset_modified['Next Milestone\nDate']< max(weekdays))]

it gives error AttributeError: type object 'datetime.datetime' has no attribute 'weekdays'

CodePudding user response:

df['Next Milestone\nDate'] = pd.to_datetime(df['Next Milestone\nDate']).dt.date
df = df[(df['Next Milestone\nDate'] <= pd.to_datetime('20/01/2023'))&
        (df['Next Milestone\nDate'] >= pd.to_datetime('16/01/2023')))

CodePudding user response:

Use Series.dt.normalize for datetimes with 00:00:00 times and then filter in boolean indexing by Series.between:

df['Next Milestone\nDate'] = pd.to_datetime(df['Next Milestone\nDate']).dt.normalize()

out = df[df['Next Milestone\nDate'].between('2023-01-16','2023-01-20')]

If converting to dates filtering like imburningbabe solution it is slow (here 5.5 times):

#200k rows
df = pd.concat([df] * 100000, ignore_index=True)


In [189]: %%timeit
     ...: df['Next Milestone Date'] = pd.to_datetime(df['Next Milestone Date']).dt.date
     ...: df1 = df[(df['Next Milestone Date'] <= pd.to_datetime('20/01/2023'))&(df['Next Milestone Date'] >= pd.to_datetime('16/01/2023'))]
     ...: 
282 ms ± 3.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [190]: %%timeit
     ...: df['Next Milestone Date'] = pd.to_datetime(df['Next Milestone Date']).dt.normalize()
     ...: 
     ...: out = df[df['Next Milestone Date'].between('2023-01-16','2023-01-20')]
     ...: 
51.2 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • Related