Home > Enterprise >  How can I find only the files that contains date for last day of the month, using Python?
How can I find only the files that contains date for last day of the month, using Python?

Time:09-21

I have a folder with hundreds of .xlsx files which are daily data from years ago.

The file name pattern is 'ABC DEF 20150501.xlsx'

How can I pick files from the last day of each months for all these years.

Sadly I have no clue how to do it.

import pandas as pd
import re

files = [f for f in os.listdir(path) if re.match(r'ABC DEF [0-9] .xlsx', f)]
df = pd.DataFrame(files)
df.columns = ['file_name']
df['port_date'] = df['file_name'].apply(lambda x : x.split(' ')[3].split('.')[0])

I got only the date in one column, and run out of idea!

Any help will be appreciated.

CodePudding user response:

The following code uses a function that adds a day to the date and then compares the month of the new date to the month of the original date, if they are different the original date is the last day of the month.

import pandas as pd
from datetime import datetime, timedelta

def last_day(to_date):
    delta = timedelta(days=1)
    next_day = to_date   delta
    if to_date.month != next_day.month:
        return True
    return False

# read dummy filename data
df = pd.read_csv('test.csv')

df['port_date'] = pd.to_datetime(df['file_name'].str[8:16])

df['lastday'] = df['port_date'].apply(lambda dt: last_day(dt))

CodePudding user response:

If need last row for each month and year extract datetimes, sorting and gruping by years and months with GroupBy.last:

df = pd.DataFrame({'file_name':['ABC DEF 20150501.xlsx',
                                'ABC DEF 20150701.xlsx',
                                'ABC DEF 20150711.xlsx']})

print (df)
               file_name
0  ABC DEF 20150501.xlsx
1  ABC DEF 20150701.xlsx
2  ABC DEF 20150711.xlsx

df['port_date'] = pd.to_datetime(df['file_name'].str.extract('(\d )\.', expand=False))
df = df.sort_values('port_date')

df = (df.groupby([df['port_date'].dt.year, df['port_date'].dt.month])
        .last()
        .reset_index(drop=True))
print (df)
               file_name  port_date
0  ABC DEF 20150501.xlsx 2015-05-01
1  ABC DEF 20150711.xlsx 2015-07-11

CodePudding user response:

Declare a function:

def contains_date_of_last_day(name):
  last_days = [
    '0131',
    '0228',
    '0331',
    '0430',
    '0531',
    '0630',
    '0731',
    '0831',
    '0930',
    '1031',
    '1130',
    '1231',
    ]
  for i in last_days:
    if i in name: return True
      return False
    

Check if contains date of last day using the function:

files = [f for f in os.listdir(path) if contains_date_of_last_day(f)]
  • Related