I have a folder with hundreds of .xlsx files which are daily data from years ago.
The file name pattern is 'ABC DEF 20150501.xlsx'
How can I pick files from the last day of each months for all these years.
Sadly I have no clue how to do it.
import pandas as pd
import re
files = [f for f in os.listdir(path) if re.match(r'ABC DEF [0-9] .xlsx', f)]
df = pd.DataFrame(files)
df.columns = ['file_name']
df['port_date'] = df['file_name'].apply(lambda x : x.split(' ')[3].split('.')[0])
I got only the date in one column, and run out of idea!
Any help will be appreciated.
CodePudding user response:
The following code uses a function that adds a day to the date and then compares the month of the new date to the month of the original date, if they are different the original date is the last day of the month.
import pandas as pd
from datetime import datetime, timedelta
def last_day(to_date):
delta = timedelta(days=1)
next_day = to_date delta
if to_date.month != next_day.month:
return True
return False
# read dummy filename data
df = pd.read_csv('test.csv')
df['port_date'] = pd.to_datetime(df['file_name'].str[8:16])
df['lastday'] = df['port_date'].apply(lambda dt: last_day(dt))
CodePudding user response:
If need last row for each month and year extract datetimes, sorting and gruping by years and months with GroupBy.last
:
df = pd.DataFrame({'file_name':['ABC DEF 20150501.xlsx',
'ABC DEF 20150701.xlsx',
'ABC DEF 20150711.xlsx']})
print (df)
file_name
0 ABC DEF 20150501.xlsx
1 ABC DEF 20150701.xlsx
2 ABC DEF 20150711.xlsx
df['port_date'] = pd.to_datetime(df['file_name'].str.extract('(\d )\.', expand=False))
df = df.sort_values('port_date')
df = (df.groupby([df['port_date'].dt.year, df['port_date'].dt.month])
.last()
.reset_index(drop=True))
print (df)
file_name port_date
0 ABC DEF 20150501.xlsx 2015-05-01
1 ABC DEF 20150711.xlsx 2015-07-11
CodePudding user response:
Declare a function:
def contains_date_of_last_day(name):
last_days = [
'0131',
'0228',
'0331',
'0430',
'0531',
'0630',
'0731',
'0831',
'0930',
'1031',
'1130',
'1231',
]
for i in last_days:
if i in name: return True
return False
Check if contains date of last day using the function:
files = [f for f in os.listdir(path) if contains_date_of_last_day(f)]