I have many folders (in Microsoft Azure Data Lake), each named with a date in the form "ddmmyyyy". Usually, I use a pattern like the following to read all files from all folders of a given month and year:
path_data="/mnt/data/[0-9]*032022/data_[0-9]*.json" # all folders of all days of month 03 of 2022
result=spark.read.json(path_data)
My problem now is to read all folders whose dates fall within exactly one year before a given date.
For example, for the date 14-03-2022, I need a regex that automatically reads all files of all folders between 14-03-2021 and 14-03-2022.
I tried extracting the month and year into string variables, then using those two strings in a regex that respects the conditions (for the example shown, the month should be greater than 03 when the year is 2021 and less than 03 when the year is 2022). I tried something like the following (replacing the variables with 03, 2021 and 2022):
date_regex="([0-9]{2}[03-12]2021)|([0-9]{2}[01-03]2022)"
Is there any hint on how I can perform such a task?
Thanks in advance
CodePudding user response:
To compare dates, use the datetime module (example below). Then you can select only the folders that satisfy your condition.
# importing datetime module
import datetime
# date in yyyy/mm/dd format
d1 = datetime.datetime(2018, 5, 3)
d2 = datetime.datetime(2018, 6, 1)
# Comparing the dates will return
# either True or False
print("d1 is greater than d2 : ", d1 > d2)
print("d1 is less than d2 : ", d1 < d2)
print("d1 is not equal to d2 : ", d1 != d2)
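Applied to the folder names from the question, the same comparison could look like the sketch below. It assumes the folder names have already been listed somehow (for example with dbutils.fs.ls on Databricks, which is not shown here). The helper names one_year_before and folders_in_last_year, and the sample folder names, are hypothetical and only there for illustration.

```python
from datetime import date, datetime


def one_year_before(end):
    """Same day one year earlier (29 Feb falls back to 28 Feb)."""
    try:
        return end.replace(year=end.year - 1)
    except ValueError:
        return end.replace(year=end.year - 1, day=28)


def folders_in_last_year(folder_names, end):
    """Keep names in ddmmyyyy form whose date lies in [end - 1 year, end]."""
    start = one_year_before(end)
    kept = []
    for name in folder_names:
        try:
            folder_date = datetime.strptime(name, "%d%m%Y").date()
        except ValueError:
            continue  # not a valid ddmmyyyy name, skip it
        if start <= folder_date <= end:
            kept.append(name)
    return kept


# Hypothetical folder names, for illustration only
names = ["13032021", "14032021", "25122021", "14032022", "15032022", "logs"]
selected = folders_in_last_year(names, date(2022, 3, 14))

# One path per selected folder, reusing the file pattern from the question
paths = ["/mnt/data/%s/data_[0-9]*.json" % name for name in selected]
print(paths)
```

Since spark.read.json also accepts a list of paths, passing paths to it should then read only the selected folders.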
CodePudding user response:
If I understand your question correctly, to find dates between ??-03-2021 and ??-03-2022 in the file name field, you can use the following regex:
date_regex="([0-9]{2}-03-2021)|([0-9]{2}-03-2022)"
Also, if you want to customize it further, you can experiment with the pattern at the link below:
https://regex101.com/r/AgqFfH/1
Update: extract any folder named with a date between 14032021 and 14032022.
Solution: first we extract the date in ddmmyyyy format with regex, then we select the files, assuming that the format is correct and such a string is found in the name.
import re
from datetime import datetime
date_regex = r"(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])[0-9]{4}"  # ddmmyyyy
m = re.fullmatch(date_regex, folder_name)  # folder_name: the name of one folder, e.g. "14032021"
if m and datetime(2021, 3, 14) < datetime.strptime(m.group(0), "%d%m%Y") < datetime(2022, 3, 14):
    ...  # do any operation
The above is just rough code to give you an overview of the approach. Note that the date is parsed before comparing, because ddmmyyyy strings do not sort chronologically when compared as numbers.
I hope this solution helps.
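An alternative that avoids regex altogether is to enumerate every day in the range and build the folder names directly. This is only a sketch of that different technique, not what the answer above does. The helper folder_names_between is hypothetical, and the brace alternation in the final path is an assumption about what the path globbing supports, so it is worth checking in your environment.

```python
from datetime import date, timedelta


def folder_names_between(start, end):
    """Return one ddmmyyyy string for every day from start to end, inclusive."""
    names = []
    current = start
    while current <= end:
        names.append(current.strftime("%d%m%Y"))
        current += timedelta(days=1)
    return names


names = folder_names_between(date(2021, 3, 14), date(2022, 3, 14))

# Assumption: the path globbing accepts {a,b,c} alternation
path_data = "/mnt/data/{%s}/data_[0-9]*.json" % ",".join(names)
print(len(names), names[0], names[-1])
```

The resulting string is long, but it names each folder explicitly, so no comparison logic is needed at read time.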
CodePudding user response:
It certainly isn't pretty, but here you go:
#input
day = "14"; month = "03"; startYear = "2021"
#day construction
sameTensAfter = '(' + day[0] + '[' + day[1] + '-9])'
theDaysAfter = '([' + chr(ord(day[0]) + 1) + '-9][0-9])'
sameTensBefore = '(' + day[0] + '[0-' + day[1] + '])'
theDaysBefore = ''
if day[0] != '0':
    theDaysBefore = '([0-' + chr(ord(day[0]) - 1) + '][0-9])'
#build the part for the dates with the same month as query
afterDayPart = '%s|%s' % (sameTensAfter, theDaysAfter)
beforeDayPart = '%s|%s' % (sameTensBefore, theDaysBefore)
afterMonthPart = month[0] + '([' + chr(ord(month[1]) + 1) + '-9])'
if month[0] == '0':
    afterMonthPart += '|(1[0-2])'
beforeMonthPart = month[0] + '([0-' + chr(ord(month[1]) - 1) + '])'
if month[0] == '1':
    beforeMonthPart = '(0[0-9])|' + beforeMonthPart
#4 kinds of matches:
startDateRange = '((%s)(%s)(%s))' % (afterDayPart, month, startYear)
anyDayAfterMonth = '((%s)(%s)(%s))' % ('[0-9]{2}', afterMonthPart, startYear)
endDateRange = '((%s)(%s)(%s))' % (beforeDayPart, month, int(startYear) + 1)
anyDayBeforeMonth = '((%s)(%s)(%s))' % ('[0-9]{2}', beforeMonthPart, int(startYear) + 1)
#print regex
print(startDateRange + '|' + anyDayAfterMonth + '|' + endDateRange + '|' + anyDayBeforeMonth)
#this prints:
#(((1[4-9])|([2-9][0-9]))(03)(2021))|(([0-9]{2})(0([4-9])|(1[0-2]))(2021))|(((1[0-4])|([0-0][0-9]))(03)(2022))|(([0-9]{2})(0([0-2]))(2022))
startDateRange: the month is the same and it's the starting year, this will take all the days including and after.
anyDayAfterMonth: the month is greater and it's the starting year, this will take any day.
endDateRange: the month is the same and it's the ending year, this will take all the days including and before.
anyDayBeforeMonth: the month is less than and it's the ending year, this will take any day.
Here's an example: https://regex101.com/r/i76s58/1
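To check the printed pattern locally, a quick sketch like the one below can be run against a few folder names. The sample names are made up for illustration, and the pattern is copied from the printed output above.

```python
import re

# The pattern printed above for day=14, month=03, startYear=2021
pattern = re.compile(
    r"(((1[4-9])|([2-9][0-9]))(03)(2021))"
    r"|(([0-9]{2})(0([4-9])|(1[0-2]))(2021))"
    r"|(((1[0-4])|([0-0][0-9]))(03)(2022))"
    r"|(([0-9]{2})(0([0-2]))(2022))"
)

# Hypothetical folder names, for illustration only
names = ["13032021", "14032021", "25122021", "05032022", "14032022", "15032022"]

# fullmatch requires the whole folder name to match the pattern
matched = [name for name in names if pattern.fullmatch(name)]
print(matched)
```

As far as I know, the path given to spark.read.json is a glob and not a full regex, so a pattern like this would be applied to a listing of folder names in Python rather than placed in the path itself.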