Regex with date as String in Azure path

Time:03-15

I have many folders (in Microsoft Azure Data Lake), each named with a date in the form "ddmmyyyy". Until now, I have used a pattern like the following to read all files from all folders of a given month and year:

path_data="/mnt/data/[0-9]*032022/data_[0-9]*.json" # all folders of all days of month 03 of 2022
result=spark.read.json(path_data)

My problem now is to read all folders whose dates fall within exactly one year before a given date.

For example, for the date 14-03-2022, I need a regex that automatically reads all files of all folders between 14-03-2021 and 14-03-2022.

I tried extracting the month and year into string variables and then using those two strings in a regex that respects the conditions (for the example shown, the month should be greater than 03 when the year is 2021 and less than 03 when the year is 2022). I tried something like the following (with the variables replaced by 03, 2021 and 2022):

date_regex="([0-9]{2}[03-12]2021)|([0-9]{2}[01-03]2022)" 

Is there any hint on how I can perform such a task?

Thanks in advance

CodePudding user response:

To compare dates, use the datetime module, as in the example below. You can then keep only the folders that satisfy your condition.

# importing datetime module
import datetime
  
# date in yyyy/mm/dd format
d1 = datetime.datetime(2018, 5, 3)
d2 = datetime.datetime(2018, 6, 1)
  
# Comparing the dates will return
# either True or False
print("d1 is greater than d2 : ", d1 > d2)
print("d1 is less than d2 : ", d1 < d2)
print("d1 is not equal to d2 : ", d1 != d2)
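Applied to the folders in the question, the same comparison could look like the sketch below. It assumes you already have the folder names as plain strings (the `names` list here is made up for illustration; in practice it would come from a directory listing), and `folders_in_window` is just a helper name I chose.

```python
from datetime import datetime

def folders_in_window(folder_names, start_date, end_date):
    """Keep the ddmmyyyy folder names whose date lies in [start_date, end_date]."""
    selected = []
    for name in folder_names:
        try:
            folder_date = datetime.strptime(name, "%d%m%Y")
        except ValueError:
            continue  # not a valid ddmmyyyy name, skip it
        if start_date <= folder_date <= end_date:
            selected.append(name)
    return selected

# Hypothetical folder names, for illustration only.
names = ["13032021", "14032021", "01012022", "14032022", "15032022", "tmp"]
kept = folders_in_window(names, datetime(2021, 3, 14), datetime(2022, 3, 14))
print(kept)  # ['14032021', '01012022', '14032022']

# One path pattern per selected folder, same layout as in the question.
paths = ["/mnt/data/%s/data_[0-9]*.json" % name for name in kept]
```

The resulting `paths` list could then be handed to `spark.read.json`, which accepts a list of paths as well as a single string.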

CodePudding user response:

If I understand your question correctly, to find dates between ??-03-2021 and ??-03-2022 in the file name, you can use the following regex:

date_regex="([0-9]{2}-03-2021)|([0-9]{2}-03-2022)"

Also, if you want to customize it further, you can experiment with the pattern at the link below:

https://regex101.com/r/AgqFfH/1

Update: extract any folder named with a date between 14032021 and 14032022.

Solution: first extract the date in ddmmyyyy format with a regex, then process the files, assuming the format is correct and such a date is found in the name.

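import re
from datetime import datetime
date_regex = r"(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[0-2])[0-9]{4}"  # ddmmyyyy
m = re.search(date_regex, folder_name)  # folder_name is a placeholder, e.g. "14032021"
if m and datetime(2021, 3, 14) < datetime.strptime(m.group(0), "%d%m%Y") < datetime(2022, 3, 14):
    ...  # do any operation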

The above code is only a rough sketch to give you an overview of the approach.


I hope this solution helps.
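If you would rather stay with a single path string, one more way to apply the same idea is to enumerate the days of the window and put them into one glob. This is only a sketch: it assumes the path is expanded with Hadoop-style globbing, where `{a,b}` alternation is available, and I have not checked how it behaves when some of the listed folders do not exist. The helper names `one_year_before` and `window_glob` are my own.

```python
from datetime import date, timedelta

def one_year_before(d):
    """Same day and month one year earlier; 29 Feb falls back to 28 Feb."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)

def window_glob(end, base="/mnt/data", file_glob="data_[0-9]*.json"):
    """Build one glob listing every ddmmyyyy folder from one year before `end` up to `end`."""
    start = one_year_before(end)
    n_days = (end - start).days
    folders = [(start + timedelta(days=i)).strftime("%d%m%Y") for i in range(n_days + 1)]
    return "%s/{%s}/%s" % (base, ",".join(folders), file_glob)

path_data = window_glob(date(2022, 3, 14))
# path_data starts with "/mnt/data/{14032021,15032021," and
# ends with "14032022}/data_[0-9]*.json"
```

The string is long (one entry per day), but it avoids writing the month and day boundaries as character classes by hand.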

CodePudding user response:

It certainly isn't pretty, but here you go:

#input
day = "14"; month = "03"; startYear = "2021";

#day construction
sameTensAfter = '(' + day[0] + '[' + day[1] + '-9])';
theDaysAfter = '([' + chr(ord(day[0])+1) + '-9][0-9])';
sameTensBefore = '(' + day[0] + '[0-' + day[1] + '])';
theDaysBefore = '';
if day[0] != '0':
  theDaysBefore = '([0-' + chr(ord(day[0])-1) + '][0-9])';

#build the part for the dates with the same month as query
afterDayPart = '%s|%s' %(sameTensAfter, theDaysAfter)
beforeDayPart = '%s|%s' %(sameTensBefore, theDaysBefore)

afterMonthPart = month[0] + '([' + chr(ord(month[1])+1) + '-9])';
if month[0] == '0':
  afterMonthPart += '|(1[0-2])';

beforeMonthPart = month[0] + '([0-' + chr(ord(month[1])-1) + '])';
if month[0] == '1':
  beforeMonthPart = '(0[0-9])|' + beforeMonthPart;

#4 kinds of matches:
startDateRange = '((%s)(%s)(%s))' %(afterDayPart, month, startYear);
anyDayAfterMonth = '((%s)(%s)(%s))' %('[0-9]{2}', afterMonthPart, startYear);
endDateRange = '((%s)(%s)(%s))' %(beforeDayPart, month, int(startYear)+1);
anyDayBeforeMonth = '((%s)(%s)(%s))' %('[0-9]{2}', beforeMonthPart, int(startYear)+1);

#print regex
print(startDateRange + '|' + anyDayAfterMonth + '|' + endDateRange + '|' + anyDayBeforeMonth)

#this prints:
#(((1[4-9])|([2-9][0-9]))(03)(2021))|(([0-9]{2})(0([4-9])|(1[0-2]))(2021))|(((1[0-4])|([0-0][0-9]))(03)(2022))|(([0-9]{2})(0([0-2]))(2022))

startDateRange: the month is the same and it's the starting year; this matches the given day and every later day.

anyDayAfterMonth: the month is greater and it's the starting year; this matches any day.

endDateRange: the month is the same and it's the ending year; this matches the given day and every earlier day.

anyDayBeforeMonth: the month is smaller and it's the ending year; this matches any day.

Here's an example: https://regex101.com/r/i76s58/1
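If you want to convince yourself that the pattern does what the four cases describe, a quick check like the one below compares it against a plain date comparison for every calendar day of 2021 and 2022. The pattern is copied from the printed output for day 14, month 03, start year 2021; `in_window` is just a helper name for this sketch.

```python
import re
from datetime import date, timedelta

# The pattern printed above for day=14, month=03, startYear=2021.
pattern = re.compile(
    r"(((1[4-9])|([2-9][0-9]))(03)(2021))"
    r"|(([0-9]{2})(0([4-9])|(1[0-2]))(2021))"
    r"|(((1[0-4])|([0-0][0-9]))(03)(2022))"
    r"|(([0-9]{2})(0([0-2]))(2022))"
)

def in_window(name):
    """True when the whole ddmmyyyy folder name matches the pattern."""
    return pattern.fullmatch(name) is not None

start, end = date(2021, 3, 14), date(2022, 3, 14)
mismatches = []
day = date(2021, 1, 1)
while day <= date(2022, 12, 31):
    name = day.strftime("%d%m%Y")
    # The regex and the date comparison should agree on every real date.
    if in_window(name) != (start <= day <= end):
        mismatches.append(name)
    day += timedelta(days=1)

print(mismatches)  # []
```

Note that the pattern only looks at digits, so it also accepts names that are not real dates, such as 99032021. That is harmless as long as no such folder exists.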
