I have a scenario where I need to collect a list of file paths covering the last N hours, counting back from the current hour in the current day's folder.
For example, I have the following paths:
/user/hdfs/test/partition_date=2022-09-21/hour=19
/user/hdfs/test/partition_date=2022-09-21/hour=20
/user/hdfs/test/partition_date=2022-09-21/hour=21
/user/hdfs/test/partition_date=2022-09-21/hour=22
/user/hdfs/test/partition_date=2022-09-21/hour=23
/user/hdfs/test/partition_date=2022-09-22/hour=00
/user/hdfs/test/partition_date=2022-09-22/hour=01
/user/hdfs/test/partition_date=2022-09-22/hour=02
I will hardcode the path up to '/user/hdfs/test/'. My code will append the partition_date, which it takes from the current date using the date function.
Let's say the current timestamp is 2022-09-21 19:00. The date is then 2022-09-21 and the hour is 19.
I will pass the number of hours from the spark-submit command, so if this value is 2, the code should fetch the paths for the last 2 hours. If the current time is 2022-09-21 19:00, I need to fetch these paths:
/user/hdfs/test/partition_date=2022-09-21/hour=19
/user/hdfs/test/partition_date=2022-09-21/hour=18
Similarly, if the number of hours is 3, I need to fetch:
/user/hdfs/test/partition_date=2022-09-21/hour=19
/user/hdfs/test/partition_date=2022-09-21/hour=18
/user/hdfs/test/partition_date=2022-09-21/hour=17
At hours 00, 01, etc. (at or shortly after 12 am), the current date has already changed to the next day, so if I pass 2 as the number of hours at 00, it should fetch the previous hour from the previous date.
So the paths will be:
/user/hdfs/test/partition_date=2022-09-22/hour=00
/user/hdfs/test/partition_date=2022-09-21/hour=23
If I pass 3 as the number of hours at 1 am, it should take:
/user/hdfs/test/partition_date=2022-09-22/hour=01
/user/hdfs/test/partition_date=2022-09-22/hour=00
/user/hdfs/test/partition_date=2022-09-21/hour=23
I am trying the code below locally, but here I have to hardcode the number of hours and generate the paths accordingly.
import java.util.Calendar
import org.apache.hadoop.conf.Configuration

var currentHour = 0
var prevHour = 0
var flag = 0
var currpath = ""
var prevpath = ""
var CurrentDate = java.time.LocalDate.now
var PreviousDate = java.time.LocalDate.now.minusDays(1)
val now = Calendar.getInstance()
if (now.get(Calendar.HOUR_OF_DAY) < 1) {
  // Hour 00: the previous hour is 23 on the previous date
  currentHour = now.get(Calendar.HOUR_OF_DAY)
  prevHour = 23
  flag = 0
} else {
  // Any other hour: the previous hour is on the same date
  currentHour = now.get(Calendar.HOUR_OF_DAY)
  prevHour = now.get(Calendar.HOUR_OF_DAY) - 1
  flag = 1
}
val hdfsConf = new Configuration()
val path = "/user/hdfs/test/"
if (flag == 0) {
  currpath = path + "partition_date=" + CurrentDate + "/" + "hour=" + currentHour + "/"
  prevpath = path + "partition_date=" + PreviousDate + "/" + "hour=" + prevHour + "/"
} else {
  currpath = path + "partition_date=" + CurrentDate + "/" + "hour=" + currentHour + "/"
  prevpath = path + "partition_date=" + CurrentDate + "/" + "hour=" + prevHour + "/"
}
Can someone please help me?
How can I make this generic, so that I can pass the number of hours dynamically and get the corresponding paths in a list?
Answer:
Try this:
val currentTs = java.time.LocalDateTime.now
val hours = 3
val paths = (0 until hours)
  .map(h => currentTs.minusHours(h))
  // %02d zero-pads the hour so it matches directories such as hour=00
  .map(ts => f"/user/hdfs/test/partition_date=${ts.toLocalDate}%s/hour=${ts.getHour}%02d")
paths.foreach(println)
Output:
/user/hdfs/test/partition_date=2022-09-21/hour=14
/user/hdfs/test/partition_date=2022-09-21/hour=13
/user/hdfs/test/partition_date=2022-09-21/hour=12
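To check the midnight case from the question without waiting for midnight, the same idea can be wrapped in a function that takes the timestamp as a parameter. This is only a sketch: the object name `HourlyPaths`, the function `hourlyPaths`, and reading the number of hours from the first application argument (with a default of 1) are assumptions made for illustration, not part of the original question or answer.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object HourlyPaths {
  // Quoted parts are literals; HH is the zero-padded hour of day (00-23).
  private val partitionFormat =
    DateTimeFormatter.ofPattern("'partition_date='yyyy-MM-dd'/hour='HH")

  // Hypothetical helper: returns the paths for the last `hours` hours,
  // newest first, starting at the hour of `now`.
  def hourlyPaths(basePath: String, now: LocalDateTime, hours: Int): Seq[String] =
    (0 until hours).map(h => basePath + now.minusHours(h).format(partitionFormat))

  def main(args: Array[String]): Unit = {
    // Assumption: the number of hours is the first application argument
    // given to spark-submit; fall back to 1 if none is passed.
    val hours = args.headOption.map(_.toInt).getOrElse(1)
    hourlyPaths("/user/hdfs/test/", LocalDateTime.now, hours).foreach(println)
  }
}
```

Calling `HourlyPaths.hourlyPaths("/user/hdfs/test/", LocalDateTime.of(2022, 9, 22, 1, 0), 3)` yields the hour=01 and hour=00 paths for 2022-09-22 followed by hour=23 for 2022-09-21, since `minusHours` rolls the date back for you. If the data happens to be Parquet, the resulting list could then be passed on with something like `spark.read.parquet(paths: _*)`.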