Please help me with the following scenario. I'm scanning the folders for the last two hours, taking the most recent CSV file from each, and building a single list. If both hour folders contain files, the code below works as expected, but if either folder contains no files, it throws "ArrayIndexOutOfBoundsException: 0".
Code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import scala.language.postfixOps
val hdfsConf = new Configuration();
var path="/user/hdfs/test/input"
var finalFiles = List[String]()
val currentTs = java.time.LocalDateTime.now
val hours=2
var paths = (0 until hours.toInt).map(h => currentTs.minusHours(h))
  .map(ts=>s"${path}/partition_date=${ts.toLocalDate}/hour=${ts.toString.substring(11, 13)}")
  .toList
// paths: List[String] = List(/user/hdfs/test/input/partition_date=2022-11-30/hour=19,
// /user/hdfs/test/input/partition_date=2022-11-30/hour=18)
for (eachfolder <- paths) {
  var New_Folder_Path: String = eachfolder
  var fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
  var pathstatus = fs.listStatus(new Path(New_Folder_Path))
  var currpathfiles = pathstatus.map(x => Row(x.getPath.toString, x.getModificationTime))
  var latestFile = spark.sparkContext.parallelize(currpathfiles)
    .map(row => (row.getString(0), row.getLong(1)))
    .toDF("FilePath", "ModificationTime")
    .filter(col("FilePath")
    .like("%.csv%"))
    .sort($"ModificationTime".desc)
    .select(col("FilePath")).limit(1)
    .map(row => row.getString(0)).collectAsList.get(0)
  finalFiles = latestFile :: finalFiles
}
Error:
java.lang.ArrayIndexOutOfBoundsException: 0
CodePudding user response:
You're running into this error because you are trying to obtain the 0th element from an empty list. You can avoid this by using a Scala collection's headOption method along with foreach on the resulting Option.
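// collectAsList returns a java.util.List, so .asScala is needed before headOption.
// Requires: import scala.collection.JavaConverters._
spark.sparkContext.parallelize(currpathfiles)
  .map(row => (row.getString(0), row.getLong(1)))
  ...
  .map(row => row.getString(0))
  .collectAsList.asScala.headOption
  .foreach(latestFile => finalFiles = latestFile :: finalFiles)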
Also note that instead of assigning latestFile to a var, my implementation just prepends it to the finalFiles list within the Option's foreach (foreach only acts when an element exists after we call collectAsList).
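To see the difference between get(0) and headOption without a cluster, here is a minimal sketch in plain Scala. It is an illustration under assumptions, not the original poster's code: latestCsv, hour19, hour18 and the sample paths are hypothetical, and the (path, modificationTime) pairs stand in for what fs.listStatus would return for each folder.

```scala
import scala.collection.JavaConverters._
import scala.util.Try

// An empty java.util.List, similar to what collectAsList yields for an empty folder.
val emptyJavaList: java.util.List[String] = java.util.Arrays.asList[String]()

// get(0) throws on an empty list; Try captures the exception as a Failure.
val unsafe = Try(emptyJavaList.get(0))

// headOption on the converted Scala collection yields None instead of throwing.
val safe: Option[String] = emptyJavaList.asScala.headOption

// Hypothetical helper: the most recently modified path containing ".csv", if any.
def latestCsv(files: Seq[(String, Long)]): Option[String] =
  files
    .filter { case (path, _) => path.contains(".csv") }
    .sortBy { case (_, modified) => -modified } // newest first
    .headOption
    .map { case (path, _) => path }

// Made-up sample data: one folder with files, one empty folder.
val hour19 = Seq(
  ("/in/hour=19/a.csv", 100L),
  ("/in/hour=19/b.csv", 200L),
  ("/in/hour=19/_SUCCESS", 300L)
)
val hour18 = Seq.empty[(String, Long)]

// flatMap drops the None produced by the empty folder.
val finalFiles: List[String] = List(hour19, hour18).flatMap(latestCsv)
```

Because the empty folder contributes None, finalFiles ends up containing only the newest CSV from the non-empty folder, and nothing throws. Since listStatus already returns a local array, the same selection could probably be done this way without building a DataFrame per folder, though that is a design choice rather than a requirement.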