I have the below code running on a cluster:
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  //----------Write Logic Here--------------------------
  // Read the CSV file
  val df = spark.read.format("csv").load("books.csv") // Here I want to accept a parameter
  df.show()
  spark.stop()
}
I want to pass different files to spark.read.format using the spark-submit command. The files are on my Linux box. I used this:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties,${csv_file} \
abc.jar
However, the program just looks for the path relative to the root folder of the HDFS cluster and throws a file-not-found exception. Can anyone please help me use the file from the file path I mention? I want my Spark program to read the file from the path I specify, not from the root.
I tried:
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  // Take the file path from the first program argument
  val filepath = args(0)
  //----------Write Logic Here--------------------------
  // Read the CSV file
  val df = spark.read.format("csv").load(filepath)
  df.show()
  spark.stop()
}
I used the below to submit, which doesn't work:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties \
abc.jar ${csv_file}
But the program is not picking up the file. Can anyone please help?
CodePudding user response:
The local file URL format should be:
csv_file="file:///usr/usr1/Test.csv"
Note that the local files must also be accessible at the same path on all worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
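For example, the driver-side read with a fully qualified local path would look like this (a minimal sketch based on the path from the question):
// Fully qualify the scheme so Spark does not resolve the path against
// the default (HDFS) file system. The file must exist at this same path
// on the driver node and on every worker node.
val df = spark.read.format("csv").load("file:///usr/usr1/Test.csv")
df.show()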
CodePudding user response:
I don't have a cluster at hand right now, so I cannot test this. However:
- You submit the code to YARN, so it will deploy the Spark driver on one of the cluster's nodes, but you don't know which one.
- When reading a path that starts with "file://" (or has no scheme at all), Spark looks for the file on the local file system of the node the driver is running on.
- As you've seen, spark-submit --files copies the file into the working directory of the Spark driver (i.e. on whichever node runs the driver). That path is somewhat arbitrary, and you should not try to infer it.
But it might work to pass just the file name as the argument to spark.read and let the Spark driver look for it in its working directory (I didn't check):
spark-submit \
  ... \
  --files ...,/path/to/your/file.csv \
  abc.jar file.csv
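Another option I haven't verified: resolve the staged file explicitly with org.apache.spark.SparkFiles, which returns the local path of files distributed through --files (or SparkContext.addFile):
import org.apache.spark.SparkFiles

// args(0) holds just the file name, e.g. "file.csv"
val localPath = SparkFiles.get(args(0)) // local absolute path of the staged copy
val df = spark.read.format("csv").load("file://" + localPath)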
=> The proper/standard way to do it: first copy your file(s) to HDFS, or to another distributed file system the Spark cluster has access to. Then you can give the Spark app the HDFS file path to use. Something like this (again, I didn't test it):
hdfs dfs -put /path/to/your/file.csv /user/your/data
spark-submit ... abc.jar hdfs:///user/your/data/file.csv
For info, if you don't know: to use the hdfs command, you need the HDFS client installed on your machine (the actual hdfs command), with suitable configuration pointing at the HDFS cluster. There is also usually some security configuration to set up on the cluster side so the client can communicate with it. But that is another issue, and it depends on where HDFS is running (local, AWS, ...).
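If you cannot install the hdfs client, here is a rough, untested sketch of doing the same copy from your Spark code with the Hadoop FileSystem API (the paths are illustrative; this must run somewhere that can see the local file, e.g. in client mode or from an edge node):
import org.apache.hadoop.fs.{FileSystem, Path}

// Copy the local file to HDFS using the Hadoop configuration Spark already carries,
// then read it back through its HDFS path.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.copyFromLocalFile(new Path("/path/to/your/file.csv"), new Path("/user/your/data/file.csv"))
val df = spark.read.format("csv").load("hdfs:///user/your/data/file.csv")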
CodePudding user response:
Replace ${csv_file} at the end of your spark-submit command with `basename ${csv_file}`:
spark2-submit \
... \
--files myprop.properties,${csv_file} \
abc.jar `basename ${csv_file}`
basename strips the directory part from the full path, leaving only the file name:
$ basename /usr/usr1/foo.csv
foo.csv
That way Spark will copy the file to the staging directory and the driver program should be able to access it by its relative path. If the cluster is configured to stage on HDFS, the executors will also have access to the file.
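The driver code from your second attempt can then stay exactly as it is, receiving only the bare name (a sketch, assuming the staged copy lands in the application's working directory):
// args(0) now holds just "Test.csv" (the basename), not the full local path;
// the relative path resolves inside the container's working directory.
val filepath = args(0)
val df = spark.read.format("csv").load(filepath)
df.show()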