I have the below code running on a cluster:
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  //----------Write Logic Here--------------------------
  // Read the CSV file
  val df = spark.read.format("csv").load("books.csv") // Here I want to accept a parameter
  df.show()
  spark.stop()
}
I want to pass different files to spark.read.format using the spark-submit command. The files are on my Linux box. I used this:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties,${csv_file} \
abc.jar
However, the program just looks for the path relative to the root folder of the HDFS cluster and throws a file-not-found exception. Can anyone please help me use the file from the file path I mention? I want my Spark program to read the file from the path I specify, not from the root.
I tried:
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  // Take the file path from the first program argument
  val filepath = args(0)
  //----------Write Logic Here--------------------------
  // Read the CSV file
  val df = spark.read.format("csv").load(filepath)
  df.show()
  spark.stop()
}
I used the below to submit, which doesn't work:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties \
abc.jar ${csv_file}
But the program is not picking up the file. Can anyone please help?
CodePudding user response:
The local file URL format should be:
csv_file="file:///usr/usr1/Test.csv"
Note that the local files must also be accessible at the same path on all worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
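For example, the driver-side read with a fully qualified local path would look like this (a minimal sketch based on the path from the question):
// Fully qualify the scheme so Spark does not resolve the path against
// the default (HDFS) file system. The file must exist at this same path
// on the driver node and on every worker node.
val df = spark.read.format("csv").load("file:///usr/usr1/Test.csv")
df.show()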
CodePudding user response:
I don't have a cluster at hand right now, so I cannot test this. However:
- You submit the code to YARN, so it will deploy the Spark driver on one of the cluster's nodes, but you don't know which one.
- When reading a path that starts with "file://" (or has no scheme at all), Spark looks for the file on the local file system of the node the driver is running on.
- As you've seen, spark-submit --files copies the file into the working directory of the Spark driver (i.e. on whichever node runs the driver). That path is somewhat arbitrary, and you should not try to infer it.
But it might work to pass just the file name as the argument to spark.read and let the Spark driver look for it in its working directory (I didn't check):
spark-submit \
  ... \
  --files ...,/path/to/your/file.csv \
  abc.jar file.csv
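Another option I haven't verified: resolve the staged file explicitly with org.apache.spark.SparkFiles, which returns the local path of files distributed through --files (or SparkContext.addFile):
import org.apache.spark.SparkFiles

// args(0) holds just the file name, e.g. "file.csv"
val localPath = SparkFiles.get(args(0)) // local absolute path of the staged copy
val df = spark.read.format("csv").load("file://" + localPath)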
=> The proper/standard way to do it: first copy your file(s) to HDFS, or to another distributed file system the Spark cluster has access to. Then you can give the Spark app the HDFS file path to use. Something like this (again, I didn't test it):
hdfs dfs -put /path/to/your/file.csv /user/your/data
spark-submit ... abc.jar hdfs:///user/your/data/file.csv
For info, if you don't know: to use the hdfs command, you need the HDFS client installed on your machine (the actual hdfs command), with suitable configuration pointing at the HDFS cluster. There is also usually some security configuration to set up on the cluster side so the client can communicate with it. But that is another issue, and it depends on where HDFS is running (local, AWS, ...).
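If you cannot install the hdfs client, here is a rough, untested sketch of doing the same copy from your Spark code with the Hadoop FileSystem API (the paths are illustrative; this must run somewhere that can see the local file, e.g. in client mode or from an edge node):
import org.apache.hadoop.fs.{FileSystem, Path}

// Copy the local file to HDFS using the Hadoop configuration Spark already carries,
// then read it back through its HDFS path.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.copyFromLocalFile(new Path("/path/to/your/file.csv"), new Path("/user/your/data/file.csv"))
val df = spark.read.format("csv").load("hdfs:///user/your/data/file.csv")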
CodePudding user response:
Replace ${csv_file} at the end of your spark-submit command with `basename ${csv_file}`:
spark2-submit \
... \
--files myprop.properties,${csv_file} \
abc.jar `basename ${csv_file}`
basename strips the directory part from the full path, leaving only the file name:
$ basename /usr/usr1/foo.csv
foo.csv
That way Spark will copy the file to the staging directory and the driver program should be able to access it by its relative path. If the cluster is configured to stage on HDFS, the executors will also have access to the file.
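The driver code from your second attempt can then stay exactly as it is, receiving only the bare name (a sketch, assuming the staged copy lands in the application's working directory):
// args(0) now holds just "Test.csv" (the basename), not the full local path;
// the relative path resolves inside the container's working directory.
val filepath = args(0)
val df = spark.read.format("csv").load(filepath)
df.show()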