HDFS path is not valid when using SparkSubmitOperator with Airflow


# etl.py
from airflow.operators.dummy import DummyOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# `dag` is the DAG object defined elsewhere in this file
start = DummyOperator(task_id='start', dag=dag)
job1 = SparkSubmitOperator(task_id='t1', application='/home/airflow/dags/test.py',
                           name='test', conf={'spark.master': 'yarn'}, dag=dag)

start >> job1
# test.py
import os

# environment must be set before Spark is initialized
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-1.8.0-openjdk-amd64'
os.environ['SPARK_HOME'] = '/opt/spark3'
os.environ['YARN_CONF_DIR'] = '/opt/hadoop/etc/hadoop'
os.environ['HADOOP_CONF_DIR'] = '/opt/hadoop/etc/hadoop'

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").appName('test1').getOrCreate()

target_dir = "hdfs:/localhost:9000/hospital/data/test.csv"  # this path triggers the error

file = spark.read.format('csv').options(header='True').options(inferSchema='True').load(target_dir)

I put "test.csv" on hdfs://hospital/data/test.csv, and when I run the Airflow webserver I get an error:

java.lang.IllegalArgumentException: Pathname /localhost:9000/hospital/data from hdfs:/localhost:9000/hospital/data is not a valid DFS filename.

I've also tried hdfs:///localhost:9000/hospital/data, hdfs::/hospital/data, etc., but the same error always comes out.

How can I solve it?

CodePudding user response:

The pathname should be the path on the HDFS server, not the full URL.

To configure your Spark session to connect to the HDFS server:

spark = (
    SparkSession.builder.master("yarn").appName('test1')
    # builder options are set with .config(), not .set()
    .config("spark.hadoop.fs.default.name", "hdfs://localhost:9000")  # legacy key
    .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:9000")
    .getOrCreate()
)

And the path is just /hospital/data/test.csv
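
For example, a minimal end-to-end sketch of the corrected read (assuming the same NameNode address, hdfs://localhost:9000, and the file at /hospital/data/test.csv):

from pyspark.sql import SparkSession

# Point the session's default filesystem at the NameNode, then load by plain path.
spark = (
    SparkSession.builder.master("yarn").appName('test1')
    .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:9000")
    .getOrCreate()
)

# With fs.defaultFS set, no scheme or host is needed in the path.
df = spark.read.options(header='True', inferSchema='True').csv('/hospital/data/test.csv')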
