Spark expecting HDFS location instead of Local Dir


I am trying to run a Spark Structured Streaming job, but I am getting the error below. Please help.

from pyspark.sql import SparkSession

if __name__ == "__main__":
    print("Application started")

    spark = SparkSession \
        .builder \
        .appName("Socker streaming demo") \
        .master("local[*]")\
        .getOrCreate()

    # A streaming source is read as an unbounded table
    stream_df = spark\
        .readStream\
        .format("socket")\
        .option("host","localhost")\
        .option("port","1100")\
        .load()

    print(stream_df.isStreaming)
    stream_df.printSchema()

    write_query = stream_df \
        .writeStream\
        .format("console")\
        .start()

    # awaitTermination() blocks, keeping the streaming application running until it is stopped or fails
    write_query.awaitTermination()

    print("Application Completed")

This is the error I am getting:

22/07/31 00:13:16 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: C:\Users\786000702\AppData\Local\Temp\temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
Traceback (most recent call last):
  File "D:\PySparkProject\pySparkStream\socker_streaming.py", line 23, in <module>
    write_query = stream_df \
  File "D:\PySparkProject\venv\lib\site-packages\pyspark\sql\streaming.py", line 1202, in start
    return self._sq(self._jwrite.start())
  File "D:\PySparkProject\venv\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "D:\PySparkProject\venv\lib\site-packages\pyspark\sql\utils.py", line 111, in deco
    return f(*a, **kw)
  File "D:\PySparkProject\venv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o36.start.


: org.apache.hadoop.fs.InvalidPathException: Invalid path name Path part /C:/Users/786000702/AppData/Local/Temp/temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54 from URI hdfs://0.0.0.0:19000/C:/Users/786000702/AppData/Local/Temp/temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54 is not a valid filename.
    at org.apache.hadoop.fs.AbstractFileSystem.getUriPath(AbstractFileSystem.java:427)
    at org.apache.hadoop.fs.Hdfs.mkdir(Hdfs.java:366)
    at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:809)
    at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:805)
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
    at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:812)
    at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createCheckpointDirectory(CheckpointFileManager.scala:368)
    at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$.resolveCheckpointLocation(ResolveWriteToStream.scala:121)
    at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$$anonfun$apply$1.applyOrElse(ResolveWriteToStream.scala:42)
    at

CodePudding user response:

You can change the default filesystem that Spark falls back to by editing fs.defaultFS in the core-site.xml file, located in either your Spark or Hadoop conf directory.
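
For illustration, a minimal core-site.xml sketch that points the default filesystem back at the local disk (file:/// is the standard Hadoop URI for the local filesystem; adjust to match your setup):

<configuration>
  <property>
    <!-- filesystem used to resolve paths that carry no scheme -->
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
</configuration>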

Based on the error, you seem to have it set to hdfs://0.0.0.0:19000/ rather than some file:// URI.
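
If you would rather not edit core-site.xml, the setting can also be overridden per application. A sketch, assuming the standard spark.hadoop.* configuration passthrough and the writeStream checkpointLocation option (the D:/ checkpoint path is just a hypothetical example):

from pyspark.sql import SparkSession

# Option 1: override the default filesystem for this session only
spark = (
    SparkSession.builder
    .appName("Socket streaming demo")
    .master("local[*]")
    .config("spark.hadoop.fs.defaultFS", "file:///")
    .getOrCreate()
)

# Option 2: give the query an explicit local checkpoint directory so
# Spark never has to create a temporary one on the default filesystem
write_query = (
    stream_df.writeStream  # stream_df as defined in the question
    .format("console")
    .option("checkpointLocation", "file:///D:/PySparkProject/checkpoints")
    .start()
)

Either way, the checkpoint directory then resolves against the local filesystem instead of hdfs://0.0.0.0:19000.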
