Home > Enterprise >  Unable to read file in spark program from local directory
Unable to read file in spark program from local directory

Time:08-18

I am unable to read the local csv file in spark program. I am using PyCharm IDE. Although I am able to use the position argument to read the file but not with file location. Can someone please help?

// code
    # Processing logic here...
    flightTimeCsvDF = spark.read \
        .format("csv") \
        .option("header", "true") \
        .load("data/flight*.csv")
        # .load(sys.argv[1])


\\error
Exception in thread "globPath-ForkJoinPool-1-worker-1" java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1218)
    at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1423)
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
    at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)

enter image description here

CodePudding user response:

Please use the absolute path. From the image attached, I believe using the following will help solve the issue.

.load("C:\\Users\\psultania\\Anaconda3\\envs\\04-SparkSchemaDemo\\data\\flight*.csv")

If you are using different directories for input CSVs, please change the directory definition accordingly.

CodePudding user response:

Thanks for the response. Although using absolute path with file name works but not with wildcard matching the .csv format it gives me error.

Processing logic here...

flightTimeCsvDF = spark.read \
    .format("csv") \
    .option("header", "true") \
    .schema(flightSchemaStruct) \
    .option("mode", "FAILFAST") \
    .option("dateformat", "M/d/y") \
    .load("C:\\Users\\psultania\\Office\\Python\\PySpark\\Spark-Programming-In-Python\\04-SparkSchemaDemo\\data\\flight*.csv")
    # .load("data/flight-time.csv")

// error

22/08/17 11:29:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/08/17 11:29:25 INFO SparkSchemaDemo: Starting SparkSchemaDemo program....
Exception in thread "globPath-ForkJoinPool-1-worker-1" java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1218)
    at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1423)
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
    at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)
    at org.apache.hadoop.fs.Globber.doGlob(Globber.java:291)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2124)
    at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:253)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:765)
    at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1395)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
  • Related