I'm running Spark 3.3.0 on Windows 10 using Java 11. I'm not using Hadoop. Every time I run something, it gives errors like this:
java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:735)
at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:270)
at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:286)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978)
First of all, even the link https://wiki.apache.org/hadoop/WindowsProblems in the error message is broken. The updated link is apparently https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems, which basically says that Hadoop needs winutils. But I'm not using Hadoop; I'm just using Spark to process some CSV files locally.
Secondly, I want my project to build with Maven and run with pure Java, without requiring the user to install any third-party software. If this winutils stuff really is needed, it should be included in some Maven dependency.
Why is all this Hadoop/Winutils stuff needed if I'm not using Hadoop, and how do I get around it so that my project will build in Maven and run with pure Java like a Java project should?
CodePudding user response:
Spark is a replacement execution framework for MapReduce, not a "Hadoop replacement".
Spark uses the Hadoop libraries for filesystem access, including the local filesystem, as shown by org.apache.hadoop.fs.RawLocalFileSystem in your stack trace.
It also uses winutils as a sort of shim to implement POSIX-style chown/chmod operations for determining file permissions on top of Windows directories.
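If you do decide to satisfy the winutils check, the commonly cited workaround is to place winutils.exe under some directory's bin\ folder and point Hadoop at that directory before anything Spark-related initializes. A minimal sketch in Java, assuming winutils.exe sits at C:\hadoop\bin\winutils.exe (the path is an assumption, adjust to yours):

import org.apache.spark.sql.SparkSession;

public class LocalCsvApp {
    public static void main(String[] args) {
        // Assumption: winutils.exe was placed at C:\hadoop\bin\winutils.exe.
        // hadoop.home.dir must be set before the first Hadoop class loads.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");

        SparkSession spark = SparkSession.builder()
                .appName("local-csv")
                .master("local[*]")
                .getOrCreate();

        spark.read().option("header", "true").csv("data/input.csv").show();
        spark.stop();
    }
}

This doesn't meet your "no third-party installs" requirement, but it is the path the error message is steering you toward.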
As for whether you can tell Spark to use a different file system implementation than RawLocalFileSystem: yes, use a URI scheme other than the default file://.
E.g. spark.read().csv("nfs://path/file.csv")
Or use s3a, or install HDFS, GlusterFS, etc. for a distributed filesystem. After all, Spark is meant to be a distributed processing engine; if you're only handling small local files, it's not the best tool.
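As a sketch in Java, assuming a SparkSession named spark and that the hadoop-aws connector plus AWS credentials are already on the classpath and configured (bucket and path below are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Reading through a non-local scheme never touches RawLocalFileSystem,
// so the winutils permission shim is never invoked.
Dataset<Row> df = spark.read()
        .option("header", "true")
        .csv("s3a://my-bucket/path/file.csv");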
CodePudding user response:
There is a longstanding JIRA for this; for anyone running Spark standalone on a laptop, there's no need to provide those POSIX permissions:
LocalFS to support ability to disable permission get/set; remove need for winutils
This is related to HADOOP-13223, "winutils.exe is a bug nexus and should be killed with an axe". It is only people running Spark on Windows who hit this problem, and nobody is putting in the work to fix it. If someone were, I would help review/nurture it in.
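Until something like that lands, a workaround some people use (sketched below, not an official API) is to substitute a local filesystem implementation whose permission calls are no-ops. Hadoop's fs.file.impl setting, reachable through Spark's spark.hadoop.* config prefix, lets you swap the class used for the file:// scheme; NoWinutilsLocalFileSystem is a hypothetical name:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;
import org.apache.hadoop.fs.permission.FsPermission;

// Hypothetical shim: a local filesystem that skips the chmod call that
// would otherwise shell out to winutils.exe on Windows (the exact call
// failing in your stack trace: RawLocalFileSystem.setPermission -> Shell).
public class NoWinutilsLocalFileSystem extends RawLocalFileSystem {
    @Override
    public void setPermission(Path p, FsPermission permission) throws IOException {
        // Deliberately a no-op: POSIX permissions are meaningless for a
        // single-user local run on Windows.
    }
}

Registered when building the session:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .master("local[*]")
        // spark.hadoop.* entries are forwarded into the Hadoop Configuration.
        .config("spark.hadoop.fs.file.impl", NoWinutilsLocalFileSystem.class.getName())
        .getOrCreate();

Note the trade-off: replacing the default LocalFileSystem with a raw subclass drops its checksum layer, which is usually acceptable for throwaway local runs.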