We are running a Spark job against our Kubernetes cluster and trying to log the model to MLflow. We are on Spark 3.2.1 and MLflow 1.26.1, and we use the following jars to communicate with S3: hadoop-aws-3.2.2.jar and aws-java-sdk-bundle-1.11.375.jar.
Our spark-submit job is configured with the following parameters:
--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.fast.upload=true \
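For completeness, the same S3A settings can also be applied when the SparkSession is built in code. This is only a sketch of that alternative; the app name is a placeholder and the values simply mirror the spark-submit flags above:

# Sketch: the S3A flags from the spark-submit command, applied on the SparkSession builder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mlflow-s3-logging")  # placeholder name
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .getOrCreate()
)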
When we try to save our Spark model with mlflow.spark.log_model(), we get the following exception:
22/06/24 13:27:21 ERROR Instrumentation: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.ml.util.FileSystemOverwrite.handleOverwrite(ReadWrite.scala:673)
at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:167)
at org.apache.spark.ml.PipelineModel$PipelineModelWriter.super$save(Pipeline.scala:344)
at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$4(Pipeline.scala:344)
at org.apache.spark.ml.MLEvents.withSaveInstanceEvent(events.scala:174)
at org.apache.spark.ml.MLEvents.withSaveInstanceEvent$(events.scala:169)
at org.apache.spark.ml.util.Instrumentation.withSaveInstanceEvent(Instrumentation.scala:42)
at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3(Pipeline.scala:344)
at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3$adapted(Pipeline.scala:344)
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
at org.apache.spark.ml.PipelineModel$PipelineModelWriter.save(Pipeline.scala:344)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Unknown Source)
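For reference, the logging call looks roughly like the following; the pipeline and data below are toy placeholders, not our actual job:

# Sketch of the failing logging step, with a toy pipeline standing in for the real model.
import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0, 0.0), (2.0, 1.0, 1.0)], ["f1", "f2", "label"])
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)

with mlflow.start_run():
    # This is the call that raises the UnsupportedFileSystemException above.
    mlflow.spark.log_model(model, artifact_path="model")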
We tried to start our MLflow server with --default-artifact-root set to s3a://..., but when we run our Spark job and call mlflow.get_artifact_uri() (which is also used to construct the upload URI in mlflow.spark.log_model()), the returned URI uses the plain s3:// scheme, which probably causes the exception mentioned above.
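To illustrate, checking the artifact URI inside an active run looks like this; the printed bucket and IDs are placeholders for what we actually see:

import mlflow

with mlflow.start_run():
    # In our case this prints an s3:// URI, e.g. s3://<bucket>/<experiment_id>/<run_id>/artifacts
    print(mlflow.get_artifact_uri())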
Since Hadoop dropped support for the s3:// filesystem, does anyone know how to log Spark models to S3 using MLflow?
Cheers
CodePudding user response:
In addition to the spark.hadoop.fs.s3a.impl config parameter, you can try also setting spark.hadoop.fs.s3.impl to org.apache.hadoop.fs.s3a.S3AFileSystem, so that URIs with the plain s3:// scheme are routed to the S3A filesystem as well.
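For the spark-submit setup from the question, that would be one extra flag along these lines:

--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \

or, if the session is built in code, an extra .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") on the SparkSession builder.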